Hello everyone, good late morning. My name is Anna Lapuk and I'm from the Vancouver Prostate Centre. I've been in bioinformatics for about 10 years, and during these years I've worked on different cancer types including breast and ovarian cancers, glioblastoma, and now prostate cancer. My major research focus is using high-resolution technologies such as microarrays and next-generation sequencing to develop new biomarkers and therapeutic targets and to bring them into translational applications, to bring them to the clinic.

Today I will be talking about the integration of clinical data with the high-resolution, high-dimensional data that you have heard about during the previous days of the course. I have to say that I am by no means a biostatistician; I am a bioinformatician. I never was one, and I don't have even the vaguest intention of becoming one. It's a very specialized area, and it's for the high professionals to represent that area. And yet the material that I will be teaching today depends heavily on basic statistical questions and strategies, and my goal is to deliver the main meaning of those statistical strategies to you so that you understand them and are able to read the medical literature and understand what is done there from the statistical point of view. You will also have a hands-on lab, during which you will learn how to use basic statistical approaches to integrate clinical data with high-dimensional data. So when you walk out of this room you will have a basic understanding of these statistical concepts, which is quite important; you will have some experience; and you will be able to ask educated questions of biostatisticians if you are to perform any kind of research that incorporates these statistical approaches, which I highly recommend you do.

I will start with a very light, very short introduction. I know that you must be pretty tired after all these days. Then I will gradually pick up speed and go over the statistical material and the rest of the material. I plan to spend about two to two and a half hours on lectures, and then we'll have about two hours for the lab, which will be important.

So let's start with an overview of this part of the module. I will start with the short introduction, as I said. In the second part, I will describe the basic concepts of how to go from molecular profiles to classification and prediction problems, and I will also introduce you to a more specialized area of classification studies, namely drug response studies.

First of all, let's recall what biomarkers and therapeutic targets are. If you heard this during the course, that's great; we'll just recall the definitions. A biomarker is a biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease. A biomarker may be used to see or predict how well the body responds to a treatment for a disease or condition; it is also called a molecular marker or signature molecule. What is a therapeutic target? It is also a biological molecule, an enzyme, receptor, or other protein, that can be modified by an external stimulus. The implication is that the molecule is hit by a signal and its behavior is thereby changed. So these are the definitions that we all need to know.
Throughout this course you have heard a lot about different types of biomarkers, which include point mutations, genome copy number alterations, aberrant gene expression, and aberrant transcript generation such as fusions. So what is the purpose of developing them, and what is their ultimate use? We can use biomarkers to classify tumors better, which may improve diagnosis. We can monitor change in certain markers and see how the disease progresses. We can see how the patient will do in the future by predicting disease recurrence, for instance. And ultimately we can select a proper therapy for a patient and, if possible, predict the response to that therapy.

So indeed, the ultimate goal is to select the most appropriate therapy on the basis of the particular molecular characteristics of a tumor. However, as we know, the molecular determinants of complex diseases such as cancer may differ from patient to patient. Therefore, sciences such as genomics and genetics that search for disease-causing genes should always go hand in hand with pharmacogenetic or pharmacogenomic studies that search for the determinants of response to treatment.

So what is pharmacogenetics? Pharmacogenetics is the study of the role of inheritance in the variation of drug response phenotypes. These phenotypes can range from life-threatening adverse effects of a drug at one end of the spectrum to an equally serious lack of therapeutic efficacy at the other. Over the past half century, pharmacogenetics, like any other medical research field, has evolved from a discipline focusing on monogenic traits to become pharmacogenomics, with a genome-wide perspective.

So what is the origin of pharmacogenetics? The earliest experimentally validated examples of an effect of inheritance on drug response were first reported half a century ago, actually even more than that, in the 1950s and 1960s. Back then, researchers noticed that there were large differences in response to standard doses given to different patients. In this table I'm showing you two early examples. One is a short-acting muscle relaxant that was observed to be hydrolyzed differently in different subjects, and this variation was inherited. Later it was discovered that production of a dysfunctional BCHE enzyme was responsible for this differential response, and these patients were at significantly different risk of prolonged muscle paralysis, which is a very serious, almost life-threatening side effect. At the same time, for another drug, isoniazid, one of the first anti-tuberculosis drugs, it was noticed that patients who were given standard doses had totally different plasma concentrations and were at significantly different risk of developing adverse reactions. Later it was discovered that changes in the function of the enzyme metabolizing the drug, NAT2, were responsible for these differences between patients.

Just one more example. This is actually a pharmacogenetic icon, so to speak. You probably know the cytochrome P450 family of genes. It's a pretty large family of proteins involved in the synthesis and metabolism of various molecules and chemicals within cells, including drugs. CYP2D6 is a member of this family and is a really outstanding example of that era.
It biotransforms a wide range of different drugs; you can see how wide a spectrum of drug types it biotransforms. It was later found that there is a great amount of genetic variation in this particular gene in the human population, variation that can result in differential response to these many drugs. Those genetic variations included non-synonymous coding SNPs associated with decreased activity; in some instances it was gene deletion, in others gene duplication up to many copies. These alterations were linked to different rates of target drug metabolism. Here what you see is a plot of the frequency distribution of the ratio of the drug to its metabolite in a population of patients, which included poor metabolizers here, extensive metabolizers, which were the majority of the population, and ultrarapid metabolizers. These different people had different functional status of CYP2D6, which resulted in different velocities of metabolism of the target drug by this enzyme. Therefore these subpopulations of patients may suffer from adverse reactions to particular drugs. For instance, poor metabolizers may suffer from excessive drug effect with drugs that are inactivated by this enzyme, or alternatively they can suffer from inadequate drug effect with drugs that are activated by this enzyme. In the case of ultrarapid metabolizers it's the opposite. So these are classical examples of pharmacogenetics, which was focusing on monogenic traits.

Now, cancer is truly the ultimate example of a polygenic disease, with many genes involved. The complexity of the molecular biology of cancer stems from the fact that, in addition to its polygenic nature, mutations in the broad sense of the word can take place at all levels: at the genome level, as point mutations; at the transcriptome level, through the generation of different splice variants; at the post-transcriptional level; as epigenetic changes; as well as at the protein level. Modern technologies do enable screening for all types of these mutations, each of which can represent an individual type of biomarker.

I just want to show you this slide. This is a breast cancer amplicon, just the structure of this amplicon, and it shows you how complex the structure of a cancer genome may be. You see that this amplicon is composed of many parts of the human genome that are concatenated together, and these rearrangements can give rise to fusion transcripts and then fusion proteins, etc. It emphasizes one of the greatest challenges in cancer genomics: to reconstruct the cancer genome and transcriptome and to be able to use this information to generate different types of biomarkers.

Question: what is an amplicon? An amplicon is a part of the genome that gets amplified, or increased in copy number. Originally it was thought that a particular intact region of the genome was amplified to increase the copy number of some oncogene, for example. But when people started looking into the structure of those amplicons, it became clear that these are actually not intact regions of the genome but rather regions that are reshuffled: they are composed of pieces from different, distinct regions of the genome. Yet they are still referred to as amplicons.

So, gene expression profiles have been widely used for the development of expression-based biomarkers.
Here you see the expression patterns of breast cancers, which you have probably seen earlier during this course, with a number of distinct subtypes: basal, ERBB2-positive, and luminal cancers. Breast cancer is a very good example of the presence of distinct histological subtypes that are characterized by distinct gene expression profiles. And certainly, gene expression profiles are widely used to develop biomarkers that discriminate subtypes of cancer, which in turn have distinct clinical outcomes. Here you see the Kaplan-Meier curves for the distinct histological subtypes of breast cancer, luminal A, luminal B, basal, and ERBB2, which have completely different survival experiences; we will talk about Kaplan-Meier curves in greater detail later.

What I also want to point out is that many researchers have observed that integrating different data types can increase our power to develop robust biomarkers. In this case you can see that the luminal A subtype, which was defined by gene expression profiles, was split into two subsets: one with a large number of high-level copy gains in the genome, and another without them. And clearly these two subsets of luminal A also had quite significantly different survival experiences.

So, molecular profiling technologies now enable very detailed characterization of cancer genomes and transcriptomes, and this knowledge can lead to the development of clinical tests and potentially guide patient management. I want to bring up one very bright example, one of the earlier successes on this front, which is the HER2 (ERBB2) and trastuzumab story. The ERBB2 or HER2 receptor is a cell-surface receptor tyrosine kinase, a member of the ERBB family. Its overexpression results in activation of intracellular signaling through a number of pathways, promoting cell division and cell growth and inhibiting apoptosis. In 1987, Dennis Slamon reported in Science that HER2 was amplified, up to about 20 copies, in about 30% of breast cancer patients, and this amplification was associated with shorter survival and relapse times. In 1990, Genentech developed a humanized monoclonal antibody against the HER2 receptor, and it was shown to be effective in only a small percentage of patients, even those bearing the HER2 amplification. At the same time, methods to detect HER2 amplification with FISH and protein levels with immunohistochemistry were under development, and in 1992 trastuzumab, the targeted antibody against HER2, went into clinical trials. Nowadays the standard of care includes a test for HER2 expression status together with Herceptin in combination with other drugs.

On this slide I want to show you the efficacy of Herceptin alone, in yellow, and in combination with a number of other drugs. As I mentioned, the response rate for Herceptin alone was relatively low; for HER2-positive patients, as you see here, it was lower than 30%. But the response rate increased significantly for combination therapy, reaching around 60-70%, and that was a huge success.

So there is a huge number of studies devoted to the detection or development of cancer biomarkers, and this number is actually daunting; you can find thousands and thousands of publications in PubMed. But I think it's good to ask ourselves: do we really manage cancer well, and how is this knowledge affecting the efficacy of cancer management?
These are statistics that one can find on the Canadian Cancer Society website. I think they are very useful statistics that anyone working in cancer research should review occasionally; they are actually quite curious. On this slide, I give you incidence and mortality rates for men for selected cancer types, and I want to draw your attention to a couple of them: prostate cancer and lung cancer.

For prostate cancer, with the advent of PSA screening in 1993, more early-stage cancers could be detected, which resulted in this spike in the incidence rate. But it was rapidly realized that PSA is not always associated with malignant disease, and therefore we see the decline in the incidence rate over the following couple of years. But if you look at the mortality rate for prostate cancer, it is declining just a little bit, not much. The reason is that we're still not doing a great job of distinguishing aggressive cancers from indolent cancers.

Lung cancer is another example. Both incidence and mortality rates have been going down over the last couple of decades, as you can see here. This is most likely attributable to the recognition of risk factors such as smoking, which raised awareness within the male population, who simply stopped smoking.

So what's happening in women? The introduction of mammography was a milestone for breast cancer management, providing the means to detect cancers at an earlier stage, at which the disease is far more curable. And indeed the mortality rate went down a little, but we still cannot say that breast cancer is eradicated. Again, the reason is that we lack biomarkers that describe aggressive disease and can distinguish aggressive disease from indolent disease. The lung cancer statistics are actually quite curious: unlike in men, in women the lung cancer incidence and mortality rates are steadily increasing, as you see here. This can be explained by the rise in smoking among women from the 1980s onward. Despite all the awareness of smoking as a risk factor, women still choose to smoke.

This is the summary of all the rates, in men here and in women on the right, with mortality at the bottom and incidence in the top part of the slide. When we correct for the aging population and population growth, we can see that the incidence rate is growing a little here and perhaps declining a little there, and the mortality rate is declining a bit, but these are really minor changes. Overall, we can say that the incidence and mortality rates stay more or less constant. So indeed, we still need novel prognostic and therapeutic biomarkers, which we'll talk about in the next part of my module.

So, how can we go from molecular profiles to classification and prediction, and what are these problems basically about? The main statistical problems in classification in general, and in tumor classification in particular, are the following: identification of new or unknown classes using molecular profiles, which is called unsupervised learning; classification into known classes, which is called discriminant analysis or supervised learning; and identification of markers that characterize the classes, which is called variable selection.
Question: on identifying new classes, we had one talk where they defined 10 different subgroups and so on, and these are all statistically significant differences, but how many groups do we want? If we have a hundred patients, do we want a hundred subgroups, or two subgroups, or what?

Well, that's a tricky question. It has a biological aspect and it also has a computational aspect. From the computational point of view, you can choose how many classes you would like to see. For instance, you say: I expect within this cohort to have three, four, at most five subgroups, and I don't really want to subdivide the patient cohort into more subgroups than that. Now, from the biological point of view, this is the question you want to answer, right? You would like to find out how many subtypes of cancer you have in your cohort of patients, and by applying different approaches, different cut-offs, different trade-offs, you can find out what the most probable number of subgroups of patients in your cohort is. So it's not going to be 100 patients with each patient representing an individual group; it's going to be some limited number of subgroups.

Question: are we talking about groups that would be treated distinctly differently?

Well, if they bear certain mutations, in the broad sense of the word, that indicate that a certain therapy will work for those patients, then yes. But here we're talking about classification, so let's abstract from the therapy and treatment of patients; it is just the general classification problem that we're talking about. In unsupervised learning, the goal is to find those novel classes based on the molecular profiles and then to identify the molecular abnormalities that are characteristic of each of them, to ask whether those subgroups of patients have different outcomes, and to do a number of downstream analyses with them.

But today I will be talking about the second scenario, classification into known classes. Sohrab and others have, I think, introduced you to the identification of new classes through clustering algorithms and so on. Today we'll be talking about discrimination and prediction.

So what are discrimination and prediction, and how are they done? We start with expression profiles of two groups of patients. Say one group has poor response to a drug and another group has good response, or it can be a set of cancer samples versus a set of normal samples. Using these gene expression profiles, we build a classifier, which is a set of genes whose expression pattern describes the two groups best. The classifier can then be applied to a new patient, with their molecular profile, to answer the question: what is the most likely response for that patient, based on the behavior of the classifier in that new patient? The first part is called discrimination and the second part is called prediction.

For building a classifier, we need a classification rule, which is composed of two components: a classification method and feature selection.
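To make the two steps concrete, here is a minimal sketch, not from the lecture itself: the data are simulated stand-ins, and linear discriminant analysis stands in for whatever classification method one might actually choose. "Discrimination" is the fit on two known groups; "prediction" is applying the fitted rule to a new patient's profile.

```python
# Minimal sketch of discrimination (fit) and prediction (apply), on simulated data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_genes = 20
# 20 poor responders and 20 good responders; genes 0-4 differ in mean expression.
poor = rng.normal(0.0, 1.0, size=(20, n_genes))
good = rng.normal(0.0, 1.0, size=(20, n_genes))
good[:, :5] += 1.5                            # the informative genes

X = np.vstack([poor, good])
y = np.array([0] * 20 + [1] * 20)             # 0 = poor response, 1 = good response

clf = LinearDiscriminantAnalysis().fit(X, y)  # discrimination step

new_patient = rng.normal(0.0, 1.0, size=(1, n_genes))
print(clf.predict(new_patient))               # prediction step: most likely class
print(clf.predict_proba(new_patient))         # class probabilities for that patient
```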
There is a great number of different classification algorithms out there, probably some hundred of them. These are the most commonly used ones, and instead of describing them in detail, I will refer you to a very nice paper written by quite prominent statisticians at UC Berkeley; they not only describe these methods in detail but also provide a comparison of them. What I will say is that these are relatively simple, computationally light methods, and among the popular ones are linear discriminant analysis and nearest-neighbor classification. The bottom two, neural networks and support vector machines, are also quite robust and nice, but they are much more computationally intensive and require more bioinformatic training. So this is the paper I would like to refer you to: it compares different methods using several publicly available data sets, and the conclusion was that k-NN and linear discriminant analysis performed really nicely; they had the smallest error rates and are pretty good methods for general purposes.

Now, moving on to feature selection, which is the other important component of the classification rule. There are three main classes of feature selection methods: filter methods, wrapper methods, and embedded methods.

Filter methods are probably the most straightforward and easy ones. They involve filtering out the genes, if we're dealing with gene expression, that are basically irrelevant to the studied outcomes. Features, or genes, are scored independently, ranked according to a certain statistic, and then the top N are selected for use in the classifier. Those scores may be something like the t statistic of differential expression between outcome groups, the between-group to within-group variation ratio, or a p-value. There are a number of problems with this method. One is redundancy: features are selected independently, without assessing whether they contribute new information. Another pitfall is that interactions between features are not considered. And another problem is that classification itself does not take part in the feature selection, whereas it should. Yet this is a quite straightforward method, and lots of people use it pretty widely.

Another type of feature selection method is the wrapper method, which is an iterative approach: the classification performance of many subsets of features is evaluated, and the best-performing subset is selected. The problem with this approach is that it is computationally intensive and very easy to overfit. The last class of methods is embedded methods, where classification and feature selection are performed simultaneously, which improves classification accuracy; methods such as k-NN and LDA were given as examples of methods that can work this way.

This slide illustrates the filter method of feature selection. This is a correlation matrix of some 3,500 genes from a patient cohort comprising three subsets of patients, and as you see, the correlation is actually quite poor. But if we filter out genes that are not differentially expressed between the groups, just as one might expect, we see higher correlation, and this is good: we have gotten rid of noisy genes, which are basically uninformative with regard to our study groups or outcomes, and they will not interfere with the classification procedure.
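Here is a small sketch of the filter method just described, again on simulated data rather than the cohort on the slide: score each gene independently with a two-sample t statistic between the outcome groups, rank by the score, and keep the top N.

```python
# Filter-style feature selection: independent per-gene scores, rank, take top N.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 1000))          # 40 samples x 1000 genes
y = np.array([0] * 20 + [1] * 20)        # two outcome groups
X[y == 1, :10] += 2.0                    # 10 genes are truly differential

t, p = ttest_ind(X[y == 0], X[y == 1], axis=0)   # per-gene t statistics
top_n = 10
keep = np.argsort(np.abs(t))[::-1][:top_n]       # rank by |t|, select top N
X_filtered = X[:, keep]
print(sorted(keep))                      # mostly recovers genes 0-9

# Caveats from the text: genes are scored one at a time, so redundant genes are
# kept, gene-gene interactions are ignored, and the classifier plays no role.
```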
So, in summary, what is the goal of feature selection? To improve classification performance by removing genes that are not associated with outcome. It may also provide useful insights into the biology of a disease and can lead to diagnostic tests, such as the breast cancer chip. So this is how it's normally done.

Now, moving on to the general strategy for building a classifier. This slide shows the basic components of the procedure. We start with a learning set of samples, sometimes referred to as the training set, and we build a classifier using the molecular profiles obtained for the samples in that set. Then we evaluate the performance of this classifier, how well it can discriminate the outcomes, on a totally independent set, which is called the test set. The error rates of classification on the learning set and on the test set contribute to the overall performance assessment. This is the ideal case, when we have two independent data sets, two independent cohorts of samples, for training and for testing.

Unfortunately, independent sets of samples are rarely available, and almost never at the time of the current study. What happens more frequently is that a researcher is presented with a single set of samples, and this set is split in two: one part serves the purpose of learning, or training, the classifier, and the other is used for testing it. In this case the test set is not purely independent, but this is the strategy that is normally taken, for lack of any better selection of sample sets; the testing of the classifier on the test set and on the training set again contributes to the performance assessment. The pitfall in this strategy is that we effectively reduce the sample sizes for both the learning and the test sets, which is sad, but that's what needs to be done.

An essential part of the classification procedure is performance assessment, and it consists of the following components. We would like to answer questions about the classifier. How accurate is it? For that purpose we use the confusion matrix and metrics such as accuracy, sensitivity, specificity, and so on. Another question one may ask is how well the classifier worked on the learning set; that is called the resubstitution error rate. Since it comes from evaluating the classifier on the very data set it was trained on, this is a very optimistic estimate. A more realistic estimate is how well the classifier worked on the test set, the error rate we get when we test the classifier on the test set. Other methods to evaluate a classifier include cross-validation, and certainly, when we build a classifier, we would like to compare it with other classification methods, and for that purpose we use ROC curves, for instance.
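A minimal sketch of that single-cohort strategy, with simulated data and an arbitrary choice of k-NN as the method: split one sample set into a learning set and a held-out test set, then report both error rates side by side.

```python
# Split one cohort, train on the learning set, report resubstitution vs test error.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 30))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

X_learn, X_test, y_learn, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_learn, y_learn)
resub_error = 1 - clf.score(X_learn, y_learn)   # optimistic: same data as training
test_error = 1 - clf.score(X_test, y_test)      # more realistic estimate
print(f"resubstitution error {resub_error:.2f}, test-set error {test_error:.2f}")
```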
Question: if you only focus on that, you're essentially asking how well the classifier separates the groups; you're not really saying anything about what it means in a practical setting. If you have two groups of patients, say blue patients and green patients, and the separation turns out to be really good, but unfortunately the green and blue patients don't live longer and don't respond differently to treatment, shouldn't there be some element of testing what the grouping means?

You're probably referring to the other aspects of this analysis. Certainly, when you look for subgroups of patients, you would like to see whether they are clinically relevant subgroups. Let me give you this scenario, which is probably what you're trying to get at. You have a cohort of 100 patients and you do molecular profiling. You know almost nothing about these patients, except that it's breast cancer, or prostate cancer. You do unsupervised clustering and you find two very strong expression-based subgroups of patients. So what does it mean? Are they clinically relevant subgroups? What you do next is take these subgroups of patients and explore their survival experience, through Kaplan-Meier curves, for instance. You see that these two subgroups have significantly different Kaplan-Meier curves, and you conclude: yes, I have discovered subgroups of patients that are associated with poor or good survival. What we are talking about now are the details of how to perform discrimination and prediction once you have already found your subgroups and convinced yourself that they are clinically relevant, or interesting biologically or otherwise. This is the detailed process of how to build a classifier that will describe those two groups best and will be applicable to any new patient who comes along and whose survival, for instance, we would like to predict.

Okay. So now, this is a confusion matrix, a very nice table. It is basically a visualization tool that is typically used in supervised learning. Columns represent the instances in the actual classes, and rows represent the instances in the predicted classes. For instance, say we are developing a new biomarker to predict something, and we compare it with a gold-standard approach such as histopathology, where samples are diagnosed either as cancer (positive) or normal (negative). We would like to see how well our biomarker performs in this particular task of discriminating cancer versus normal. A confusion matrix displays the number of correct and incorrect predictions made by our biomarker model compared with the actual classifications. So by histopathology we have this number of cancers and this number of normals, and with our biomarker we predict this many of the true cancers as being cancers indeed, while the rest fall into the other cells, the false positives, false negatives, and true negatives, and so on.

From the confusion matrix we can derive four important statistics, which come from fractions of these cells over the column totals and the row totals. From the columns we can derive sensitivity and specificity, and out of these we can derive the accuracy of the biomarker or classifier. From the rows we can derive the positive predictive value and the negative predictive value, which are even more important metrics than sensitivity and specificity; I will explain why. Sensitivity is the proportion of true positives that are correctly predicted by our test, and specificity is the proportion of true negatives that are correctly predicted by our test. Then there are the predictive values: the positive predictive value is the proportion of patients with a positive test who are correctly diagnosed, and the negative predictive value is the analogous proportion for patients with a negative test.
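Here are the four statistics written out as a tiny sketch; the counts are made up for illustration, but the arithmetic is exactly what you would apply to the cell counts of any confusion matrix, including the endoscopy example that follows.

```python
# The four confusion-matrix statistics, with made-up counts.
TP, FN, FP, TN = 90, 10, 20, 80   # true/false positives and negatives

sensitivity = TP / (TP + FN)      # of the true positives, fraction called positive
specificity = TN / (TN + FP)      # of the true negatives, fraction called negative
accuracy = (TP + TN) / (TP + FN + FP + TN)
ppv = TP / (TP + FP)              # of the positive calls, fraction truly positive
npv = TN / (TN + FN)              # of the negative calls, fraction truly negative

print(sensitivity, specificity, accuracy, ppv, npv)
```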
So I won't confuse you any longer; let's just see in this example how these are calculated. You see that we are comparing endoscopy with some new test. Here we have the true positives, false negatives, false positives, and true negatives, and from these it is very easy to derive sensitivity and specificity, and then PPV and NPV. PPV and NPV are most commonly used by clinicians because, unlike sensitivity and specificity, they take into account the prevalence of the disease, which is basically the frequency of the disease in the population, and that is quite important. These are the metrics you will often find in the medical literature describing classification studies or new biomarkers.

Question: do you also look at what we call precision? And recall?

Sorry? Precision and recall? What are those?

Question: they are just like specificity and accuracy; precision is true positives divided by true positives plus false positives. In text mining people usually use them, because with those you get really high numbers.

Oh, I see. Well, what I can tell you is that if you are a bioinformatician and you derive some new biomarker and present it at a clinical meeting showing really nice sensitivity and specificity, almost 100%, no one will believe you. It's PPV and NPV that matter most in clinical research. Accuracy is also used quite often, actually not just sometimes, to demonstrate the performance of markers, but in the end, PPV and NPV are the most important thing.

Another metric I will briefly summarize, which you can also find in the medical literature, is the misclassification error rate. That's basically the complement of accuracy, one minus accuracy. As I mentioned, we can have a resubstitution error rate or a test-set error rate, and again, the resubstitution error rate is very optimistic because it is derived from the same data set the classifier was trained on.

A very important internal validation of a classifier is cross-validation, which is performed on the training set. Virtually every paper devoted to new classification algorithms or new biomarkers will have one or another version of cross-validation. The most general version is V-fold cross-validation, and it is performed as follows. The cases in the learning set are randomly divided into V subsets of nearly equal size. We leave one subset out to serve as our test set; the rest comprise the learning set for building the classifier. We build the classifier, compute the error rate on the left-out subset, and then repeat over all V subsets and average the error rates. A special case is leave-one-out cross-validation, where V is equal to N, the number of samples in the training set: we leave one sample out, train the classifier on the rest, test it on the one left out, and repeat N times. There are a couple of notes I would like to mention. There is a bias-variance trade-off, so a smaller V can give larger bias but smaller variance. Also, cross-validation is quite computationally intensive, which needs to be kept in mind when working on such things.
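A sketch of V-fold and leave-one-out cross-validation as just described, on simulated data with k-NN standing in as the classifier:

```python
# V-fold CV: split the learning set into V parts, hold each part out in turn,
# train on the rest, and average the V error rates. With V = N this is LOOCV.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 20))
y = (X[:, 0] > 0).astype(int)

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(1 - clf.score(X[test_idx], y[test_idx]))
print("5-fold CV error:", np.mean(errors))

# Leave-one-out (V = N). More folds: smaller bias, larger variance, more compute.
loo_errors = [1 - KNeighborsClassifier(n_neighbors=3)
                  .fit(X[tr], y[tr]).score(X[te], y[te])
              for tr, te in LeaveOneOut().split(X)]
print("LOO CV error:", np.mean(loo_errors))
```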
This slide summarizes all the common steps in the process I've described of building and validating a classifier. We again start with the learning, or training, set. We build a classifier. We perform cross-validation of that classifier using the learning set. Then, ideally, we evaluate its performance on an independent test set, and all these performance assessments contribute to the overall performance of the classifier. Just one more note: the learning and test sets should be identically distributed, so if there are additional factors such as race, age, or the status of certain markers or hormones, these should be taken into account so that the sets are identically distributed.

When we build a classifier, we would like to compare it with other classification methods, and for this purpose we use ROC curves. The receiver operating characteristic curve is commonly used to evaluate the performance of a method in comparison with other methods. It is simply a graphical plot of the sensitivity, or true positive rate, versus the false positive rate. This is an actual example of ROC curves, in this case for multiple different methods. The rule of thumb is that the curve that approaches the upper left corner comes from the best-performing method. This is the ROC space, and the diagonal line is the no-discrimination line: any ROC curve lying beneath it means the method does not discriminate the classes at all. Everything above the no-discrimination line does discriminate the samples or classes, and the perfect classification is this point, where we have a very high true positive rate and a minimal false positive rate.

ROC curves are widely used, and I think many of you have seen ROC curves somewhere at some point, but the question is whether we realize how these curves are constructed; I think it's important to understand this in order to interpret your ROC curves properly. How they are constructed is summarized on this slide. For example, we have two classes, or two groups of patients, with distinct distributions of a classifier value; for example, let it be gene expression. This information is given and does not change in a given study: these are two subgroups of patients with two distinct distributions of gene expression of our classifier. Very often these distributions overlap, and it's important to find a threshold for calling gene expression above normal, which would be an indication of cancer, for instance. The way ROC curves are generated is that we fix this threshold somewhere, and based on this threshold we have a certain number of true cancers and true normals, and of false cancers and false normals, based on the expression of that particular gene. With this fixed threshold we construct the confusion matrix for this one instance of the threshold, and this confusion matrix gives us one point on the ROC curve. Then we reiterate the process: we move the threshold here or there, construct the next confusion matrix with the corresponding true positives and true negatives, and get another point on the curve. So that's how, through multiple iterations of shifting the threshold of our classifier, we build the ROC curve.
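Here is that construction written out as a sketch: two overlapping, simulated distributions of a classifier value, a threshold swept across them, and one (false positive rate, true positive rate) point per threshold. The area under the curve, computed at the end, is the single-number summary discussed next.

```python
# ROC construction by threshold sweeping, on simulated classifier values.
import numpy as np

rng = np.random.default_rng(4)
normal = rng.normal(0.0, 1.0, 200)    # classifier value in true negatives
cancer = rng.normal(1.5, 1.0, 200)    # classifier value in true positives

points = []
for thr in np.linspace(-4.0, 6.0, 101):
    tpr = np.mean(cancer > thr)       # sensitivity at this threshold
    fpr = np.mean(normal > thr)       # 1 - specificity at this threshold
    points.append((fpr, tpr))         # one confusion matrix -> one ROC point

points.sort()
fpr, tpr = np.array(points).T
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))  # trapezoid rule
print(f"AUC = {auc:.3f}")
```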
So how do you choose the best-performing method? The best-performing method is basically chosen on the basis of the ROC curve, and, sorry, I didn't spell this out: you can summarize the ROC curve with a single metric, for instance the area under the curve. So there are two purposes of the ROC curve. The first is to select your method: you select it based on a metric such as the area under the curve, or some people prefer to use the complement, the area above the curve, and you pick the method with the largest area under the curve. The second purpose is to select the threshold for your classifier. For instance, you have a distribution of gene expression: where do you want to set the threshold above which you will say, okay, my gene is overexpressed? This is for the ultimate clinical test that will use this gene's expression. For instance, you can say: I will use this threshold here, because I would like to achieve this true positive rate and I can still tolerate this false positive rate. This trade-off between the true positive rate and the false positive rate is something clinicians need to decide, in terms of disease management and side effects of treatment: what is the best threshold to choose for a particular clinical test. So, two main purposes of ROC curves; thank you for bringing this up.

I am finishing this part of my talk, and I would just like to mention that if you plan to move into a classification and discrimination research project, I think it will be very helpful to read these landmark papers, published over the last decade or so. They concern different cancer types and use slightly different methods, so I think it's very useful to read through them if you are to embark on a research project like that; these are useful references for you.

So we're done with the major strategy of how to build classifiers and how to evaluate their performance, and now I would like to introduce you to a special and very important research field, which receives a lot of attention and is a very important part of translational research in conjunction with clinical trials: predicting the response to therapy, or drug response studies, as they are often called. This area of research does involve the same kind of classification studies as I described, such as whether the patient will respond or not to a given therapy, but in reality there are a number of subtleties and challenges that make this field a special case, and I would like to highlight them for you.

Patients with clinically similar diseases often respond differently to the same therapy. Just yesterday, you heard quite a striking example of a patient whose cancer is characterized by the same mutation that indicates a particular targeted therapy, a therapy that works for other cancer types and other patients, but still does not work for that given patient. This is certainly due to a number of factors that have been described throughout this course; for cancer patients in particular, it is due to the different molecular determinants of the disease, the different subsets of molecular abnormalities that occur in each individual tumor. Cancer treatment will be effective if it targets the disease-causing genes in a given patient, and treatment is also associated with toxic side effects. So it is very important, in the post-diagnosis phase of patient management, to choose the appropriate therapy and to be able to predict whether it will be effective.
This involves the development of markers of drug response and then prediction: either predicting whether the patient will respond at all, in which case it is a classification problem, or predicting the actual amount of response, in which case it is a regression problem, which is also possible; it's a bit more challenging, but it's possible.

This diagram shows the ideal scenario that I already showed you before, where we have two independent sets of samples, and these are clinical samples, so they come from patients, and ideally the patients in the two independent cohorts are treated with the same therapy, the same drug, with the same regimen. Unfortunately, this rarely happens in reality. What researchers are more commonly presented with is the need to build the classifier, or predictor of response, using cell lines, and then evaluate or test its performance on patient samples. There can also be another scenario, where we are dealing with mixed cancer types and then need to predict response in a single cancer type. There is also the possible scenario of building a predictor of response using a cohort treated with combination therapy and then evaluating it on a cohort treated with monotherapy, and vice versa. So different scenarios are possible.

When it comes to the response itself, different metrics can be used. For cell lines it can be something simple such as GI50 (growth inhibition by 50%), TGI (total growth inhibition), or LC50, etc., and the researcher would then set a certain threshold on these metrics to call resistant lines and sensitive lines. For patients it may be a bit more complicated: it can be time to progression, overall survival, an assignment based on post-treatment tumor volume, pathological complete response, residual disease, etc. What's important to keep in mind is that the performance evaluation of predictive markers should be based on the same response metrics: whenever you have your training set and your test set, the same response metric has to be chosen for the individual sets of patients. That's quite important, because the performance of a classifier will depend on the chosen metric.

Question: what exactly is GI50? GI50 is growth inhibition by 50%: the cells are treated with the drug, the inhibition of cell growth is measured, and the concentration giving 50% inhibition is what is usually used as the metric.
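As a tiny illustration of that thresholding step, here is a sketch with entirely made-up line names, GI50 values, and cutoff, not numbers from any real screen: pick a cutoff and call lines below it sensitive, above it resistant.

```python
# Dichotomizing cell lines by a response metric such as GI50 (hypothetical values).
gi50_uM = {"lineA": 0.05, "lineB": 0.8, "lineC": 12.0, "lineD": 0.2}
cutoff_uM = 1.0   # a study-specific choice; results depend on it

labels = {line: ("sensitive" if gi50 < cutoff_uM else "resistant")
          for line, gi50 in gi50_uM.items()}
print(labels)     # lineC is called resistant, the others sensitive
```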
So, I will say a few words about cell lines as model systems, which are widely used for drug response studies. This is an invaluable resource, and there are pros and cons to using cell lines in such studies.

The pros include the following. Cell lines are more readily available than patient cohorts, they are very easy to manipulate, and they provide an unlimited supply. They are also often very thoroughly characterized models, so a number of different molecular profiles are usually available for them. Cell lines enable prediction of response to novel therapies, single or in combination, which is not always possible in patients. They also allow high-throughput screening of thousands of new drugs; these are model systems, and I will show you what resources are available for that. We can also identify new uses for established agents, which is otherwise quite challenging, because going across cancer types requires FDA approval, a very costly and time-consuming process. And cell lines can be used to provide useful insights into mechanisms of drug action.

The cons of using cell lines certainly exist. Among them is the fact that cell lines represent a small minority of heterogeneous tumor populations and are not available for all cancer types and subtypes. Another pitfall is that cell lines are usually selected and adapted for cell culture conditions, and they may not respond to drugs the same way as cancer cells growing in a human host. Drug exposure in vitro does not mimic the kinetics of drug exposure in human tumors, which is influenced by a number of physiological factors. And validation of a signature in patients may actually be quite difficult, because most clinical efficacy trials are carried out with drug combinations, whereas researchers very often start with single-agent treatment of cell lines. That's a very good question; I will show you some examples.

There are resources that provide valuable information on the response to a wide range of drugs. One of them is the Developmental Therapeutics Program (DTP), established at the NCI in the 1990s, whose main goal is to identify and evaluate novel chemical and biological mechanisms of action. This resource is built on the NCI-60 cancer cell line panel, which represents a certain number of cancer types, and the project is designed to screen a large number of compounds, some 3,000 compounds per year, for potential anti-cancer activity. I also have to mention that recently, this year, Novartis Pharmaceuticals in collaboration with the Broad Institute released another extraordinary resource, the Cancer Cell Line Encyclopedia, which comprises some thousand cancer cell lines treated with a large number of widely used anti-cancer agents. This is a public resource that everyone can use to perform drug response studies; it is really invaluable.

So, various gene signatures and sequence alterations in target genes have been used for the prediction of drug response in patients. Many of these markers were both trained and tested on primary tumor samples, which is great, and they showed nice reproducibility. If their reproducibility proves to be stable across many independent data sets, then they can provide new ways to design and carry out clinical trials.
This table shows the many studies that have been published using patient cohorts. At the same time, the disadvantage of these types of studies, involving clinical specimens, is the limitation in the number of chemotherapeutic drugs that can be tested, as well as the dependence on well-established and approved drug therapies. To circumvent this, many groups have been using preclinical models, cell lines and sometimes xenografts, to investigate gene expression profiles associated with sensitivity to certain drugs, and this can be done for hundreds or even thousands of different drugs. This table shows you just the tip of the iceberg of the studies published in this area.

There is no gold standard at this point for how to perform drug response studies, so what I will do is show you a few examples, so you can see what challenges people are presented with and how they try to address them.

One example is from the M.D. Anderson Cancer Center group, which has a pharmacogenomic marker discovery program that studied the treatment of breast cancer with combination therapy. They have published a number of papers, which you can see here; I will highlight just a couple of them.

The first one; here you see the study summary. The group embarked on developing predictors of pathological complete response to combination therapy using 82 breast cancer patients, with validation on some 50 independent cases. They evaluated the use of different classifiers, such as SVM, k-NN, and linear discriminant analysis. They also varied the number of features used in the classifier, with the goal of picking the most appropriate classifier size; they examined the effect of training set size on model performance; and they compared genomic predictors versus clinical predictors. Here you see that they tried a pretty large combination of classification methods with a varying number of genes included in the classifier. Along the x-axis is the number of genes included in the classifier, and on the y-axis is the area above the curve, which, as I said, should be as small as possible for the best-performing method; each individual curve represents one classification method. They concluded that k-NN performed best, with a classifier size of some 30 genes. So that's how they approached this question.

Now, it is certainly great to have a large training data set, but one is not always readily available, and the big question is what the size should be so that you can get well-performing markers: what is the minimum size of a training set one should work with? Here they tried to look into this tough problem. They put both sample sets together, set aside 20 samples as a test set, used the rest for subsampling and training, did some 50 iterations, and computed the median prediction accuracy, represented by the red dot. As they increased the size of the training set, they could see the performance steadily improving, and by extrapolation they concluded that, based on this data, a training set of some 200 cases would be only marginally better than a set of 80 cases.
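A sketch of that subsampling idea, to make the procedure concrete; the data, model, and sizes here are illustrative stand-ins, not the study's: hold out a fixed test set, then repeatedly train on random subsets of increasing size and record the median test accuracy per size.

```python
# Learning curve by subsampling: median test accuracy as training size grows.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(130, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=20, stratify=y, random_state=0)   # 20 samples set aside

for n in (20, 40, 60, 80, 100):
    accs = []
    for _ in range(50):                               # 50 random subsamples per size
        idx = rng.choice(len(y_pool), size=n, replace=False)
        clf = LinearDiscriminantAnalysis().fit(X_pool[idx], y_pool[idx])
        accs.append(clf.score(X_test, y_test))
    print(n, round(float(np.median(accs)), 3))        # median accuracy vs size
```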
Well, I have to say that this was quite a creative approach, and probably the best they could do with the data they were presented with, but it should be taken with a good grain of salt, because the errors here were not modeled well enough; it is an approximation, and a very optimistic one. In reality, in order to estimate the actual performance of markers, one needs not only to start with a reasonable training set, but the markers also need to be evaluated on multiple independent test sets, to convince yourself that the classifiers really work well. But this is what they did, and I think it was quite creative.

Another thing they did is compare genomic versus clinical predictors. They used the LDA method for both types of data, and for the genomic classifier they used the 30-probe classifier I showed you before. Here are the sensitivity, specificity, PPVs, and NPVs, for the clinical variables and for the genomic predictors. From what you can see, unfortunately, the genomic predictor was no better than the clinical variables: for sensitivity it was a bit better, but you see the confidence interval is pretty large; the PPV was even worse; the NPV was a bit better. Overall, we can conclude that the genomic predictors were, unfortunately, not better than the clinical predictors, and this highlights how challenging this area of research is.

Let me give you yet another example from that group, published in Clinical Cancer Research in 2010. The motivation for that study was to investigate to what extent cell line-derived markers can predict drug response outcomes in patients; this answers your earlier question. The way the study was designed is that they had breast cancer cell lines measured for response to the individual drugs listed here, so monotherapy, and then the derived signatures were evaluated on clinical specimens from breast cancer patients who were, however, treated with combination therapy: a highly challenging task. This is how they proceeded through the study. They assigned cell lines as either sensitive or resistant to the drugs, and then found informative genes, the feature selection step, in two ways: correlation with GI50, or differential expression with significant p-values between sensitive and resistant lines. Then they constructed a multi-gene classifier using the LDA approach to predict clinical response, with the metrics listed here used as the response metric in patients.

So what was the conclusion? They could not develop cell line-derived predictors for the four individual chemotherapy drugs that could predict response to combination therapy in patients, and this is probably not surprising now. It was a very painful experience back then, but what is becoming clear is that drug response studies have raised more questions than they have answered. It is still unclear whether human cell lines are an adequate starting point for developing markers of response that can later be applied to patients. Then, what metrics should be used for cell lines: is it IC50, is it GI50, is it some combination of all the metrics; should the whole response curve be taken into account? And then there is another very important question: what is the best measure of response in vivo? As I described, for patients there are multiple different metrics one can use, and this is a very tough question. I would just say that the way to go is to be consistent between data sets and stick to one response metric, and that will be the solution for now.
Also, a bit more of a fundamental biological question, I would say, is whether signatures of response are specific to a particular cancer type or can cross cancer types, so that they can be applied to other cancer types. These are questions that still exist, and people are still trying to address them.

Question: when we say breast cancer cell lines are used, which breast cancer cell lines are they? There are so many, with different genotypes, so different breast cancer cell lines will provide different results.

Yes, certainly. There is only a limited collection of breast cancer cell lines, and this is true for any cancer type, and the more comprehensive your collection, the better: you can find more significant associations between response and molecular profiles if you have a comprehensive collection. But at that point, I think they had a collection of some 30 cell lines, maybe 30 to 50, about that range, and this is all they had.

Question: so you can look at the expression profile of the cancer and see whether a given cell line is a better model for it?

Yes, exactly. If you know your patient cohort pretty well and you know, for instance, that it is represented by certain subtypes, then you would like to have the same subtypes represented in your cell line collection as well, to account for differences in response between subtypes. If that's possible, it's great to take it into account; if it's not possible, and often it's not, people just start with whatever they have, and that in part explains why it is so hard to go from cell lines to patients. But people really need to do this, because there is no other patient cohort and they need to start with something. And if you have a collection of only some 30 clinical samples, you can't really imagine splitting that data set into two smaller sets and using one for training and another for testing. So it's a trade-off that the researcher has to resolve, deciding what the best strategy would be. Any other questions?

Question: in the study you brought up, with respect to the clinical approaches, I'm wondering whether you might expect the use of cell lines and their application to change a bit, because there are a lot more targeted therapies now, and it seems like maybe you would expect the extrapolation of response to change.

I'm not sure I'm following your question; is it regarding whether it would be more appropriate to stick to monotherapy, and whether cell lines may be more promising in that setting?
Question continued: with the study you profiled, I am not familiar with a lot of those agents, but they are chemotherapeutic agents, which means they are not targeting specific pathways; so I am wondering, with many of the drugs now being developed being very targeted, whether the potential to use cell lines might be a little more promising, if for instance specific pathways are involved. Well, it depends on the drug, actually, on the particular compound. Even if a drug is targeted, it is not always the case that the response is defined by a very limited number of pathways or by critical nodes in a pathway. Take the PI3 kinase inhibitors: there is a wide range of isoform-specific PI3 kinase inhibitors, and the response to those highly targeted compounds can be quite complex, defined by the particular genotype and the spectrum of molecular characteristics of a given cell line or tumor, and very often it can act through totally different pathways; the PI3 kinase pathway is quite complex in this regard. Another example: chaperone inhibitors, such as the HSP90 inhibitors pursued by Novartis and other pharmaceutical companies, may be even more complex, because chaperones maintain the structure of many client proteins, which may belong to vastly different pathways, so response to those inhibitors may be even more challenging to predict. But this is an area of research that receives a lot of attention, and I think that with more people devoted to this topic, and with resources such as the cell line encyclopedia released by Novartis and the Broad, we will be better positioned in a few years to answer these questions. What I also want to highlight is that this is a very challenging field, and it is certainly not devoid of errors, and sometimes these are quite sad errors. I will tell you a story about a research program at Duke University, from a group in biostatistics, bioinformatics and genomics, which published a number of papers devoted to developing predictors of response from cell lines and patient samples. As you see, they published a number of papers over these years, often in very high-end journals such as Nature and Nature Medicine. To give you an idea of what it was about: in 2006 this group developed a method that appeared able to build promising predictors of response from the NCI-60 cell line panel, validated on patient tumors for individual drugs, and from what you can see, the sensitivity, specificity and PPV were actually quite impressive. The approach was then named by Discover Magazine as one of the top six genetic stories of 2006. In 2007 they used the same approach to develop predictors of response to other drugs, and this spawned clinical trials in the U.S.
In those trials, subjects were assigned to one arm or the other, a combination of pemetrexed with gemcitabine or cisplatin with gemcitabine, using a genomic platinum predictor to determine chemotherapy sensitivity. Then in 2007 they provided a so-called validation of the combination approach, using it to predict patient response to other, alternative therapies, and this report was a sub-study of yet another clinical trial, this time in Europe. In 2009 they used the same approach to construct a signature for yet another drug. So overall, over these years, the method that this group had introduced and had been using provided good predictions in independent test sets, had some biological plausibility, appeared to give stable results over years of application, and was in principle applicable to guide treatment in patients. What happened next is that already in 2007 papers started to emerge suggesting that the Duke University studies might be flawed, because the findings seemed very hard to reproduce and the method itself seemed quite convoluted, and in 2008 and 2009 a group of statisticians from MD Anderson published a number of reports that were complete rebuttals of the Duke University research. They published devastating details of flawed experiments and flawed study designs. Among the details, from what you can see, it is just extraordinary: labels reversed, sensitive versus resistant; incorrect use of duplicates; mislabeled samples; an off-by-one indexing error in all the gene lists reported, which very often happens when you move from Windows-based applications to Linux-based applications; and reports on drug A that included a heat map for drug B and a gene list for drug C. These were things evident simply on visual inspection of the data, and the list went on and on; I will show below how innocuous such an off-by-one error can look. This certainly resulted in the termination of the clinical trials and the retraction of papers, so it was a quite painful and quite shameful experience, and this event led to the emergence of a special field which is now called forensic bioinformatics, in which prominent biostatisticians review published research papers, try to reproduce the reported findings, find pitfalls, and report what they find, which I think is a really great thing to do given such a sad example as the long-standing Duke University research. What was interesting for those statisticians from MD Anderson to see is that when they took some 18 quantitative papers published in Nature Genetics within a couple of years, reproducibility was not even achievable in principle in 10 cases out of 18, which is really very disappointing. One major theme emerged: the most common errors were simple, such as row or column offsets or off-by-one shifts, and conversely, the simplest errors were also the most common, and these could lead to really horrible consequences. So it is obvious that there are a number of challenges in developing markers, specifically in marker validity, and people recognize that even though many multi-gene predictors have been published, some with apparently very robust performance, high accuracy and high sensitivity, very few are in clinical trials right now and even fewer are in clinical practice, and the reason for that is poor reproducibility due to problems in study design and interpretation, as in the examples I have given you.
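To see how innocuous the most damaging error on that list can look, here is a toy, entirely hypothetical illustration of an off-by-one indexing mistake: a gene list numbered from 1, as in a spreadsheet, used directly against a 0-based array.

```python
# Toy illustration of an off-by-one indexing error: a 1-based gene list
# (as numbered in a spreadsheet) indexing a 0-based array silently shifts
# every reported gene by one row. All data here are hypothetical.
import numpy as np

expression = np.arange(40).reshape(10, 4)  # 10 genes x 4 samples
gene_rows_1based = [1, 3, 5]               # rows as numbered in the spreadsheet

wrong = expression[gene_rows_1based]                   # actually spreadsheet rows 2, 4, 6
right = expression[[g - 1 for g in gene_rows_1based]]  # the intended rows 1, 3, 5
print((wrong == right).all())                          # False: every gene is off by one
```

Nothing crashes and nothing looks suspicious, which is exactly why this class of error survives into published gene lists.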
You could clearly see that one thing to keep in mind when designing a study like this is that only one factor should differ between the compared groups, such as the treatment or the infectious organism, and everything else, age, race and so on, should be equal; only then can we say that the difference between the groups is attributable to the agent and not to bias. The general guideline is that the method description should be detailed enough for reproducibility and correct interpretation. People recognize three main threats to the validity of markers, which I will just mention: chance, generalizability and bias. Generalizability refers to the very important requirement that biomarkers generalize to other data sets, to other patient cohorts; this is an aspect that everyone is talking about and thinking about how to address. There is one nice paper from 2005 that describes these challenges to marker validity, explains what they are, and gives possible solutions to these problems; I will refer you to that nice review and will not go into detail over these points now. What I also want to mention is that, after such a sad example as the Duke University research, the NCI mandated the Institute of Medicine to come up with a set of regulations and guidelines on how to proceed from biomarker development through validation and on to advancement into clinical use. In March of this year the Institute of Medicine published a 300-page report on this topic, called Evolution of Translational Omics, and if you are interested you are welcome to download this very detailed report, with very nice guidelines on how to proceed from discovery all the way to clinical application. This diagram summarizes very nicely the whole strategy that people are now trying to regulate quite tightly, to avoid problems such as the one at Duke University. Basically you can think of it as two major phases: one phase is discovery and validation, and the second is the evaluation of clinical utility. Before proceeding beyond the bright line, the bright yellow line you see here, to evaluating the clinical utility of markers through clinical trials, the researcher, and now also the institution in which the researcher works, is responsible for conducting the discovery and validation of biomarkers in the correct way, which is outlined in this diagram. The painful lessons from Duke University have been learned and are widely recognized now. It is clear that very often in research studies of this sort it is unclear what lines of accountability exist; very often there is a lack of strong data management; and in particular it is very important to lock down the specific computational method that underlies the development of the marker classifier. Researchers tend to tweak computational methods as they go, so the method evolves; this should not happen, and one of the specific instructions is that the computational methods have to be locked down at the stage of biomarker validation, and then adequate validation should take place. Validation of a test with a potential future clinical application, one that will need to be approved by the FDA, has to go through a very rigorous validation process at a laboratory that is CLIA certified; CLIA is the Clinical Laboratory Improvement Amendments, a set of regulations for how a laboratory has to be set up and what reagents and equipment it should be using.
So the validation should take place in a CLIA-certified laboratory, and only after these two phases are correctly completed can we cross the bright line and go into clinical trials; there are different types of clinical trials, which are outlined here. So these are the guidelines, and this research area is now becoming quite closely regulated and monitored to avoid problems, which is a great thing to see. In the next slide I will just show you very briefly a list of tools that are available out there for research projects like this; this is just for your reference. And this slide summarizes what we have learned during the lectures you have heard so far, so let me go through it very quickly. Despite the progress in the marker development field, we still need good new biomarkers. We have a choice of many classification and discrimination methods, and it is a matter of personal choice and, very often, of the training one has. Feature selection is an important part of the classification procedure. Another point is that performance assessment is absolutely crucial for classifier development; it consists of several important components and has to be performed correctly, and the report I mentioned contains guidelines on how to approach this task as well. It is also important to recognize that there are threats to marker validity, and to work out a study design that minimizes those threats. So what can we do now with the knowledge we have just acquired? What I would like to do is give you a few examples of incorrect questions and analyses, and then I would like you to contemplate them a little and perhaps come up with your own correct analysis, and we will work on this together. The first question, for instance: does the molecular profile show clusters by survival? Here is what the researcher does: select a subset of genes with significant differences between survival groups, patients who survived long or short periods of time, often referred to as long and short survivors; then cluster the profiles on these genes only, and obtain clusters by survival status. Is this the correct type of analysis? No, this is incorrect. Why? The genes were already selected by their differences between survival groups, so of course they will cluster according to those groups; this is an expected result. Another example: let us build a classifier for a rare subtype of cancer with a disease prevalence of 0.2, where prevalence is the fraction of the disease in a population, so 20% of the population may have this cancer subtype, and assess its performance using cross-validation. The approach is the following: select equal numbers of patients with the rare subtype and the common subtype, which makes the apparent disease prevalence 0.5; develop a classifier; the classifier shows high sensitivity and specificity; and conclude from this that it is applicable as a diagnostic test. Again, this is an incorrect approach. Sensitivity and specificity do not depend on the population distribution, so they will look fine either way, but remember, as I mentioned, that for this purpose we need to take the prevalence into account, and one needs to pay attention to the PPVs and NPVs, the positive and negative predictive values, to begin with, and certainly to whether the distribution in our sample set matches the distribution in the population, that is, the disease prevalence. A small worked example of this point follows just below.
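Here is that worked example, with illustrative numbers rather than values from any particular study: sensitivity and specificity stay the same, but the positive predictive value collapses once the classifier is applied at the true prevalence of 0.2 instead of the balanced 0.5 used in training.

```python
# PPV and NPV from sensitivity, specificity and disease prevalence, via
# Bayes' rule. The sensitivity and specificity values are illustrative.
def ppv_npv(sens, spec, prev):
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

sens, spec = 0.90, 0.90
print(ppv_npv(sens, spec, prev=0.5))  # balanced sample: PPV = 0.90
print(ppv_npv(sens, spec, prev=0.2))  # true prevalence: PPV drops to ~0.69
```

The same 90%/90% classifier that looks excellent on the balanced set would mislabel almost a third of its positive calls in the real population.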
Another example: compare responders versus non-responders with respect to survival experience. In clinical trials, patients are often defined as responders according to whether the tumor shrank, and as non-responders if the tumor did not change in volume, so the tumor did not shrink. A researcher compares the survival experience of responders versus non-responders using Kaplan-Meier curves and concludes that the treatment is useful. This is yet another incorrect approach, because there are a number of biases here: first, patients had to survive a certain period of time to achieve a response, and second, we do not know how long a patient would have survived without therapy, maybe longer than with therapy, so such a conclusion cannot be made, and this specific type of analysis has actually been banned by many journals. Now let us think of correct analyses for these tasks. What would you do for the first question, when we want to know whether the molecular profile shows clusters by survival? Any ideas? Please, no peeking into the following slides, because they have the solutions; can we just pause on this slide for a second and think. Answer from the audience: you can run any feature selection without looking at the labels and then cluster; the key is that you are not allowed to look at the labels. Exactly; did you read the following slide? Very good. Yes, that is exactly what we want to do: we perform unsupervised clustering on genes that are differentially expressed across all of the samples, with no regard to survival groups or any other outcome, just differentially expressed across all of the samples, and then we see whether there is a difference in survival between the observed clusters. This is similar to the example I gave you earlier: you start with unsupervised learning and then you ask whether your subgroups of patients, based on molecular profiles, have a clinical meaning (I will show a small sketch of this corrected analysis at the end of this discussion). Now let us think about the second one. Question from the audience: could you also randomly assign patients to two groups and see if there is a survival difference? Well, the question here is whether molecular profiles are associated with survival at all. For that purpose you are not yet answering the question of what the best molecular descriptors of your different survival groups would be; for that you would need a training set and a test set. But just to answer whether the molecular profile may be associated with survival, you do unsupervised learning first and then see whether your de novo discovered subgroups, based on molecular profiles, have clinical meaning, that is, different survival. OK, so what do we do for the second task? Answer from the audience: we should normalize for the prior probability distribution, so either the training set needs the same distribution as the population, or we apply the prior knowledge to the posterior. Exactly; we need to make sure that our sample set is representative of the population, that is for sure. And what do we do for the third one, comparing responders versus non-responders with respect to survival experience? To begin with, it is an incorrect question; what would be a more correct question to ask? For instance, can you envision comparing patients who are treated versus untreated? What do you think you would see? In principle, you would expect better survival for patients who are treated versus those given a placebo, and if you do see that difference, then you may conclude that the treatment is probably helping these patients.
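Coming back to the first corrected analysis, here is a concrete sketch of it, assuming synthetic data and the lifelines package for the survival test: cluster patients on expression alone, without ever looking at survival, and only then ask whether the discovered clusters differ in outcome.

```python
# Sketch of the corrected analysis: unsupervised clustering first, survival
# comparison second. Data are synthetic; assumes lifelines is installed.
import numpy as np
from sklearn.cluster import KMeans
from lifelines.statistics import logrank_test

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2000))      # 100 patients x 2000 genes
time = rng.exponential(24, size=100)  # follow-up time in months
event = rng.integers(0, 2, size=100)  # 1 = died, 0 = censored

# Unsupervised step: survival labels are never shown to the clustering.
cluster = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Only now do we ask whether the de novo subgroups have clinical meaning.
result = logrank_test(time[cluster == 0], time[cluster == 1],
                      event_observed_A=event[cluster == 0],
                      event_observed_B=event[cluster == 1])
print(result.p_value)
```

The log-rank test used here is exactly the one introduced in the survival analysis part that follows.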
Returning to the treated versus untreated comparison: this is the type of question you can ask in these circumstances. And then, when it comes to comparing responders versus non-responders, what can we do with it? This speaks to the material on markers of response that I talked through earlier: we can always use the molecular profiles of patients who respond and those who do not respond to develop markers of response; that is what you can do when presented with data or a collection of samples like this. OK, if there are no questions we can finish with this part. Here I am giving you a list of references, just for your reference; these are the ones I used in my presentation, and they are quite good papers to go over. Now we will switch to the next part of this module, which is devoted to statistical methods commonly used for the analysis of clinical data, and more specifically to survival analysis. In the first part I will go over clinical data and survival analysis theory, which will take just a little time, about 20 minutes or so, and then we will start our lab, in which you will apply your knowledge to a specific data set of cancer patients with associated clinical data. When we are dealing with clinical samples, we have different kinds of data associated with them. We have whole-genome and whole-transcriptome molecular profiling data, about which you have heard a lot before, and we also have clinical data, which is very tightly linked to the whole-genome and transcriptome data and yet is intrinsically different. Clinical data include a number of clinical characteristics, such as the ones you see on the slide: patient ID, race, family history of disease, nodal status, for instance whether there are any metastases in the lymph nodes, whether the patient has gone through radiation, chemo or hormone therapy, the stage of the tumor, the size of the tumor, the age at diagnosis, and so on. Another type of data is part of the clinical data but significantly different in nature: it arises when interest is focused on the time taken for some event to occur. One of the most common sources of such data is when you record the time from some fixed time point, such as surgery, to the death of the subject or to disease recurrence. These are referred to as survival times, or survival data, and they require a special set of statistical tools; we usually refer to this type of analysis as survival analysis. Survival analysis has three main components, or applications, that are often used by researchers and commonly seen in the literature. One goal may be to estimate the probability of an individual surviving a given time period, for instance one year, given certain characteristics of the tumor and the patient's physiological condition; to answer this question people use Kaplan-Meier survival curves, or life tables. Another goal may be to compare the survival experience of two different groups of individuals, for instance patients treated with a drug and patients given a placebo; for this type of analysis we use the log-rank test, which compares different Kaplan-Meier curves. You have a question? Can it be used for more than two groups? Yes, you can use it for more than two. Another goal of survival analysis is to detect which of many clinical, genomic or epidemiologic variables contribute to the risk of, or are associated with, poor outcome.
The other two methods cannot handle many variables at a time, and for that there is the Cox regression model, which becomes multivariable when we investigate the contribution of multiple variables to the overall risk. So, in clinical studies, what are survival times and how are they defined? They are defined as the time from a fixed starting point to a fixed end point. Examples of starting points include surgery, with the end point being death, recurrence or disease relapse; another example of a starting point is diagnosis, where the end point can again be one of the events in the table; another example is treatment, so the time when treatment started, with the end point being the time the disease recurred. There is one intrinsic feature of survival times that makes them unique and unsuitable for common statistical methods: we almost never observe the event of interest in all of the subjects. In some subjects the disease may recur; in others the disease has not recurred by the end of the study, and in statistics these are called incomplete, or censored, observations, so we have censoring of the data, and this kind of data needs special analytical techniques. So what are censored observations? It is important to understand what they are. They arise whenever the dependent variable of interest represents the time to a terminal event and the duration of the study is limited in time. As I mentioned, a censored observation is, from the statistical point of view, an incomplete observation: the event of interest has not occurred by the time of the analysis. What could examples of censored observations be? They are given here in the table. For instance, if the event of interest is death from the disease, a censored observation is a subject in whom that event has not occurred. In social studies, if the event of interest is the end of a marriage, the censored observations are couples still married, so the event of interest did not occur. Another example is the drop-out time from school for a student: a student still in school is a censored observation for that kind of study. There is type I and type II censoring; I will just briefly say that type I is the most common type, in which the time of the study is fixed rather than the proportion of subjects, so in clinical and medical research type I censoring is the most common. Then there is right and left censoring, as well as interval censoring, which I will describe in a second. Question from the audience: sorry, you said that for type I and type II censoring the time is fixed? This is a different classification of censoring; it refers to a different aspect of the study design. For type I censoring we follow up patients over a long time and we fix the end point of the study: we say that beyond that point in time we are not going to follow patients any longer, so for whoever is alive at that point in time we do not know when, if ever, they will reach the event of interest in the future; this is type I censoring. There is another type where, for instance, the study is terminated at the point when, say, 50% of the subjects have reached the event of interest, so it is based on a fixed proportion of subjects; this is type II censoring. What is important to remember is that in medical research we mostly deal with type I censoring. So in type II censoring, instead of fixing the time, we fix the proportion of subjects reaching the event.
Now, left, right and interval censoring. Left and right censoring refer to where on the time continuum the incomplete observation takes place. With left censoring, for instance, a patient comes to the clinic with the disease already present; the time the researcher is supposed to measure is the time from when the disease actually emerged in the patient, but the patient has come to the clinic with the disease already ongoing, so the disease started sometime in the past, and exactly when, the researcher cannot say. This is an example of left censoring. An example of right censoring is what I just described: the study is terminated at a certain point in time, and, say, during a clinical trial some of the patients are still alive, and we do not know when in the future, or whether, they will reach the event of interest, whether they will die of the disease or whether the disease will recur; we do not know this at the end of the study, and that is why it is called right censoring. Interval censoring, shown right here, is when the event of interest has occurred within some interval of time and we do not know exactly when. An example: a patient was being followed up with regard to a certain disease, a cancer, and then the patient went abroad and stayed there for about a year, during which time the disease recurred; the patient comes back, sees the doctor, and the doctor can say that the disease has recurred, but not exactly when, only that it happened within this time frame of one year. This is called interval censoring. Now let us see how patients proceed through a study; that will help us understand Kaplan-Meier curves, how they are constructed and how to interpret them. The left diagram shows the time continuum and the patients in the study: it shows the first six months, during which patients were recruited, and then twelve months during which they were followed up. The patients are therefore observed for different periods of time, each patient followed for a different time, with the most recently accrued patients observed for the shortest time, as you can see here; for instance, this patient was observed for the shortest amount of time. What we do then, after the end of the study, at the end of those twelve months, is sort all of the patients according to the time they were followed up, so the length of each line is the time the patient was followed in the study. Here you can see that some of the patients reached the event of interest, the black circles, so these patients died of the disease; the patients with clear circles are censored observations who are still alive; and these ones dropped out of the study, they just moved to another country or something and were not followed any longer, so they are also called censored observations, because we do not know what happened to them. And then, what is the survival probability for a patient? It can actually be calculated as a fraction: at each time interval in which an event of interest occurs, we compute the proportion of patients who survived it out of all the patients still at risk, and that gives us the probability of surviving that period of time, this one; then, in a similar manner, we do the same for the next time interval in which an event of interest occurred, and then the next one.
OK. So the survival probability for a given length of time is calculated from these time intervals, as I just said: the probability of surviving month two is the probability of surviving month one multiplied by the conditional probability of surviving month two given survival of month one; that is the notation written here, so it is basically a conditional probability. Question from the audience: did you remove the dropouts, the two with the stars there, from the probability calculation? I see; we compute these probabilities only for the time intervals in which at least one event of interest occurred, so censored observations do not generate steps of their own; that is how it is calculated. In this kind of calculation, a proportion is computed at every time interval, each time any single patient dies or reaches the event of interest, those black circles on the previous slide, and a series of such calculations makes up a so-called life table, which I am not showing you here, but which is used to generate the Kaplan-Meier curve shown on this slide. The Kaplan-Meier curve is drawn as a step function: a drop here indicates that at this point in time exactly one patient reached the event of interest, died for instance, and it is a step function because the proportions do not change between event times, which is why the curve is horizontal in between. The survival probability formula is right here: it is a running product of the conditional survival probabilities at each distinct event time, excluding censored observations, the product-limit estimate; I will show a small worked sketch of this calculation in a moment. The times of censored observations are usually indicated by tick marks on the curve, which show at a glance the survival times of the surviving subjects; these small tick marks are the censored observations. So how can we use Kaplan-Meier curves? For example, we have two groups of patients. The red curve represents one group, and this is usually referred to as the survival experience of that group, and the survival experience of the red group differs significantly from the survival experience of the other group. What else you can notice is the different shapes of the two curves: this one is a bit smoother than that one, because we have significantly different numbers of subjects in the two groups; here we have far fewer than there. So how can we use this Kaplan-Meier curve? The question is: what is the probability of a patient from the red group surviving 2.5 months? This axis is time, this is survival probability, and the answer is 0.5. Now my question to you: what is the probability of a patient from the green group surviving 2.5 months? Exactly, right here, about 80%. And the more difference there is between the groups of patients, the more the Kaplan-Meier curves separate. So this represents a relatively good outcome, and this one is not very bad; sometimes you see curves that drop very abruptly, and that is a group with a poor survival experience. Question from the audience: sorry, so the censored observations, these tick marks, are they taken into account when computing the survival probability? They do not generate steps themselves, but they are taken into account indirectly, because censored subjects leave the risk set; these are the tick marks here. So, for studies in which we aim to compare the survival experiences of groups of patients, we can construct two Kaplan-Meier curves for the two groups, or for multiple groups for that matter.
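Here is that product-limit calculation done by hand, on a tiny hypothetical cohort, exactly the sequence of conditional proportions just described: at each distinct event time, the running survival probability is multiplied by (number at risk minus deaths) divided by number at risk, while censored subjects simply leave the risk set without producing a step.

```python
# Kaplan-Meier product-limit calculation by hand on hypothetical data.
from itertools import groupby

times  = [2, 3, 3, 5, 8, 9]   # months of follow-up per subject
events = [1, 1, 0, 1, 0, 1]   # 1 = died, 0 = censored

at_risk = len(times)
surv = 1.0
for t, group in groupby(sorted(zip(times, events)), key=lambda p: p[0]):
    group = list(group)
    deaths = sum(e for _, e in group)
    if deaths:                                # a step only at event times
        surv *= (at_risk - deaths) / at_risk  # conditional survival factor
        print(f"t = {t}: S(t) = {surv:.3f}")
    at_risk -= len(group)                     # censored subjects leave the risk set
```

In practice one would use a library such as lifelines' KaplanMeierFitter rather than hand-rolling this, but the arithmetic is the same.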
In theory, we can answer the question of whether survival experiences are significantly different just by looking at the Kaplan-Meier curves, but is there another way, a p-value or some number that describes the difference in survival experience between two groups? The answer is the log-rank test, a non-parametric method to test the null hypothesis that the compared groups are samples from the same population with regard to survival experience; in other words, that there is no difference in survival experience. That is the null hypothesis we are testing: the test will tell you whether there is a statistically significant difference in survival experience, with an attached p-value, but it will not tell you how large the difference is; for that there is another measure, which I will come to shortly. What is behind the log-rank test? It is based on the idea that the entire shape of the Kaplan-Meier curves can be taken into consideration when asking whether they differ. That is done by dividing the survival time scale into intervals according to the distinct observed events, again ignoring censored observations, as I show here; the intervals from the two groups are pooled together, which is why you see many more of them than on the red curve alone: they come from both the green and the red curves. We then compare observed and expected proportions at every time interval and sum over all the intervals, a comparison very similar to the chi-square test, if you remember that basic statistical test. That is how the log-rank test works, and here is the formula for it; you may recognize the notation, it is similar to the chi-square test. The log-rank test gives you a statistic when you compare k different groups of patients, which can be 2, 3, 5 or 7; the terms here are the observed and expected event counts, already summed over all the time intervals on the time continuum, V is the variance, and the resulting statistic is compared against the chi-square distribution with k minus 1 degrees of freedom, just as in the chi-square test, giving a p-value that tells us whether there is a significant difference in survival experience between the groups. A small software sketch of this comparison follows below.
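Here is a minimal sketch of the two-group comparison in Python, using the lifelines package (assumed installed) on synthetic stand-in data for the red and green cohorts: fit a Kaplan-Meier curve for one group, read off the survival probability at 2.5 months as we did on the slide, and attach a p-value with the log-rank test.

```python
# Two-group survival comparison with lifelines; all numbers are synthetic
# stand-ins for the red and green cohorts discussed above.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(2)
t_red, e_red = rng.exponential(3.0, 30), rng.integers(0, 2, 30)        # small cohort
t_green, e_green = rng.exponential(8.0, 120), rng.integers(0, 2, 120)  # larger cohort

kmf = KaplanMeierFitter()
kmf.fit(t_red, event_observed=e_red, label="red")
print(kmf.predict(2.5))  # estimated P(survive 2.5 months) in the red group

result = logrank_test(t_red, t_green,
                      event_observed_A=e_red, event_observed_B=e_green)
print(result.test_statistic, result.p_value)  # chi-square statistic and p-value
```

The test statistic printed at the end is the one compared against the chi-square distribution with k minus 1 degrees of freedom.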
The log-rank test is widely used for comparing survival, and it gives you a p-value, but it is only a hypothesis test: it does not tell you how large the difference in survival experience is. For that purpose we use the hazard ratio, which compares two groups differing in treatment or in prognostic variables, and which measures relative survival in the two groups over the complete period studied, along the entire Kaplan-Meier curve so to speak; this is the simple formula for it. How do we interpret a hazard ratio? A hazard ratio of 0.43 means that the relative risk, or hazard, of a poor outcome under the conditions of group 1 is 43% of that of group 2, so group 1 is doing better, by this much. In the opposite direction, a hazard ratio of 2 says that the rate of failure in group 1 is twice the rate in group 2, so group 1 is the much higher risk group. Now, neither of the methods I have described can account for the contributions of many variables to the risk, and for this purpose we use the Cox proportional hazards model, which is specifically designed for this task: it is used to investigate the effect of several variables on survival experience. It is a multivariable proportional hazards regression model, called a proportional hazards model because it estimates the ratio of risks, the hazard ratio or relative hazard, and there are multiple predictor variables and an outcome variable in the analysis. Let me show you one more slide. This is the formula for the hazard function; it is quite closely related to the survival curve and represents the risk of dying in a time interval after a given time, and it is cumulative, in the sense that we can accumulate the hazard from 0 to time t to get the risk of dying by time t. In the formula, the x's are the independent variables of interest that we put into the model, and the b's are the regression coefficients estimated by the model. There is an important assumption in the Cox regression model: the effects of the variables are constant over time and additive on the log scale. Note also that the hazard function is the risk of dying after a given time assuming survival up to that time, so it is conditional, like the survival function behind the Kaplan-Meier estimate; and the survival probability can be expressed through the hazard function in this manner, so we can plot Kaplan-Meier-style curves from a Cox regression model. The Cox model is actually quite a complex statistical method and must be fitted using an appropriate computer program; the final model yields an equation for the hazard as a function of several covariates, or covariables. A minimal sketch of such a fit follows below.
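This is what such a fit looks like with the lifelines package (assumed installed), on synthetic data; the covariates "age" and "therapy" are hypothetical, chosen only to mirror the kind of table we are about to interpret.

```python
# Minimal Cox proportional hazards fit with lifelines; the data and the
# covariate names are hypothetical, for illustration only.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "time": rng.exponential(24, n),    # follow-up time in months
    "event": rng.integers(0, 2, n),    # 1 = died, 0 = censored
    "age": rng.normal(60, 10, n),      # continuous covariate
    "therapy": rng.integers(0, 2, n),  # categorical covariate: 0 = placebo, 1 = drug
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()  # coef, exp(coef), standard error, p-value per covariate

# Interpretation: a coefficient of -0.05 multiplies the hazard by
# exp(-0.05) ~ 0.95 per one-unit increase, i.e. a 5% decrease in risk.
```

The exp(coef) column in the summary is exactly the exponential transformation of the regression coefficients discussed next.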
So how should we interpret the results of a Cox regression model? Here you see the results of an analysis of a certain clinical trial comparing placebo versus drug treatment, where the chosen model included the six variables shown in this table. How do people usually decide which particular variables to use in a multivariable model? They usually run a univariate analysis with each individual contributing factor first, and the factors that turn out significant are put into the multivariable Cox regression model. So what is the output of the model here? We get the regression coefficient estimated by the model, its standard error, and a very useful transformation, the exponential of the regression coefficient. Two things are important about these coefficients: the sign and the magnitude. The sign tells you the direction of the association with poor survival, positive meaning a positive association with poor survival and negative a negative one. The magnitude refers to the increase in log hazard per increase of one in the value of the corresponding covariate: in other words, if a covariate increases by one, the hazard is multiplied by the exponential of its coefficient. Let us see, for instance, here, where I have expressed it in percent: serum albumin is negatively associated with poor outcome, so if we increase the level of serum albumin by one unit, the hazard becomes 0.95, or 95%, of its previous level, that is, a decrease of 5%, because the association is negative. And here, quite surprisingly, if we increase this variable by one, and in this case it is a categorical variable, 0 or 1, perhaps therapy absent or present, rather than a continuous one, so it can go both ways, we see an increase of 168% in the hazard. So it is important to remember sign and magnitude, and the derivation of these percentages is given here: if you take two values one unit apart and write them out through the formula, you can see how we arrive at these numbers; this is just for your reference, and it is not really necessary to understand it in that much detail. Therapy here is positively associated with survival. And this is the survival function, expressed through the hazard function that comes from the Cox regression model, from which we can then build the curves. Just a few notes to finish: the power of the analysis depends on the number of terminal events, for instance deaths, so higher power requires longer follow-up times; alternatively, one can choose a more frequent end point, such as disease recurrence. Estimating the sample size needed to achieve the required power is actually quite a hard task, and I would encourage you to seek the advice of a biostatistician in this regard. This concludes the theoretical part of the survival analysis; I would just like to take a short break before we move on to the lab.