Okay. Hello everybody. Again, it's a pleasure for me to be here, and it has been an excellent experience for me. I've been involved in the Canadian Bioinformatics Workshops for three years now, and it's been a really educational experience. So I welcome you to the last day of the course, which will be devoted to clinical omics. I will describe a little bit who I am: I am a research scientist at the Vancouver Prostate Centre, which I joined a couple of years ago. I moved from California, where I worked for about eight years on breast and ovarian cancers, first at the University of California, San Francisco, and then at Lawrence Berkeley National Laboratory. My focus is the analysis of high-throughput, high-resolution data, including microarrays and, more recently, next-generation sequencing technology, in particular RNA-seq data, with a focus on the development of biomarkers and new therapeutic targets. What's interesting about me: I find it fascinating to change fields and directions. I think it's really a shame to stick to any single direction and not explore anything else in this world. During my professional career I have changed direction two times. My background is in chemistry: I graduated from the Chemistry Department of Moscow State University, then switched to molecular biology, and then decided to switch again, to bioinformatics, which was completely different in terms of the set of skills, the paradigm, the understanding of things, the background, and the applications. It was actually a far bigger shift for me than the switch from chemistry to molecular biology. So that's who I am. And just to emphasize: I am a computational biologist, as I like to call myself, not a bioinformatician but a computational biologist, and yes, there is a distinction between the two.
Bioinformatics involves heavy-duty algorithm development, whereas computational biology focuses more on the application of algorithms to resolve the biological questions at hand. That's what I do, and you will see that the focus of my presentation today is on the application, the impact on the research question, and the further downstream translational impact. So we'll start. This is Module 8, and it consists of two parts. The first part is devoted to clinical data and the development of markers, discrimination and classification problems, and an introduction to clinical omics. The second part is devoted to the analysis of clinical data, in particular survival analysis. Module 8 Part 1 itself has two parts: the introduction, and then Part 2, which covers going from molecular profiles to classification and prediction, with markers of prognosis and response as a specific area of research under this umbrella. I will start with this first slide, which gives you definitions of biomarkers and therapeutic targets. A biomarker is a biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease. A biomarker may be used to see or predict how well the body responds to a treatment for a disease or condition; it is also called a molecular marker or signature molecule. Now, to remind you what a therapeutic target is: it's a biological molecule, an enzyme, receptor, or other protein, that can be modified by an external stimulus. The implication is that the molecule is hit by a signal and its behavior is thereby changed.
I will actually start with pharmacogenetics and pharmacogenomics, because these are the roots of what we will be talking about today. Pharmacogenetics is the study of the role of inheritance in variation in drug response phenotypes. Those phenotypes can range, as you can see here, across a whole spectrum: from life-threatening adverse effects at one end to a lack of therapeutic efficacy at the other, which is equally serious. Over the past half century, pharmacogenetics has, like all of medical genetics, evolved from a discipline focused on monogenic traits to become pharmacogenomics, with a genome-wide perspective. The earliest experimentally validated examples of an effect of inheritance on drug response were first reported a long time ago, in the 1950s and 1960s, when researchers noted large differences in response to standard drug doses. The story of the short-acting muscle relaxant succinylcholine, which is hydrolyzed by the enzyme BCHE (butyrylcholinesterase), served as an early stimulus for the development of pharmacogenetics. It was observed that a common genetic variation in drug metabolism could result in striking differences in the half-life and plasma concentrations of drugs metabolized by that particular enzyme. This enzyme actually metabolizes a number of different drugs, and it was found that approximately 1 in 3,500 individuals is homozygous for a variant encoding an atypical form of the enzyme, which translates into a relatively inactive enzyme that is unable to hydrolyze the drug. As a result, the drug-induced muscle paralysis is prolonged, and this is a serious side effect.
Almost at the same time, isoniazid, one of the first effective drugs for treating tuberculosis, was developed, and it was followed by the observation that the plasma concentration of this drug showed a bimodal distribution after administration of identical doses to different subjects. It was then found that this was due to differing activity of the enzyme encoded by the gene NAT2, which was responsible for the spectrum of response in different patients. These examples served as a stimulus for a series of further studies, which mostly focused on alterations in drug pharmacokinetics, and this resulted basically from the pursuit of clinical observations of adverse drug responses. This was a typical phenotype-to-genotype research strategy, widely used in human genetics. The following examples are landmarks of the field which marked the transition from biochemical to molecular pharmacogenetics. One specific example of this era that has become a pharmacogenetic icon, and that bridged this transition from biochemical to molecular pharmacogenetics, is the story of CYP2D6, a member of the cytochrome P450 family of drug-metabolizing enzymes. CYP2D6 catalyzes the biotransformation of many drugs, including antidepressants and antiarrhythmic drugs, and it also activates the widely used analgesic prodrug codeine. It was found that genetic variation within this gene influences the response to these drugs in any given patient. A number of variants were found within this gene, including a non-synonymous coding SNP associated with decreased enzyme activity and a gene deletion; at the other end of the spectrum, the gene was found duplicated in up to 13 copies, resulting in increased activity of the enzyme in particular patients.
This slide shows the frequency distribution of the metabolic ratio, measured as the ratio of drug to its metabolite, in a European population. It included a group of poor metabolizers, shown here in red; the extensive metabolizers, which was the major group; and a group of ultrarapid metabolizers. It was subsequently shown, as I mentioned, that poor metabolizers carry an inactivating coding SNP or a gene deletion, resulting in decreased activity of this enzyme, while ultrarapid metabolizers often carry gene duplications that increase the activity of the enzyme significantly. So it is clear from this story that poor metabolizers may have an excessive drug effect with drugs that are inactivated by this enzyme, and an inadequate drug effect with drugs that are activated by it. For example, codeine is transformed by this enzyme into morphine, so codeine will be ineffective in this group of patients. At the other end of the spectrum, ultrarapid metabolizers would show an inadequate therapeutic response to drugs inactivated by this enzyme, and at the same time an excessive drug effect, with side effects, for drugs that are activated. In the case of codeine, that can lead to very serious side effects, including respiratory arrest after treatment with standard doses. So it is becoming increasingly evident that most human diseases, and responses to therapy, are governed by many factors, whereas in the era of pharmacogenetics the main focus was on individual drug-metabolizing enzymes, as I illustrated in my previous example.
However, the sobering fact is that even when we do find an aberration responsible for a given response, it almost never explains one hundred percent of cases, pointing to other contributing factors. With the following example I will show you the necessity of the evolution from pharmacogenetics to pharmacogenomics. This is the warfarin story. Warfarin is the most widely prescribed oral anticoagulant, and it has serious adverse effects, which include hemorrhage and undesired coagulation. Warfarin is predominantly metabolized by the cytochrome P450 family member CYP2C9. It was found that two common polymorphisms are associated with decreased CYP2C9 activity; those polymorphisms reduce the activity of the enzyme down to 12 and 5 percent of wild type, respectively, which is a huge decrease, but the frequency of each of those polymorphisms is only about 10 percent. So obviously this pharmacokinetic genetic variation did not explain most of the variance in response. The target of warfarin itself remained pretty much unknown for quite a long time, until about 2004, when the gene VKORC1 was identified as the target and was cloned and resequenced by many laboratories. In that case, no common non-synonymous coding SNPs were observed, but a series of haplotypes were found that were associated with the final dose of warfarin. This example represents, probably in a simplified form, the type of polygenic model that we expect to observe with increasing frequency in the future. As I said, the focus in pharmacogenomic studies is unfortunately mostly on pharmacokinetic pathways, that is, the drug-metabolizing enzymes and associated pathways, such as CYP2C9 in the case of warfarin, and CYP2D6 as well, and also the transporters that can influence the final concentration of drug that reaches the target.
These are called pharmacokinetic factors. But there are also pharmacodynamic pathways, which include the drug target as well as the signaling cascades downstream of that target; in this case that would be exemplified by the VKORC1 gene for warfarin. Now, cancer, which is the topic of our course, is certainly the ultimate example of a polygenic disease with many genes involved. The complexity of the molecular biology of cancer stems from the fact that, in addition to its polygenic nature, mutations in the broad sense of the word take place at all levels: genomic aberrations and mutations, transcriptional changes, splicing changes, post-transcriptional modifications, epigenetic changes, and changes at the protein level. Modern technologies enable screening for all types of mutations at all these levels, all of which can be used as potential biomarkers. This slide demonstrates how complex a single amplicon in cancer may be. We all know that the most typical genomic aberrations in cancer are amplifications, together with deletions. So what are these amplicons, and what is their structure? This slide shows a frequent, high-level amplification at chromosome 20q13 from breast cancer. As you can see, when this amplicon was sequenced thoroughly, it was shown to be composed of many segments from distant locations in the human genome, concatenated together. You see pieces of chromosome 3 and several pieces of chromosome 20 from different bands of the q arm, all concatenated together. The important outcome of this is an alteration of gene expression. It can result from the juxtaposition of regulatory sequences from one region of the genome with a gene that originates from a completely different region, which may change the expression level of that particular gene.
Or it can create fusion transcripts, which may have altered functional roles. Or it can inactivate a certain gene by creating a breakpoint, say by introducing a premature stop codon and thus eliciting, for example, nonsense-mediated decay of the truncated transcript. This example emphasizes one of the greatest challenges in cancer genomics: reconstructing the cancer genome and transcriptome. This is now becoming possible with the wide application of high-throughput sequencing technology. Gene expression profiles have been widely used for the development of expression-based biomarkers. This is the expression pattern of breast cancers, which you have probably seen in the early days of this course, with a number of distinct subtypes identified, including the basal and luminal subtypes and ERBB2 (HER2). Expression signatures of these subtypes can be used to discriminate subtypes of cancer, which in turn have distinct clinical outcomes. And yes, there are a number of examples of successful application of classification to gene expression profiles across a number of cancer types. But it has also been observed by many groups that integrating the different levels of mutation I mentioned above, such as gene expression and genome copy number, can increase our power in the development of biomarkers. The example here is again breast cancer. In addition to the expression subtypes of breast cancer with different clinical outcomes, shown here in a Kaplan-Meier curve for the expression-based subtypes, including luminal A, luminal B, basal, and ERBB2, it was found that there are also genome copy number based subgroups within luminal A cancers. This is a heat map of copy number changes within luminal cancers, showing "amplifiers" versus "non-amplifiers".
And it was shown that these subgroups also have different clinical outcomes, as shown by the Kaplan-Meier curve here. So it seems beneficial to integrate molecular profiles for biomarker discovery, which is now an actively developing area of bioinformatics research. Now I want to highlight the HER2 and trastuzumab (Herceptin) story for you, because this is probably one of the brightest examples of going from molecular profiling of cancers to clinical applications. ERBB2, or the HER2 receptor, is a cell-surface receptor tyrosine kinase, a member of the ERBB family. It was shown that its overexpression results in activation of intracellular signaling through the RAS-RAF-MAPK-ERK cascade and the PI3-kinase-AKT cascade, promoting cell division and proliferation and inhibiting apoptosis. In 1987, Slamon published a study in Science on HER2, where the authors showed that this gene was amplified, up to 20 copies, in 25 to 30 percent of breast cancers, and that this was associated with shorter survival and relapse times. Soon after that, in 1990, Genentech developed a humanized monoclonal antibody against the HER2 receptor; it was incredible work, accomplished within a year, and a really outstanding achievement. Genentech then tested this antibody in a cohort of patients with no regard to HER2 status, and found that it was effective in only a few percent of patients. It then became clear that an antibody against HER2 should be more effective in patients with HER2 amplification, the so-called HER2-positive patients. After this realization, clinical trials commenced in 1992, and now the standard of care is a combination of testing for HER2 status and application of Herceptin in combination with other drugs. This graph shows the efficacy of Herceptin alone, in yellow, and in combination with other drugs; these are statistics from a number of clinical trials over the years.
What you can see, with the response rate on the y-axis, is that the response rate for Herceptin alone was relatively low, even for HER2-positive cancers. But in combination therapy the rate increased significantly; these are the blue bars. Response rates as high as 69 percent were achieved, which was a significant achievement and a huge success. Now, the number of studies devoted to the development of cancer biomarkers that we see published every day is really daunting. Over the last decades, hundreds and hundreds of studies have been published, reporting a wide range of promising biomarkers with reasonably high accuracy, nice sensitivity, and so on. So how well can we actually manage cancer today? I think it is very instructive to look at the statistics on cancer incidence and mortality. This data comes from the Canadian Cancer Society; in particular, these are the statistics for 2009. In this slide you see the dynamics of the incidence rate on the left and the mortality rate for individual cancer types in men. I want to highlight two examples for you: prostate and lung cancer. For prostate cancer, with the advent of PSA screening in 1993, more and more early-stage cancers could be detected, which resulted in this spike in the incidence rate. It was then rapidly realized that a higher PSA does not always translate into malignant disease, hence the slight decline in the incidence rate over the next couple of years. However, the mortality did not change much; it's shown right here, from 1980 to 2009. Mortality is not really changing, and the advent of the PSA test didn't really change the situation much. The reason is that we are still not doing a great job of detecting the really aggressive cancers that are lethal. Lung cancer is another example, and a particularly curious one, I think.
You see that both incidence and mortality have been going down over these years, and this is most likely attributable to the recognition of risk factors such as smoking, which raised awareness among men, who simply stopped smoking. That's why we see the decline for lung cancer in particular. Now, this is the situation in women; I will just highlight a couple of cancer types, breast and lung. The introduction of mammography was a milestone for breast cancer management, providing a means to detect cancers at an earlier stage, which is far more curable than late-stage disease. And indeed the mortality rate went down a bit, but breast cancer is still really far from being eradicated. Again, this is because we are still missing a great proportion of highly aggressive cancers; we do not have robust markers for aggressive cancers that we can apply in the clinic to identify patients with particularly poor outcomes. And again a curious statistic: unlike in men, lung cancer incidence and mortality in women are both increasing over the years, and quite significantly. This can largely be explained by a social factor: smoking among women has been on the rise since the 1980s, and despite all the awareness of smoking as a risk factor, many women still choose to smoke. Now, this is the summarized statistics for men and women for all cancers. On top you see the incidence rate, and at the bottom the mortality rate. The very top curve is the actual curve, but we need to adjust for the aging population and for population growth; the lowest curve is the resulting corrected curve, which shows us how we are doing with the management of cancer.
What you can see here for men is a slight increase in incidence and a minor decrease in mortality, and again this is most likely attributable to both prostate cancer and lung cancer, which I illustrated in my earlier slides. For women, incidence has gone up over the years, and mortality has stayed about the same. So obviously we still have not conquered cancer, and yes, there is a keen need for novel prognostic and therapeutic biomarkers, which I will be covering in the next section. That's why we have this course. So now we are moving to the next part of Module 8 Part 1, which is devoted to the development of biomarkers using molecular profiles. There are three main statistical problems in tumor classification. The first is the identification of new or unknown classes using molecular profiles; this is called unsupervised learning. The second is classification into known classes; this is discriminant analysis, or supervised learning. The third task is the identification of markers, or genes, that characterize classes; this is variable selection. Just to give you an example for the first bullet, unsupervised learning: you are given a set of samples with molecular characterization, say gene expression profiles measured with microarrays, and the goal is to find possible molecular subgroups that can be characterized by specific molecular signatures of gene expression. The usual way to go would be to perform unsupervised hierarchical clustering, which you have probably heard about during this course, find novel subgroups, and then proceed with further downstream analysis of some kind. That would be unsupervised learning.
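To make the unsupervised case a little more concrete, here is a minimal sketch, not any particular published method, of single-linkage agglomerative clustering of sample profiles into k candidate subgroups; all expression values below are invented toy data.

```python
# Toy sketch of unsupervised class discovery: single-linkage
# agglomerative clustering of expression profiles down to k clusters.
# All data are invented for illustration.

def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster(profiles, k):
    """Merge the two closest clusters until only k remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

samples = [[1.0, 5.0], [1.1, 4.9],    # putative subgroup A
           [5.0, 1.0], [5.1, 1.2]]    # putative subgroup B
print(cluster(samples, 2))  # → [[0, 1], [2, 3]]
```

In practice one would use a dedicated library and a correlation-based distance on thousands of genes, but the idea of iteratively merging the most similar profiles is the same.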
What we will concentrate on in this part of the module is classification into known classes and the identification of markers that characterize those classes. For example, you are given a group of patients who respond to a given therapy and another group of patients who do not respond, and we need to find markers of response to this therapy. That is, we need to do a good job of discriminating these two subgroups of patients using their molecular profiles. This is called discriminant analysis, or supervised learning. Then we need to find the features, or genes, that best characterize each class; that is variable selection, or building our predictive marker, or building the classifier. Once we have this, we can apply the biomarkers built on our given groups of patients to an independent group of patients and try to predict their response without knowing the actual response. That will be the focus of this part of the module. So what are discrimination and prediction, and how are they done? Basically, we start with expression profiles for a group of samples, or patients, as you can see here: some of them exhibit a poor response to the drug, and another group of patients have a good response to the given drug. Using these expression profiles, we build a classifier, which is a set of genes whose expression pattern describes these two groups best. The expression is summarized in a simplified, almost black-and-white way: say, these genes are upregulated and these genes are downregulated. Note again that the classifier is not a single gene; it is a set of genes. This classifier is then applied to a new patient, for whom we can predict the most likely response to the same drug. The first part of this procedure is called discrimination, and the second part is called prediction.
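As a concrete sketch of discrimination followed by prediction, here is a hedged toy example using a nearest-centroid rule, one of the simplest classifiers of this kind; the gene values and group labels are invented for illustration.

```python
# Sketch of discrimination ("learn a classifier from labeled profiles")
# and prediction ("apply it to a new patient"), using a minimal
# nearest-centroid classifier on invented toy data.

def train_centroids(profiles, labels):
    """Discrimination: average the expression profile of each class."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {
        label: [sum(vals) / len(vals) for vals in zip(*members)]
        for label, members in groups.items()
    }

def predict(centroids, profile):
    """Prediction: assign a new patient to the closest class centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], profile))

# Toy example: 3 genes, two response groups.
train_x = [[5.0, 1.0, 2.0], [4.8, 1.2, 2.1],   # good responders
           [1.0, 5.0, 2.0], [1.2, 4.9, 1.9]]   # poor responders
train_y = ["good", "good", "poor", "poor"]
model = train_centroids(train_x, train_y)
print(predict(model, [4.5, 1.5, 2.0]))  # → good
```

The centroid pair plays the role of the "set of genes with characteristic expression patterns" described above; real classifiers are more elaborate, but the two-step structure is the same.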
To build a classifier, we need a so-called classification rule, which is composed of two components: feature selection and a classification method. There is a great number of different classification algorithms out there, probably about a hundred of them. The ones I'm showing on this slide are the most commonly used. I will not describe them in detail; I will show you a comparison instead, and if you want to understand the details behind these algorithms, I will refer you at the end to a very nice, detailed paper that I suggest you read if you plan to move into this research direction. The common algorithms include linear discriminant analysis, commonly referred to as LDA; maximum likelihood discriminant rules; nearest neighbor classifiers (k-NN), which are very popular; classification trees (CART); and aggregating classifiers, which involve bagging and boosting approaches. At the end, in blue, you see neural networks and support vector machines, which differ somewhat from the rest in their complexity: these are far more computationally complex approaches that require more extensive bioinformatics training. People mostly use the simpler methods shown in black, with k-NN being one of the most popular, although support vector machines are also widely used. So, we need to build a classifier. How do we do this? We take molecular profiling data, and we need to reduce the dimensionality, because we have some 35,000 genes. There is a specific strategy for this step, which I will show you. Whatever you end up with is your data matrix, and then you need an algorithm that takes the data and builds a classifier for you. The methods listed down there are the classification methods, which are applied to the selected features.
Together, they comprise a classification rule; this is a term you will be hearing and seeing in the literature. [Audience question about whether filtering out noise is the same as feature selection.] Yes, filtering the noise and so on is done as part of feature selection. Unfortunately, there is no common set of terms in the field, and different terms are used, so it's good that you ask this question so you can put it in the context of the material you have already learned. There was a very nice paper published in 2002 that compared a number of discrimination methods for the classification of tumors using gene expression data, and I suggest you read through it if you want to familiarize yourself with the details of these algorithms. What I think is important is the message of the paper: k-nearest neighbors and the LDA approach had the smallest error rates, and these are the simpler methods. In addition, aggregation improved the performance of classification trees (CART). These findings pertain to the comparison of different methods, and before you move into this research direction, you should read through the paper. Now I will describe the first component of a classification rule: feature selection. The classification algorithm should decide how well a feature discriminates between classes, and the goal of feature selection is to reduce noise. There are three main classes of feature selection strategies. The first is the filter method, where one usually scores features (genes) independently, ranks them, and selects the top n. The score may be a between-group to within-group variation measure, a t-statistic, or a p-value.
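As an illustration of the filter approach, here is a hedged sketch on invented toy data; the two-sample t-statistic used here is just one of the possible scores mentioned above. Each gene is scored independently, ranked, and the top k are kept.

```python
import math

# Filter-style feature selection: score each gene independently with a
# two-sample t-statistic, rank, and keep the top k. Toy data only.

def t_score(a, b):
    """Absolute Welch-style t-statistic for one gene across two groups."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def filter_select(expr_a, expr_b, k):
    """expr_a/expr_b: per-group expression, one row per gene.
    Returns the indices of the k highest-scoring genes."""
    scores = [(t_score(ga, gb), i)
              for i, (ga, gb) in enumerate(zip(expr_a, expr_b))]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# 4 genes x 3 samples per group; gene 1 separates the groups best.
group1 = [[2.0, 2.1, 1.9], [5.0, 5.2, 4.8], [1.0, 1.1, 0.9], [3.0, 3.3, 2.7]]
group2 = [[2.1, 2.0, 2.2], [1.0, 1.1, 0.9], [1.0, 0.9, 1.1], [3.1, 2.9, 3.0]]
print(filter_select(group1, group2, 2))  # → [1, 0]
```

Note that this scores every gene in isolation, which is exactly what gives rise to the redundancy and ignored-interaction problems discussed next.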
So again, you start with your two groups of patients and score each individual gene on your list. But there are a number of problems associated with this type of feature selection. Problem one is redundancy, because features are selected independently, without assessing whether they contribute new information. Problem two is that interactions between genes are not considered. And yet another problem is that the classification method itself does not take part in feature selection, whereas it should. Wrapper methods are another class. This is an iterative approach: the classification performance of many subsets of features is assessed, and the best-performing subsets are selected. The downside is that wrapper methods are computationally intensive and easy to overfit. [In response to a question:] Yes, that's correct. After we reduce the technical noise, we still end up with quite a big number of genes, and certainly not all of them are associated with any given class; basically, we want to remove those that are not associated. That is the kind of noise I refer to; very good point, thank you. The last class is the embedded methods, which perform classification and feature selection simultaneously, and they certainly improve classification accuracy; examples of such methods include k-NN and LDA. Now, this slide in part answers your question, Michelle. It illustrates the filter method of feature selection. What you see is a correlation matrix for a cohort of samples, which are represented by three groups, one, two, three here and one, two, three there. On the left, you see the correlation using all of the genes, in that case 3,500 genes, and as you can see, the correlation is pretty low. But if we filter out genes with low differential expression between groups, then, as one may expect, we get better correlation.
That's the graph on the right, and this is the subset of genes that we want to select for building our classifier: they describe the groups best, they are associated with the characteristics of these groups of samples, and they are the most informative in describing these groups. So what is the purpose of the feature selection component? To improve classification performance by removing genes that are not associated with the outcome; in this case, this is the noise, unfortunately the same term. The filtering step may also lead to useful insights into the biology of the disease: genes differentially expressed between groups may point to the pathways and biological processes involved. It can also lead to diagnostic tests, such as a breast cancer chip containing genes that are, in the ideal case, significantly differentially expressed between, say, aggressive versus non-aggressive cancers. Now, this slide shows the basic components of the procedure for building and evaluating a classifier. We start with a learning, or training, set of samples with molecular profiles, and within that learning set we have, say, two subgroups of patients: responders versus non-responders to a given therapy. Using this training set and applying a classification rule, we end up with a classifier, which then has to be tested on an independent test set, a completely separate set of samples with similar molecular profiles, ideally from the same platform. We evaluate the performance of the classifier on the independent test set, and we do the same for the learning set. From the learning set we get an error rate, and from testing the classifier on the independent set we also get an error rate, and both feed into the performance assessment. This is the ideal case, where we have an independent learning (training) set and an independent test set.
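This build-and-evaluate loop can be sketched as a hedged toy example; the one-dimensional nearest-class-mean rule and all data are invented. The point is the separation of the two error rates: the resubstitution error, measured on the learning set, and the test-set error, measured on held-out samples.

```python
# Toy sketch of classifier evaluation: resubstitution error (on the
# training set) versus test-set error. The 1-D "nearest class mean"
# rule and all data are invented for illustration.

def train(xs, ys):
    """Return the mean expression value of each class."""
    means = {}
    for x, y in zip(xs, ys):
        means.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in means.items()}

def predict(means, x):
    """Assign a sample to the class with the nearest mean."""
    return min(means, key=lambda y: abs(means[y] - x))

def error_rate(means, xs, ys):
    """Fraction of samples whose predicted class is wrong."""
    wrong = sum(predict(means, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)

train_x = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
train_y = ["resp", "resp", "resp", "nonresp", "nonresp", "nonresp"]
test_x = [1.1, 2.0, 3.2, 0.8]
test_y = ["resp", "nonresp", "nonresp", "resp"]

model = train(train_x, train_y)
print(error_rate(model, train_x, train_y))  # resubstitution error
print(error_rate(model, test_x, test_y))    # test-set error
```

Here the resubstitution error comes out lower than the test-set error, which is the typical, optimistically biased pattern that motivates keeping an independent test set (or cross-validation) in the first place.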
But unfortunately, independent sets of samples are rarely available, and almost never at the time of the current study. So what do we do in such circumstances? Far more often, a researcher is presented with a single set, which we have to split in a certain proportion into a training set and a test set. In this case, the test set will not be ideally independent, although it will be a sort of surrogate for an independent test set. And then we go through the same procedure of building a classifier using the learning set, and then evaluating its performance on the test set that we set aside. And then we get error rates from here and evaluate the performance. Now the downside of this scenario, unfortunately, is that we end up with a reduction of effective sample size. For example, you have a set of some 50 samples, and you are forced to split them into a test set and a training set. So you are effectively reducing this to, say, 25 here and 25 there, or 30 here and 20 there. This is a downside, but it happens quite often. Now, an essential part of the classification procedure is performance assessment. It consists of the following components. How accurate is the classifier? For that purpose, we use a confusion matrix and a number of metrics that can be derived from it, including accuracy. Another question is how well the classifier worked on the learning set; that would be the resubstitution error rate that I showed you on the diagram on the previous slide. And how well did the classifier work on the test set? That will be the test set error rate. Other very important components are cross validation, which I will describe in a minute, and how different classifiers compare to each other; for that, we would be using ROC curves. So now, what is the confusion matrix? The confusion matrix is a visualization tool, typically used in supervised learning, such as in our case.
Columns represent the instances in the actual class here. So this is our actual class, and this is our predicted class; these are the rows. Just to tell you upfront, one benefit of the confusion matrix is that it is very easy to see if the system, our classifier, is confusing two classes. That's the origin of the name: the confusion matrix shows where the classifier confuses, or incorrectly labels, samples of one class as the other class. Now, the matrix has dimensions n by n, n being the number of groups. This is the simplest case, where we have, for example, a group of patients that were evaluated by a pathologist with regard to the presence or absence of cancer. So within this group, which will be, say, our gold standard diagnostic technique, we have this number of positive patients and this number of negative patients. Then we apply our biomarker to the same cohort of patients, and we get this number of positives and this number of negatives according to our classifier, our test. And then we see what the overlapping numbers are between the gold standard, the actual class, and the predicted class. So these would be true positives: patients who were correctly diagnosed as positive. And these would be false positives, because they were predicted as positive by our test, but in reality they are negative. And likewise, true negatives were correctly predicted as negative, and false negatives were incorrectly predicted as negative while actually being positive. So now, how do we use this confusion matrix? A number of statistics can be derived from this matrix, and there are at least four statistics that I will refer to; they are shown in blue here. Just to tell you upfront, it is actually quite simple to derive these statistics from the table. They are simply ratios of these red cells: a fraction of the number of true positives here, for example, over the sum of this column.
The same for the specificity: specificity would be the fraction of true negatives over the sum of this column. Likewise for the positive predictive value and negative predictive value: the negative predictive value would be calculated as the fraction of true negatives over the sum of this row, and the positive predictive value as the fraction of true positives over the sum of this row. Is it clear now? Yeah, so actually, the confusion matrix is quite handy in this regard. You do not need to remember the formulas for these metrics; you can just construct the confusion matrix and easily derive your metrics from it. Now, a very important metric is the accuracy, defined by this formula. It can be defined through true positives and true negatives, or through sensitivity and specificity. Just to mention that the positive predictive value and negative predictive value are clinically relevant metrics, and they are more valuable in evaluating the overall performance of classifiers than sensitivity and specificity. So now this slide shows you an example with actual numbers. In this case, we are applying a new test, the FOB test, to a cohort of patients with bowel cancer who were diagnosed with endoscopy. So here you see the number of positives; the negatives are not shown here, but this is the overlap. The number of true positives is this, and the false negatives this one. So the sensitivity would be true positives divided by the sum of true positives and false negatives, and this would be 66%. Specificity, similarly, is true negatives divided by the sum of false positives and true negatives, and this is 91%. And similarly, here the positive predictive value is this one, and the negative predictive value is this one, 99%. So now, what are the misclassification error rates that I showed you on the diagram before, the error rates that are used to evaluate the performance of a classifier? It is as simple as one minus accuracy for any given confusion matrix.
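As a sketch, all of these metrics can be read straight off the confusion matrix cells. The counts below are not from the slide (those were not reproduced here); they are invented so that the sensitivity, specificity, and NPV come out close to the 66%, 91%, and 99% quoted in the bowel cancer example:

```python
# Minimal sketch of deriving the four metrics, plus accuracy, from a
# 2x2 confusion matrix. The counts are invented for illustration.
def metrics(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "ppv": tp / (tp + fp),                       # positive predictive value
        "npv": tn / (tn + fn),                       # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

m = metrics(tp=20, fp=180, fn=10, tn=1820)
print(round(m["sensitivity"], 2), round(m["specificity"], 2),
      round(m["npv"], 3))
```

Note how a test can have high sensitivity and NPV yet a very low PPV (here 10%) when the disease is rare in the cohort, which is exactly why the predictive values are the clinically relevant metrics.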
So you build a classifier, and then you apply the classifier to the learning set, the same learning set, and you get a confusion matrix. Out of that, you can calculate accuracy, and one minus accuracy would be our resubstitution error rate that I showed you before. And then when the classifier is applied to our test set, again we get a confusion matrix, from which we get an accuracy, and one minus accuracy would be our test set error rate. Now, I will tell you about cross validation, which is an important internal validation of a classifier, and it is performed on the training set. So there is V-fold cross validation, which is the general type of cross validation. That is when cases in a learning set are randomly divided into V subsets of nearly equal size. Then one subset is left out, and it will constitute a surrogate test set for us, while the rest are used as a learning set for building the classifier. Then we evaluate the performance of the classifier on the one subset that was left out and compute the test set error rate. And then we repeat this exercise over all V subsets and average the error rate. This is the essence of cross validation. Now, leave-one-out cross validation is a special case of V-fold cross validation where V is equal to N, and N is the number of samples in our set. So basically, at each given iteration, we leave one sample out, we use the rest for building the classifier, and then we apply the classifier to the one that was left out. And we reiterate this process N times. So this is the essence. Now, there is one note that I want to make: there is a bias-variance trade-off. A smaller V can give larger bias but smaller variance, and it becomes computationally intensive if you increase V. So is this part clear to you? The other part is clear for you, right? So leave-one-out cross validation is a special case of V-fold cross validation where V is as large as possible, equal to the number of samples, so each left-out subset is basically composed of a single sample.
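The leave-one-out scheme can be sketched as follows, with a toy nearest-centroid classifier on a single invented gene, just to make the loop structure concrete (the classifier, data, and gene are all hypothetical placeholders):

```python
# Sketch of leave-one-out cross-validation with a toy nearest-centroid
# classifier; the single-gene expression values are invented.
def nearest_centroid_predict(train_x, train_y, x):
    centroids = {}
    for cls in set(train_y):
        vals = [v for v, y in zip(train_x, train_y) if y == cls]
        centroids[cls] = sum(vals) / len(vals)
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def loocv_error(xs, ys):
    errors = 0
    for i in range(len(xs)):                 # leave sample i out
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)                  # averaged error rate

xs = [1.0, 1.1, 0.9, 1.2, 5.0, 5.2, 4.9, 5.1]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
print(loocv_error(xs, ys))                   # → 0.0 (classes well separated)
```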
So you have 100 samples. You take one out, you build a classifier on the 99 of them, and then you evaluate it on the hundredth one that you left out. And then you reiterate this process: you leave out each and every sample in its own iteration, so you do it 100 times. And then you average the performance of the classifier over these 100 iterations. This is called leave-one-out cross validation, which is quite intuitively named. So now this slide summarizes all of the steps that I just described to you. This is a more detailed strategy for development of a classifier. We have our training set, we build a classifier, and we do cross validation here using our training set. Then we evaluate the performance of the classifier on the independent test set. And all the error rates that come from these three directions, shown here with arrows, define our performance assessment. And certainly I have to note that the resubstitution error rate is a bit too optimistic, because you are testing your classifier on the same training set that you built it on. So the more realistic one would be the test set error rate here, and the cross validation error rate is also relevant. Usually, when you read the literature, you will see both the cross validation and the test set error rates reported, which are the most relevant. Another comment here: both learning and test sets have to be identically distributed with regard to the representation of tumors, that is, the distribution of characteristics of the patients. So if, for instance, in your learning set you have, say, a ratio of responders versus non-responders of one to three, it would be good to have the same ratio in the independent test set. That is very often hard to achieve, but it is possible.
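One simple way of producing such identically distributed splits is a stratified split: divide each class separately, so the class ratio carries over into both sets. The sample IDs, labels, and split fraction below are invented for illustration:

```python
# Sketch of splitting one cohort into training and test sets while keeping
# the class ratio (e.g. responders vs non-responders) similar in both.
import random

def stratified_split(samples, labels, test_fraction, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):                  # split each class on its own
        members = [s for s, y in zip(samples, labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

samples = [f"patient_{i}" for i in range(40)]
labels = [0] * 30 + [1] * 10                 # 3:1 non-responders to responders
train, test = stratified_split(samples, labels, test_fraction=0.25)
print(len(train), len(test))                 # → 30 10
```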
And then other factors, such as age and race and the status of certain hormones that can be important for any particular disease, or the mutational status of certain genes, certainly have to be identically represented in both learning and test sets. And that's a very important point, because if these conditions are not met, then the classifier evaluation will be inadequate. And we certainly do not want that if the classifier is to be moved further to clinical applications, trials, and so on and so forth. So now I will describe ROC curves. I don't know, have you been introduced to ROC curves during the course before? No? Okay. So the ROC, or receiver operating characteristic, curves are commonly used to evaluate the performance of a classification method in comparison with other existing methods. This originates from signal detection theory, and the name comes from the receiver operating characteristic; we call them just ROC curves. So what are ROC curves? A ROC curve is a graphical plot of the sensitivity, or true positive rate, versus the false positive rate, or one minus specificity, for a binary classifier system as its discrimination threshold is varied. Now I will explain what this means. The classification model, classifier, or diagnostic test is basically a decision of which class to put any given patient into: whether the patient responds to therapy or doesn't respond, or whether this is a tumor or benign tissue, for example. Now, the classifier or diagnostic result can be a real-valued continuous output, such as a gene expression value, in which case the classifier boundary between classes must be determined by a threshold value. For instance, to determine whether a tumor has expression of the corresponding gene, and where to put the threshold for calling the gene, say, overexpressed or normally expressed. So this threshold can vary.
Now a ROC space, for a ROC curve here, is defined by the false positive rate and true positive rate as the x and y axes. So this is the false positive rate here and the true positive rate there. This is called the ROC space. And the diagonal line is the no-discrimination line. Everything above it is a classification method that works; everything that goes below the no-discrimination line means the method does not discriminate the classes. Now the point of interest is this one, which would be perfect classification, meaning basically a 100% true positive rate and virtually 0% false positive rate. And now this figure shows you actual data comparing a number of algorithms, and you can see that this is the best performing classification method compared to the rest. Now, how do we use this ROC curve? The purpose is to find, first of all, the best threshold for discrimination, say, for example, the value of our gene expression, our classifier. And the second is to compare the performance of different classifiers. So now that I have described how we compare the performance of different classifiers, I will elaborate on the threshold with the next slide. Another point to make here is that the usual summary of the ROC curve that people often use is the area under the curve, which is this area. Basically, the larger the area, the better, and the maximum is one. That's what you will commonly see in the literature: the comparison of areas under the curve. Some folks prefer to use the area above the curve, where the less, the better, but the most common is the area under the curve. So if you're using the area under the curve, I guess that's good as a general measure. But if, say, one of the curves crosses over another, a method might not actually perform better depending on how you choose your threshold. Yes, yes, yes.
And so then the threshold is also an important thing to select, because by constructing a ROC curve, you can choose a threshold that will give you the best combination of true positive rate and false positive rate. In some applications, you may care most about the false positive rate, for instance: you would like to minimize it, with the trade-off of losing some of the true positive rate. I will show you how this is done on the next slide. So now this slide shows you how ROC curves are constructed. Basically, we have two classes, two groups of patients, responders versus non-responders, for example, and they have distributions of a classifier value, in this case gene expression, for example. I apologize, this is an additional slide that I put in just recently, because I thought that a single slide on the ROC curve does not really explain how ROC curves are interpreted. So I added this slide today; my apologies, it will be posted on the site of the course, so you will be able to access it. This is just a simple cartoon that explains the use of a threshold in the process of constructing a ROC curve. So we have two classes, and they have distributions of the classifier value, for example gene expression, which is shown here. We have one class here and one class there, and they certainly may overlap. Now, this is the data given to us, which does not change within a given study, right? So now it is important to find a threshold for the measured classifier value to call a sample either class one or class two. So where do we put this threshold? Do we put it here? Do we put it there? Or there? It's really hard to decide. So what we do is put a threshold here, for instance, which gives us this number of true negatives, this number of true positives, false positives, and false negatives.
From this iteration, we construct the confusion matrix with our numbers there. Then we can compute our false positive rate and true positive rate out of the confusion matrix, and we get a single data point on the ROC curve. Then we go back here and shift the threshold to either side; basically, you can sweep all the way from here to there, depending on how comprehensive you want to be. So you move the threshold, and then you get different numbers of true negatives and true positives, you construct another confusion matrix, you derive another pair of false positive and true positive rates, and you get another point on the ROC curve. And you iterate this process many, many times, until you get a ROC curve of the desired resolution. And then you will understand how you can use the ROC curve here to decide what false positive rate you can tolerate and what true positive rate you want to get. Is it clear now how the ROC curves are constructed? Any questions? Just disregard them. Yeah, disregard them. No, no, no, it was just an annotation for another figure, so just disregard this. The main point from here is the discrimination line and the important points here. Yeah, so you have a classifier at a given setting, and you described an iterative process to get the other points; how do you figure out the other classifiers? Let me show you this, here. So, when you build a classifier, you basically binarize a value for the classifier. You decide: okay, I'm going to apply this threshold, and I will call this gene upregulated if it's above the threshold, or downregulated if it's below the threshold. But at the same time, these two groups, this and that group, have variation in the expression of those genes, which results in that distribution that I showed you on the explanatory slide, here.
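The threshold sweep just described can be sketched as follows. Each threshold yields one (false positive rate, true positive rate) point, and the area under the resulting curve is accumulated with the trapezoid rule. The classifier scores and labels are invented for illustration:

```python
# Sketch of constructing a ROC curve by sweeping the decision threshold
# over a classifier score (here, invented gene-expression-like values),
# then computing the trapezoidal area under the curve.
def roc_points(scores, labels):
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]
    pos = sum(labels)
    neg = len(labels) - pos
    for t in thresholds:                      # one confusion matrix per threshold
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))   # (FPR, TPR) for this threshold
    return points

def auc(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2     # trapezoid rule
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   1,    0,   0,   0]
pts = roc_points(scores, labels)
print(round(auc(pts), 3))                     # → 0.875
```

With the curve in hand, picking an operating threshold amounts to picking the point whose trade-off between false positive rate and true positive rate you can tolerate.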
Okay, so originally, you start by setting a threshold somewhere, you build a classifier, and you get your accuracy and classification rate with it. But then you construct a ROC curve. So first, you can compare your method with another method and see how the overall curve behaves; this is one purpose of a ROC curve. And then, once you're satisfied with your method, you can use your ROC curve to choose the best threshold for your classifier: where to put the threshold. When we're talking about gene expression, that's a threshold of, say, overexpression: the gene should be expressed at this level in normal tissue, and it should be above the threshold in cancerous tissue. So where do we put this threshold? I can see that working for continuous measurements. But let's say the samples are just classified into, for example, cancerous or not, so the classification model gives a binary output; there is no such threshold in that configuration. Are you repeating your algorithm again? Yes. Yes. And you are constructing your confusion matrix. You're changing the expression of your genes in your classifier. You're changing, I'm sorry, not the expression, I'm sorry. So again, you have your group one of patients, which are, say, normal, and then you have your cancer samples. And this is, say, the distribution of expression of your classifier gene. Certainly you don't have all the samples within group one showing exactly the same value; normally they show a distribution, and the same for group two. Sometimes these two distributions overlap; this is very often the case. And basically the amount of overlap determines the shape of the curve. So then it is a question: where do I put this threshold?
If I put it here, then, most importantly, if I change this threshold, it will change my false positive rate and true positive rate, because they come from this part, see? If you have multiple genes in the classifier, you summarize the classifier output in some way, and then you use that to build the ROC curve. So you basically consider the distribution of the summarized gene expression signal over all of the members of your classifier. Okay, so now, if you plan to move into a classification and discrimination research project, I think it would be helpful to read these landmark papers that were published over the last decade. These are more or less correctly designed studies, they concern different cancer types, and as you see, they use slightly different methods. So you can familiarize yourself with what kinds of research strategies one may take with regard to classification and discrimination questions. These are the papers that I highly recommend you to look through. So this course is devoted to cancer genomics, and a big component of it is the development of biomarkers of different types: genome copy number, gene expression, and so on and so forth. You have heard about classification of cancers into subgroups using all the aforementioned molecular profiles, which is probably the most common research question in this field. Now, what I want to do is introduce you to another, equally important research question, which has been receiving more and more attention recently, and which is certainly an important part of translational research in conjunction with clinical trials. And that is predicting the response to therapy, or drug response, as it is commonly called. This often involves the same classification problem as I described before, such as whether the patient will respond or not to a given therapy.
But in reality, there are a number of subtleties in this kind of research study that make this field a special case. So this slide summarizes the idea behind this research question. When a drug is applied to a cohort of patients, what we normally see is a spectrum of response. The spectrum is composed of response values as measured by the response evaluation criteria, which can be binarized, as shown here. So instead of the spectrum of values, we get a binarization into resistant tumors, say, versus tumors responsive to a given therapy. And this is how it is most often done in research. Then application of the classification rule gives markers of response, as shown here, that are further tested on the independent test set of patients, shown here. So now, this can be a classification problem, when you work with the binarized data and the prediction is also binary: whether the patient responds or does not respond to a given therapy. But it can also be a regression problem, where we try to predict the actual amount of response, the response value. That does not include the binarization step and deals with the spectrum of response values directly. Now, this diagram shows again the ideal scenario, as I mentioned, where we have two independent sets of patient samples: one is used for training the classifier, which is then tested on the independent set of patients. This is certainly an ideal scenario, which doesn't happen that often, especially for drug response studies. So what can happen is that for building the predictor, we may have patient samples, and for evaluating the performance of our predictors, we may also have patient samples; that is the ideal situation. Alternatively, we can build our classifiers using cell line profiles representing a given cancer type.
And then we can use patient samples to evaluate the performance of the predictive markers. Very often, people are given mixed cancer types: as their training set, they build a classifier on a mixed cancer-type cohort, and then they test the performance of the classifier on a single cancer type. That is unfortunately the reality; that is what people have in their hands. Then another scenario that happens often is that the classifier is built on a data set of patients treated with combination therapy, and then they try to evaluate the predictive performance of the markers on patients treated with monotherapy, or vice versa: build the classifiers on patients treated with monotherapy and then evaluate the performance on patients treated with combination therapy. And this brings in a lot of complications and challenges for this type of research study. Also, there is another complication: when it comes to the response, one may use different metrics. For cell lines, it can be, for example, the GI50 or IC50, then TGI or LC50; these are usually concentrations of the drug that result in, say, 50% inhibition of growth of cancer cells, or reduce the activity of a target by 50%, for instance. This is what is commonly used for cell lines, which are a common model system for preclinical evaluation of a given drug. Now, when it comes to patients, the criteria can also be quite different. It can be time to progression of the disease. It can be overall survival. It can be an arbitrary assignment based on post-treatment tumor volume. It can be pathological complete response, residual disease (whether it exists or not), complete response, or even incomplete response. If you're curious, the definitions of these are given in this table. So obviously, different metrics of response can be used, in patients in particular.
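As a sketch of how a GI50-type value might be read off a measured dose-response curve, here is a simple log-linear interpolation. The concentrations, inhibition values, and the interpolation approach are illustrative assumptions, not the actual screening protocol:

```python
# Sketch of estimating a GI50-style value (drug concentration giving 50%
# growth inhibition) by log-linear interpolation of a dose-response curve.
# Concentrations and inhibition percentages are invented for illustration.
import math

def gi50(concs, inhibition, target=50.0):
    """concs: ascending drug concentrations; inhibition: % growth inhibition."""
    for (c0, i0), (c1, i1) in zip(zip(concs, inhibition),
                                  zip(concs[1:], inhibition[1:])):
        if i0 <= target <= i1:
            # interpolate on log10(concentration), since dose responses
            # are conventionally plotted on a log axis
            frac = (target - i0) / (i1 - i0)
            logc = math.log10(c0) + frac * (math.log10(c1) - math.log10(c0))
            return 10 ** logc
    return None                                # 50% inhibition never reached

concs = [0.01, 0.1, 1.0, 10.0, 100.0]          # micromolar
inhibition = [5.0, 20.0, 40.0, 70.0, 95.0]     # percent growth inhibition
print(round(gi50(concs, inhibition), 2))       # → 2.15
```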
And it's important, first of all, to choose the most appropriate one for any given drug or patient cohort; very often it is not the same metric that can be used throughout many studies. And then what's also important is to be consistent throughout individual sets and across studies when one is to compare the performance of a classifier. So for instance, if somebody uses a patient cohort to build a classifier and they use time to progression as the evaluation of response, then when they test the classifier on some other patient cohort published by someone else, it is important to use the same metric. That's very important. So the major note from here is that what matters for the performance evaluation of predictive markers, especially when one goes across studies, is to be consistent in the first place. Now, cell lines are widely used for preclinical studies of drug response, and they are an invaluable resource. There are certainly pros and cons related to cell lines. The pros include the fact that cell lines are more readily available than patient cohorts. They are pretty easy to manipulate, and they provide an essentially unlimited supply of material. Very often cell lines are quite thoroughly characterized models, through molecular profiling and other aspects. Then, cell lines enable prediction of response to novel therapies, single or combination, because we have only a limited number of patient cohorts, in clinical trials for instance, that are treated with certain drugs. So if one is to test the response to a new drug, then we certainly need to resort to our model system, the cell lines. Cell lines also allow high-throughput screening of thousands of new drugs, which is what the NCI-60 panel was created for; I will mention it a bit later.
They also enable identification of new uses of established agents, such as going across cancer types and testing whether a drug approved for one cancer type will still work for another. And they certainly provide a way to inform us about possible mechanisms of drug action: we can manipulate genes that we find to be associated with drug response, and we can investigate the functional role of those genes and their effect on the drug response itself. But cell lines are also associated with a number of cons. These include the fact that cell lines represent a small and highly selected minority of a heterogeneous tumor population, and they are not available for all cancers. Then, cell lines are specially selected and adapted for cell culture conditions, and may not respond to drugs the way cancer cells growing in a human host do. Just to give you an example: we do see differences in expression of transcripts in cells grown in 2D versus 3D, not to mention the microenvironment of tissues as it exists in the human body. So this is one of the very important drawbacks of cell lines. Then, drug exposure in vitro does not mimic the kinetics of drug exposure in human tumors, which are influenced by interstitial pressure, blood flow, drug metabolism rate, and other host factors. We are not addressing these questions here. And then, validation of signatures in patients is difficult, because most clinical efficacy trials are carried out with drug combinations, and it is not always possible to find ready-to-use data for cell lines that have been treated with that particular combination of drugs; usually it is a screen of a single drug agent. So now, this slide shows a model system for screening of drug response. This is the NCI-60 panel. I'm sure you've heard quite a lot about it.
I will just tell you that this is under the Developmental Therapeutics Program at NCI, which has established a collection of 60 cell lines representing a handful of different cancer types. This initiative started somewhere around 1990 and has been widely used as a resource for high-throughput screening of multiple drugs. The capacity is such that it is possible to screen up to 3000 drugs a month with this platform. And at this point, there is very rich data available on this site, which includes more than 100,000 individual drugs screened with this panel, available as a public resource for everyone to explore. So now, various gene signatures and sequence alterations in target genes have been obtained for prediction of drug response in patients, and there are a number of examples that include the EGFR inhibitors and a number of other well-studied drugs. This table shows you the number of studies that have been published devoted to the development of predictive markers of response using patient cohorts. Now, the big disadvantage of these types of studies using patient cohorts relates to the limitation in the number of chemotherapeutic drugs that can be tested; as I mentioned, only so many drugs are being evaluated in clinical trials. To circumvent this, many groups have been using preclinical models, making use of human tumor cell lines and xenografts, to investigate gene expression profiles that are associated with response to therapy. And this can be done, as I mentioned, for hundreds or even thousands of drugs. So now, because this is a rapidly developing area in bioinformatics and in clinical research, there is no established set of rules that one can follow to make such a study successful.
A number of algorithms are still being developed, and people are also working on the processing of the drug response curves themselves, to find the best suited way of preprocessing the data and applying classification algorithms more effectively. So what I will do is show you a few examples of research papers that have been published, and you will see how diverse the clinical research can be when it comes to drug response studies. For example, the group at MD Anderson Cancer Center has a program on pharmacogenomic marker discovery whose goal is to study the treatment of breast cancer with combination T/FAC therapy. A number of papers originated from this group from 2004 until recently, and I will show you a couple of these to illustrate what people can do with this type of data. For example, in study one from 2006, the goal was to develop predictors of pathological response to the combination T/FAC therapy; they used some 80 breast cancer patients and validated on 51 independent cases. They evaluated the use of different classifiers that you heard about a bit earlier today, such as SVM, KNN, and linear discriminant analysis, and also the right number of features in the classifier predictor; I've gotten a question about this, so you will see how people approach this issue. Then they examined the effect of training set size on model performance. That is a very sensitive point, because we are always limited in training and validation set sizes, unfortunately, and so it is important to understand what the minimal size of our training set should be so that we can still do a good job of building classifiers with acceptable performance. They took a look at that too. And then they also compared genomic predictors versus clinical predictors.
So this is the data from this paper, where they performed a comparison of some 780 classifiers, which comprised 20 classification methods crossed with 39 different-sized gene sets, or selected features, if you remember the slide that I showed you before; this pertains to the classification rule. The workflow was that they ranked genes by the p-value of differential expression and then did 100 randomized iterations. What you can see here is the area, this time, above the ROC curve. Remember, this is the complement of the area under the curve, so we want to minimize it, for this number of different classification methods with a varying number of genes included in the classifier. As you can see, there is an improvement in classification as you increase the number of genes, for pretty much all of the classification methods. And then, as you continue to increase the number of genes, you see a decrease in the performance of your classifiers. That is because you reach the point where you have likely included all of the most informative genes, and whatever you add on top of that is not really informative but rather introduces noise that is simply irrelevant to the drug response phenotype. That is why the performance goes down a bit for some of the classification methods. So here is what they decided: this was the best-performing classification method, and the number of genes that they found to perform best in a classifier was around 30 or 40. That is how they approached this problem. Now, another aspect I mentioned they tried to explore is the effect of the training set on prediction performance. What they did is put both sample sets together, set aside 20 samples as a test set, and use the rest for subsampling and training.
So they did some 50 iterations, and the red dot represents the median prediction accuracy. What they could show, as you can see, is again the area above the ROC curve, which we want to minimize, plotted against the size of the training set. As you see, with an increase in the training set, we do see an improvement in the performance of the classifiers built on that training set. They extrapolated this to a size of some 200 cases and decided that this would be only marginally better than the number of samples available in the cohort they had at hand. So this is another way of looking at what training set size you are supposed to have in order to make sure that your classifier will yield acceptable performance. Now, here is the evaluation of the classifiers they built in comparison with clinical predictors such as age, tumor volume, and ER status. They used the same discriminant analysis method for both types of data. What they found, and I have framed it with a red box for you to see, is that the 30-gene classifier they chose showed significantly higher sensitivity, but not significantly better NPV, which is negative predictive value, than a multivariate clinical predictor. So even after such thorough work exploring the effect of different parameters on classifier performance, the clinically relevant NPV was still not really any better for their classifier than for the clinical predictors. This is to emphasize that this is actually quite a tough research question. Here is another, more recent study from the same group, and the motivation there was to investigate to what extent cell-line-derived markers can predict drug response in patients. This is still a big question, as I mentioned to you.
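Since NPV just came up, it is worth making the distinction concrete: sensitivity and specificity are properties of the test itself, whereas PPV and NPV also depend on disease prevalence. A quick worked sketch (the numbers are illustrative, not from the paper):

```python
# PPV and NPV from sensitivity, specificity, and prevalence (Bayes' rule).
def ppv(sens, spec, prev):
    tp = sens * prev                 # true positives per unit population
    fp = (1 - spec) * (1 - prev)     # false positives per unit population
    return tp / (tp + fp)

def npv(sens, spec, prev):
    tn = spec * (1 - prev)           # true negatives per unit population
    fn = (1 - sens) * prev           # false negatives per unit population
    return tn / (tn + fn)

# The same test (sens = spec = 0.9) looks very different at different prevalences:
print(round(ppv(0.9, 0.9, 0.5), 3))   # 0.9   -- in a balanced study set
print(round(ppv(0.9, 0.9, 0.2), 3))   # 0.692 -- in a population where disease is rarer
print(round(npv(0.9, 0.9, 0.2), 3))   # 0.973
```

This is exactly why a classifier with impressive sensitivity and specificity can still be clinically unremarkable once predictive values are computed at a realistic prevalence.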
And actually, I would say that the research community is still split, probably by half: half of the people do believe in the utility of preclinical models such as cell lines and xenografts, and the other half, well, maybe slightly less than half, do not really believe in them and say that the tumor microenvironment plays a truly pivotal role in the biology of cancer cells, so the knowledge that we get from model systems is not directly translatable to patients. So they tried to investigate this question. The way they did it is that they had a cohort of breast cancer cell lines, which were screened for response to the individual drugs: paclitaxel, fluorouracil, doxorubicin, and cyclophosphamide. The derived signatures were then evaluated on some 130 breast cancer patients treated with the T/FAC combination therapy, which included all of those individual drugs. So let's see what they got. The method they applied was as follows. They assigned cell lines to the sensitive or resistant class based on the GI50. Unfortunately, only paclitaxel had a clearly bimodal GI50 distribution, so for the other three drugs they arbitrarily selected cell lines at the extremes of the distribution of GI50 values. Then how did they find informative genes? In two ways: they looked for differentially expressed genes, ranked by significance, and they tried to correlate gene expression with GI50. They then constructed a multi-gene classifier using LDA to predict clinical response. Which measures of clinical response did they use? Pathological complete response as one, and residual cancer burden as the other. So what result did they get? They were not able to develop cell-line-derived predictors for the individual chemotherapy drugs that could predict response to the combination therapy in patients.
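The two gene-selection routes just described can be sketched on synthetic data. This is an illustration only: thirty made-up cell lines, fifty made-up genes, with gene 0 constructed to track GI50; route (a) labels cell lines from the extremes of the GI50 distribution, route (b) correlates each gene directly with GI50.

```python
# Sketch of GI50-based feature selection in cell lines (synthetic data).
import random

random.seed(2)

n_lines, n_genes = 30, 50
gi50 = [random.gauss(0.0, 1.0) for _ in range(n_lines)]
# gene 0 tracks GI50; all other genes are pure noise
expr = [[gi50[i] * 0.9 + random.gauss(0, 0.3) if g == 0 else random.gauss(0, 1.0)
         for g in range(n_genes)] for i in range(n_lines)]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Route (b): correlate each gene's expression with GI50 across cell lines.
cors = [pearson([expr[i][g] for i in range(n_lines)], gi50) for g in range(n_genes)]
best_gene = max(range(n_genes), key=lambda g: abs(cors[g]))

# Route (a): take the extremes of the GI50 distribution as the two classes
# (an arbitrary cutoff, as the study did for drugs without a bimodal GI50).
order = sorted(range(n_lines), key=lambda i: gi50[i])
sensitive, resistant = order[:8], order[-8:]

print(best_gene, round(cors[best_gene], 2), len(sensitive), len(resistant))
```

Note that the extreme-cutoff route discards the middle of the distribution, which is one of the reasons IC50/GI50 summaries are debated as response metrics.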
As you can see here, the accuracy is not particularly high; it is only some 60%. So what is the outcome of all of these studies that have been conducted over the last many years? The answer, unfortunately, is that they have raised many more questions than they have provided answers. For example, there is still a huge question: can human cell lines be used to predict response in patients? Is the commonly used IC50 the best metric, or should methods be developed to account for all data points in the response curve? What is the best measure of response in vivo? I have listed a number of different metrics that one can use, and there is still no consensus as to which is the most appropriate. What is the best way to build predictors of response? What is the best classification rule? And are signatures of response cancer-type specific, or can they be applied to other types of cancers? Unfortunately, these questions still need to be answered. And here I would like to familiarize you with a very famous, controversial study from Duke University, conducted by a bioinformatics group there, which resulted in a number of publications over the years. I will show you what happened with it; it is actually quite interesting and quite educational. In 2006, the group developed a method that was able to build promising predictors of response from the NCI-60 cell lines, validated on patient tumors, for the following drugs: fluorouracil, adriamycin, cyclophosphamide, and a couple of others. As you can see, the sensitivity, specificity, PPV, and NPV are really extraordinary. This approach was then named by Discover magazine as one of the top six genetics stories of 2006. Then, in 2007, they used the same approach to develop signatures of response to cisplatin and pemetrexed.
This spawned a clinical trial that assigned subjects to either of two regimens of therapy, using a genomics-based platinum predictor to determine sensitivity to chemotherapy. Also in 2007, they provided a validation of the combination approach, trying to predict patients' response to two alternative therapies, TET and FEC, and this report became a sub-study of a European clinical trial. Then, in 2009, they used the approach to construct a signature for yet another drug. Basically, what you can see from all these publications is: here is a method that gives you good predictions on independent test sets, has some biological plausibility, appears to give stable results over years of application, and consequently can guide treatment. Now, what happened next is that already in 2007 we start to see certain criticisms of the Duke predictors. Back at that time it was put very mildly: an analysis such as the one coming from Duke University has to be absolutely clear to the reader, so that others can check the derivations. It was then noted that the method seemed to be performing pretty well but was rather convoluted, and not many researchers could actually reproduce it. Then, in 2008 and 2009, a devastating rebuttal was published by biostatisticians at, I forget now where it was, I think the same MD Anderson, and they found a number of fundamental flaws in all of these publications. These were very often absolutely simple flaws: sensitive and resistant labels were reversed; samples were incorrectly duplicated; an off-by-one indexing error affected all of the genes reported; genes were included from other sources, for example outliers that had been said to be excluded from the analysis; and they would report a result for drug A that included a heat map for drug B and a gene list for drug C.
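The off-by-one indexing error deserves a tiny illustration, because it shows how an entire reported gene list can be wrong while still looking perfectly plausible. The gene names and scores below are made up for the illustration; the point is only the mechanics of the error.

```python
# Toy illustration of an off-by-one indexing error: pairing a score
# column with the gene names shifted by one row makes every reported
# gene carry its neighbour's score.
genes = ["ERBB2", "ESR1", "TP53", "MKI67", "BCL2"]   # hypothetical gene list
scores = [9.1, 7.4, 5.2, 3.3, 1.0]                   # made-up ranking scores

correct = dict(zip(genes, scores))
# The bug: scores get paired with the gene names starting one row later.
shifted = dict(zip(genes[1:], scores))

mismatches = [g for g in shifted if shifted[g] != correct[g]]
print(mismatches)   # every overlapping gene now carries the wrong score
```

Nothing in the shifted table looks broken on its own, which is exactly why such errors survive review unless the analysis code and data are released for checking.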
And these results were basically evident upon visual inspection and simple counting, and this is just the tip of the iceberg; there were a lot of other problems as well. Certainly, it resulted in the termination of the clinical trial, and the papers in Science and Lancet Oncology have been retracted. So this was actually quite a painful experience. What is important, though, is that this has resulted in the emergence of a very interesting area called forensic bioinformatics, where people are now trying to evaluate published studies and to monitor the correctness of the conduct and the interpretation of their results. For instance, these people evaluated a number of publicly available studies, and what they found is really astonishing: of 18 quantitative papers published in Nature Genetics in the past couple of years, reproducibility was not even achievable in principle in 10 out of the 18. That is really devastating, I should say. And what they were able to find is that the most common errors are indeed simple and, unfortunately, the most simple errors are very common. So there really are a number of challenges associated with the development of markers, which I have illustrated for you with these few examples and problematic, controversial studies. That is truly a problem which is still being addressed; people are trying to find the best ways to deal with it. But one should be aware of all the hidden pitfalls before approaching a research project like this, and try to avoid them if possible. So, as I already said, many predictors have been published with very impressive performance evaluations, such as sensitivity, accuracy, and specificity. The problem with them is that they are poorly reproducible, due to poor study design and incorrect study conduct and interpretation.
Now, only one factor should change between compared groups, such as the treatment or the infectious organism. And, as I mentioned before, training sets and test sets have to be equally distributed: you have to keep all other parameters more or less constant and equally distributed across the individual sample sets. Certainly, the method description should also be detailed enough for reproducibility and correct interpretation. People now say that there are three major threats to marker validity, and I will just touch upon them very briefly for you to be aware of: chance, generalizability, and bias. All of them are summarized in this table, and I will briefly walk you through it; if you want more details, you can read through this paper, which I think is a very nice paper. So, chance: this covers type I and type II error. A type I error can cause the erroneous, false positive conclusion that there is a difference between compared groups when in reality there is no difference. A type II error can result in the false negative conclusion that there is no difference when a difference really does exist. Another quite common problem is overfitting: when we assess a large number of possible predictors, we can find a certain pattern that fits perfectly but that arose simply by chance, because the model is now fitting the noise more than the actual signal. What is the solution to this threat? Again, increase your sample size. As I keep saying throughout these talks, this is very important: the greater the size, the better, and there are certain ways of estimating your power when it comes to selecting an appropriate sample size. These are the nomograms that you can see, for example for prostate cancer, that can help you with this.
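The overfitting threat is easy to demonstrate: with thousands of candidate features and only a handful of samples, some feature will separate even randomly labelled groups well purely by chance. A minimal sketch, on entirely random data:

```python
# With 2000 random features and 16 randomly labelled samples, the best
# single-feature threshold rule -- scored on the SAME data it was fit
# on -- looks far better than the 0.5 that pure noise should give.
import random

random.seed(3)

n_samples, n_features = 16, 2000
labels = [i % 2 for i in range(n_samples)]            # arbitrary group labels
features = [[random.random() for _ in range(n_samples)]
            for _ in range(n_features)]

def best_split_accuracy(values, labels):
    # best single-threshold rule for one feature, evaluated on the training data
    best = 0.0
    for thr in values:
        acc = sum((v > thr) == bool(y) for v, y in zip(values, labels)) / len(labels)
        best = max(best, acc, 1.0 - acc)               # allow the flipped rule too
    return best

best_over_features = max(best_split_accuracy(f, labels) for f in features)
print(round(best_over_features, 3))
```

The apparent accuracy here is pure noise-fitting; on an independent set the same "predictor" would fall back to roughly 0.5, which is exactly why independent validation is non-negotiable.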
And then, another very important way to reduce the threat of chance, as I mentioned, is to check reproducibility on an independent set. And certainly, having more than one independent validation set is far better than having just one, because any given patient sample set can always carry a certain bias. The next problem is generalizability, which concerns to whom the results of the comparison can be applied; for example, a comparison of test results in people with cancer and people without cancer. The generalizability of a study depends on the characteristics of the participants and on how they are selected, regarding age, gender, symptom status, hormone status for reproductive cancers, and so on. What is the solution for this? Markers should be assessed similarly to clinical trials, in three phases. In particular, this paper proposes that initial studies should involve tissues and animals, whereas later ones involve symptomatic and then asymptomatic people. This should account for the threat of generalizability. Now, bias is a very important factor, and it is basically unintentional and unconscious; there are so many hidden variables that can influence the phenotypes that you see. Bias is broadly defined as a systematic, erroneous association of some characteristic with a group in a way that distorts a comparison with another group. For example, say we compare a group of patients with cancer versus patients without cancer, and the cancer patients are all older people, say over 70 years, while everyone in the non-cancer group is younger, around 35 years. That is certainly a bias that should be corrected for. What is the solution? The solution is complicated.
And it involves making everything equal during the design, conduct, and interpretation of the study, and reporting those steps in a thorough, explicit way so that everyone can review them. Now, briefly, there are a number of tools. [Audience question] So you think there should be a fixed size for a validation set? Yeah, you know, that's a good point, and actually, honestly, I do not remember. But that's a very good point. Okay, so this slide shows you the tools that one can apply for this kind of research study. There is the R package called caret, which includes a large number of classification and regression models. Here on the right you see the table, which lists different classification methods, some of which you are already familiar with by now, and, for each method, what kind of research question it can be applied to: classification, regression, or both. Most of them can be extended to the regression problem; sometimes that is a problem for a particular classification method, so it is not always a simple thing to do, and some of them are not able to do regression at all. So this is something that you can use, technically, for this kind of work. Now what I want to do is summarize what we have learned. Despite the progress in the marker development field, there is still a key need for new biomarkers. There is a large number of classification and discrimination methods, and generally the LDA, k-NN, and SVM methods perform well across studies and in comparison studies. Feature selection is an important part of the classification procedure, which can improve classification performance, may provide useful insights into the biology of a disease, and can lead to diagnostic tests. Another important thing is performance assessment, which is crucial for classifier development and consists of several important components.
And the last point is that there are threats to marker validity, and transparent documentation is crucial to study reproducibility and to the clinical relevance of biomarkers. So this is the material that we have covered. Now, this is not over yet. What I would like you to do next is to try to apply the knowledge that you have just acquired to specific biological questions. Here, what you see in the first column is the biological question that was asked; then there is a design and a result, and I also have an explanation. So let's walk through them. The first question: does the molecular profile show clusters by survival? What the person does is select a subset of genes with significant differences between long and short survivors, cluster the profiles for these genes only, and get clusters by survival status. Now, is this the correct analysis? No, this is incorrect. Why? Because the genes were already selected by the difference between survival groups, so of course the samples cluster according to those groups, right? Another example: build a classifier for a rare subtype of cancer with a disease prevalence of 0.2, and assess its performance using cross-validation. What the person does is select equal numbers of patients with the rare subtype and the common subtype, giving a prevalence of 0.5 in the study. The classifier that one builds shows high sensitivity and specificity, and the person concludes that it is applicable as a diagnostic test. This is incorrect, because sensitivity and specificity do not depend on the population distribution, but NPV and PPV do. So sensitivity and specificity are not clinically relevant here; we rely on PPV and NPV, and they should be used for performance assessment. The third one: compare responders versus non-responders with respect to survival experience. In clinical trials, for example, patients are often defined as responders.
For example, by the shrinkage of a tumor, as opposed to non-responders. Now, compare the survival experience of responders versus non-responders using Kaplan-Meier, and conclude that the treatment is useful. This is not a correct analysis, because there is a bias: a patient had to survive a certain period of time to achieve a response, and we do not know how long that patient would have survived without therapy, maybe longer. So the conclusion cannot be made, and journals have actually banned this type of analysis. Now, what I want you to do is try to come up with a correct analysis for the first biological question: does the molecular profile show clusters by survival? How do you think it would be correct to approach this question? Okay, the beginning is correct. So here is the answer: we perform unsupervised clustering of genes that are differentially expressed across all the samples, with no respect to the survival group. Then, when we get some kind of clustering, we can check whether there is a difference in survival experience between the observed clusters. That would be the most correct approach. Okay, so the next one: build a classifier for a rare subtype of cancer and assess its performance using cross-validation. Any ideas? You just have to make sure that you use an incidence of 0.2 when you are learning. So you correct for the differences in distribution and representation of samples in both your training and test sets, and they should be representative of the population distribution. Then you test the predictor on an independent set, which is also identically distributed. Now, the last one: compare responders versus non-responders with respect to survival experience. What would be the approach to explore this biological question? Any ideas? So we have subgroups of patients, some of which respond to therapy and some of which do not. What can we do with these two sets? So here is the answer.
We can compare the molecular profiles of these two subgroups, right? And we can build markers of response to the therapy; this is one thing. And another: we can compare the survival experience of patients treated versus untreated, and this will give us an estimate of the usefulness of the treatment, for example. Okay, so this concludes this part of our module. This is the list of references that I have used throughout the talk, and you are welcome to read through them if you want to get a deeper grasp of these topics. So what we'll do is take a break now.
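As a small addendum to the last exercise: the Kaplan-Meier estimator underlying those survival comparisons is short enough to sketch. The (time, event) pairs below are made up for illustration, with event = 1 meaning death and event = 0 meaning censoring; the survival estimate drops only at event times.

```python
# A minimal Kaplan-Meier estimator on (time, event) pairs.
def kaplan_meier(data):
    data = sorted(data)                      # sort by time; ties are consecutive
    at_risk, surv = len(data), 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for x, e in data if x == t and e == 1)
        total = sum(1 for x, e in data if x == t)
        if deaths:
            surv *= (at_risk - deaths) / at_risk
            curve.append((t, surv))          # step down at each event time
        at_risk -= total                     # censored patients leave the risk set
        i += total
    return curve

# Hypothetical groups for a treated-versus-untreated comparison:
treated = [(2, 1), (5, 0), (6, 1), (9, 1), (12, 0)]
untreated = [(1, 1), (2, 1), (3, 1), (4, 0), (7, 1)]
print(kaplan_meier(treated))
print(kaplan_meier(untreated))
```

A formal comparison of the two curves would then use something like a log-rank test, which is the standard follow-up to Kaplan-Meier estimation.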