 Hello, good afternoon. So my name is Anna Lapuk and I have been working in the cancer bioinformatics field for probably since 2001. And originally I come from Russia, Moscow, and I did my Ph.D. in molecular biology studying the evolution of a human genome. And then I moved to California and worked in the University of San Francisco Comprehensive Cancer Center as a postdoc for a number of years working on the genomics of breast and ovarian cancers. And then I spent a few years in Lawrence Berkeley National Laboratory working as a computational scientist there and really focusing on the development of genomic and transcriptomic biomarkers for breast and ovarian cancers. And in 2009 I moved to Canada to Vancouver to join Vancouver Prostate Center which goal is to find the best cure for the prostate cancer. And since then I've been working in prostate cancer field and mostly again focusing on development of biomarkers. And I do as a scientist I do have a special interest in alternative splicing of messenger RNAs. I'm not going to talk about it today at all but basically my research is more translational research. So the Prostate Center is focused on development of new clinically useful biomarkers and new therapeutics including small molecules and antisense, oligonucleotides and advance them all the way through the clinical trials. So today I will be talking to you about the integration of very important data, clinical information data with genomic and transcriptomic and other omics data that one can get from the modern technologies. So the module will be organized in three parts. In the first part I will introduce you into the clinical data and biomarkers. In the second part I will describe you the statistical aspects of development of biomarkers. And the third part will be the lab where you will have your hands on the most important part of the analysis, which is called survival analysis. 
So the learning objectives of this module is to basically understand what the clinical information is and how it can be used to understand the process of its integration with high resolution data. And then we will review the current advances on the biomarker development front. Then the last two bullet points will be addressed in the lab that we'll have in the end of the module. So you will basically learn to evaluate the biomarker and perform survival analysis, very basic type of it. And also you will be able to play with a tumor cohort high resolution data, gene expression and copy number data and to learn basically how to deal with the question of where do we go next after we get the high resolution profiling of tumors. So what is the clinical data? Clinical data is a set of information about the patient and his or her disease. It usually includes a number of parameters that you can see listed here on this slide. So the important pieces of information is race, family history of cancer, the nodal status, whether cancer has spread to regional lymph nodes. Then the presence of treatment, it's radiotherapy or chemotherapy, hormonal therapy. Then these results of staining with some important proteins that are involved in etiology or progression of the disease such as hormonal receptors for cancers of reproductive system. Then size of the tumor, stage of the tumor and so on and so forth. There is also a very special type of clinical information data which appears right at the bottom of this list. It's an italic bold font and that is an information that has something to do with a certain time. So it's, you know, sometimes it's time from diagnosis to the recurrence from surgery or other treatment to the disease recurrence from onset of the disease to the death of the patient of the disease. And this type of data is a very special type of data. It's called survival data and it requires special analytical approaches and we will be learning about them later in the module. 
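One of the special approaches for this kind of time-to-event data is the Kaplan-Meier estimator, which comes up later in the module. What follows is only a minimal sketch of the idea, not the code used in the lab: the function name `kaplan_meier`, the six-patient cohort, and the encoding of events as 1 (observed) or 0 (censored, meaning the patient was lost to follow-up or the study ended first) are all assumptions made for illustration.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with an observed event.

    times  -- follow-up time per patient (e.g. months from surgery)
    events -- 1 if the event (recurrence or death) was observed, 0 if censored
    """
    observed = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)      # everyone who leaves the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who survive time t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Both events and censored patients are removed after time t
        at_risk -= leaving[t]
    return curve

# Hypothetical cohort: six patients, two of them censored
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

The point of the sketch is that censored patients still count in the denominator up to the time they drop out, which is why survival data cannot simply be analyzed as a table of averages.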
Clinical data alone has been used for quite a long time, more or less effectively, to aid clinical decisions and to predict the behavior of the disease in patients. This is normally done through the development of nomograms, such as the ones that I'm showing you here for prostate cancer and here for breast cancer. A nomogram is a certain score that is developed from a set of clinical parameters, such as age, hormonal status, tumor grade, family history, and so on, and that reflects the risk of disease recurrence. There are quite a few nomograms that have been developed for prostate cancer. They use slightly different scoring systems and different sets of clinical parameters, and they are more or less accurate. The accuracy actually ranges from somewhere around 0.6 all the way to 0.9, and this is how the clinical data has been used alone. On the other hand, we now live in the era of new technologies that make it possible to profile tumor cells at an unprecedented level of resolution. So we do have a huge amount of cancer omics data available to us, including transcriptome profiling, genome profiling, methylome profiling, and proteome profiling, and this data gives us hundreds to thousands of different aberrations that are believed to contribute somehow to the disease. The ultimate goal of having all of this technology and all of this data is to be able to use it to improve clinical decisions. This slide shows you an example of how single nucleotide variants, for example, can be used to help clinical decisions.
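To make the nomogram idea above concrete, here is a minimal sketch of a weighted clinical score. Everything in it is hypothetical: the parameter names, the weights, the intercept, and the five patients are invented for illustration, and a real nomogram fits its weights to a patient cohort. The talk does not say how the quoted accuracy is measured, so the sketch assumes a concordance-type measure, that is, the fraction of (recurred, not recurred) patient pairs in which the patient who recurred received the higher score.

```python
import math

# Hypothetical weights for binary clinical parameters (not from any real nomogram)
WEIGHTS = {
    "age_over_65": 0.4,
    "high_grade": 1.1,
    "node_positive": 0.9,
    "family_history": 0.3,
}
INTERCEPT = -2.0

def recurrence_risk(patient):
    """Weigh each clinical parameter, sum them up, and map the sum to a 0-1 risk."""
    linear = INTERCEPT + sum(w * patient.get(k, 0) for k, w in WEIGHTS.items())
    return 1 / (1 + math.exp(-linear))

def concordance(risks, recurred):
    """Fraction of (recurred, not recurred) pairs ranked correctly; ties count half."""
    pos = [r for r, y in zip(risks, recurred) if y]
    neg = [r for r, y in zip(risks, recurred) if not y]
    score = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return score / (len(pos) * len(neg))

# Hypothetical patients and outcomes (1 = disease recurred)
patients = [
    {"high_grade": 1, "node_positive": 1},
    {"age_over_65": 1},
    {},
    {"age_over_65": 1, "family_history": 1},
    {"high_grade": 1},
]
recurred = [1, 0, 0, 1, 0]

risks = [recurrence_risk(p) for p in patients]
accuracy = concordance(risks, recurred)
print(f"concordance = {accuracy:.2f}")
```

Under this assumption, a value of 0.5 would mean the score ranks patients no better than chance and 1.0 would mean a perfect ranking, which is one way to read a range like 0.6 to 0.9.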
So we profile the genome and we identify all of the variants. You had a specific module focused on that, and just to remind you, this stage is composed of three steps: we align the sequence to the human genome assembly, we find things that are different between tumor and normal tissue, and then we do molecular annotation of those variants. These variants are actually our biomarkers, or rather potential biomarkers, but then the question is what the clinical use is going to be. Establishing the clinical use is sometimes called clinical interpretation. It basically means what use we can derive from this information. For example, is this mutation diagnostic of a certain subtype of tumor? In that case it will be a diagnostic marker. Is this mutation capable of predicting response to a certain therapy? In that case it will be a predictive biomarker. Is this variant associated with the patient's outcome, for instance shorter or longer survival? In that case it will be a prognostic marker. Once we identify the potential clinical use, and then prove that this clinical use is indeed very robust (and there are a number of steps that need to be gone through in order to evaluate the clinical significance of a biomarker), then this biomarker can aid clinical decisions. So now let's recall what biomarkers are and what therapeutic targets are. People very often confuse the two, but actually they're not always the same. A biomarker is a biological molecule found in bodily fluids and secretions or in tissues that is a sign of a normal or abnormal process. A therapeutic target is also a biological molecule, but it is a molecule that can be modified by an external stimulus, and that external stimulus in our case is a drug. So once the molecule is hit by the drug, its behavior is changed. For example, we have a signaling molecule, say a tyrosine kinase receptor, that transduces a signal, and once we inhibit that receptor it stops transmitting the signal.
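The variant steps described above (find what differs between tumor and normal, annotate it, then ask about clinical use) can be sketched in a few lines. This is only an illustration under stated assumptions: alignment and variant calling are taken as already done, the variant coordinates and gene names (`GENE_A`, `GENE_B`) are placeholders rather than real loci, and the annotation table is a hypothetical stand-in for a curated knowledge base.

```python
def somatic_variants(tumor_calls, normal_calls):
    """Keep variants seen in the tumor but not in the matched normal sample."""
    return sorted(set(tumor_calls) - set(normal_calls))

# Hypothetical annotation table keyed by (chromosome, position, ref, alt)
ANNOTATIONS = {
    ("chr1", 1000, "G", "A"): {"gene": "GENE_A", "clinical_use": "predictive"},
    ("chr2", 2000, "C", "T"): {"gene": "GENE_B", "clinical_use": "prognostic"},
}

def annotate(variants):
    """Attach gene and candidate clinical use; unannotated variants stay 'unknown'."""
    unknown = {"gene": None, "clinical_use": "unknown"}
    return [{"variant": v, **ANNOTATIONS.get(v, unknown)} for v in variants]

# Placeholder calls for one patient: the chr3 variant is germline (present in normal)
tumor = [("chr1", 1000, "G", "A"), ("chr2", 2000, "C", "T"), ("chr3", 3000, "T", "C")]
normal = [("chr3", 3000, "T", "C")]

report = annotate(somatic_variants(tumor, normal))
for row in report:
    print(row["variant"], row["gene"], row["clinical_use"])
```

The subtraction step is the easy part. The clinical-use label is the part that has to be earned through the validation steps that the rest of the module describes.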
So that's the example of a target. Sometimes therapeutic targets are the same as biomarkers but not necessarily. Very often there are different molecules. So biomarkers have certain features. So first of all they can come from different sorts of aberrations. It can be mutation both somatic and germline. It can be an amplification of a gene region or a certain focal region of the genome involving several genes. It can be transcriptional change. For instance, transcriptional up-regulation or down-regulation without any underlying genomic change. It can be also post-transcriptional modification. For instance, it can be production of the alternative RNA isoform or it can be post-translational protein modification such as excessive phosphorylation of a protein. And because this is a biological molecule it can be any kind of biological molecule. It can be protein, nucleic acids including coding or non-coding. It can be cells actually. It can be not only molecules but it can be individual cells and I'll give you an example of that. It can be peptides and historically biomarkers were considered to be single entities such as single gene or single protein. But now more often biomarker is a panel of units such as genes. So it's for example the gene expression signature or proteomic signature or metabolomic signature. So the biomarker goal is to screen, right? And so the most attractive feature of a biomarker is that that is the least invasive. So if it can be screened in the secretion or in bodily fluids plasma, urine, blood, the whole blood then it's a way better biomarker because it's it's not invasive and it can be repeated over and over again along the progression of the disease so that we can monitor this the disease state and the evolution of the cancer in patient. 
Less desirably, it can be screened in tissues, and that implies taking a biopsy, usually a needle biopsy, then staining for a certain biomarker at the protein level or mRNA level, followed by imaging. As I mentioned, biomarkers may or may not be the same as therapeutic targets, and here I give you a few examples. HER2 in breast cancer, or the ERBB2 tyrosine kinase, is both a biomarker and a therapeutic target at the same time. Herceptin is a therapeutic that specifically targets HER2, and it's effective in patients who have HER2 amplification. So when a decision needs to be made about whether to treat a breast cancer patient with Herceptin, the patient is first tested for HER2 status. This is a companion test for HER2 status, which is used in combination with the treatment regimen. PSA, prostate-specific antigen, is a widely used biomarker in prostate cancer, and actually for prostate conditions in a somewhat broader scope. It is a prostate cancer biomarker but it is not a target, because it's an effector molecule: it's downstream of the androgen receptor, which is a steroid hormone nuclear receptor that drives the cancer and which is itself a target for specific treatment. Another example, in colorectal cancer, is KRAS mutations, which are a biomarker of the sensitivity of patients' tumors to EGFR-targeted treatment, so again the biomarker is different from the target. I already mentioned that it is important to understand and establish the clinical use of a biomarker, and I have already gone through that, so here again I'm giving you the definitions and examples of the different types of clinical use of biomarkers. A diagnostic marker is used to diagnose or sub-classify the disease, and an example is the very characteristic BCR-ABL fusion in leukemia. Another example that appeared just a few years ago in prostate cancer is the very specific TMPRSS2-ERG fusion, which is present in more than 50% of prostate tumors.
A prognostic biomarker is used to make a prognosis about the clinical outcome of the patient, for instance survival or recurrence, and an example of this kind of biomarker would be the Oncotype DX gene expression panel for estimating breast cancer recurrence. Then there are predictive markers, which predict the response to a given therapy. As I already described to you, HER2 is one example, and KRAS mutations for EGFR-targeted treatment are another. There are also so-called companion diagnostic markers, which are a subtype of predictive biomarkers that are used in combination with the treatment itself but do not provide independent prognostic or predictive strength. An example of that would be the activating mutation in BRAF for response to treatment with BRAF inhibitors in melanoma. This table now expands on the examples of clinical use of biomarkers, and I hope you can see it clearly here. So again, this is an example of a germline mutation that can be used to estimate the risk of developing cancer during a lifetime, then the screening biomarker that I already described to you, PSA, then differential diagnosis, and the fusions are very specific for prostate cancer, for instance. Yes?
Can you explain the difference between a predictive marker and a companion diagnostic again? Because BRAF, that would actually predict response to the therapy. But if you just take that mutation in that particular cancer alone, it's not a biomarker by itself. It's only a biomarker with regard to the treatment, so that's why they're used in combination. This is a very subtle difference, but that's why they are called companion. Yes, in light of treatment, yes. It's a very subtle difference, but basically the thing is that they're not powerful enough, and they're not useful outside the context of treatment with a specific therapy. So here we can predict response, as I already mentioned, and then we can monitor disease recurrence, and there is a set of markers out there for that. And we can also monitor response or progression of metastatic disease. Now this slide summarizes the biomarkers that have been approved and are currently used to aid clinical decisions, so they have established clinical utility. It's good that we have several markers for several cancer types, and yet the cancer research field is decades old and a lot of research has been invested into it. So the reasonable question would be why there are so few biomarkers that have established clinical utility. I have to note that there are many more biomarkers currently under development whose clinical utility has not been established yet. The reason is that the process of developing a biomarker is very tricky and very prone to bias. As I mentioned, multiple types of aberrations can give rise to biomarkers, and one of them is fusions. For example, here in lung cancer, ALK fusions are a type of fusion transcript biomarker. These are fusions that involve different
partners. Very often it's EML4, but it may be a different partner. It's always ALK that is being fused to another gene, and as a result ALK is constitutively active. This happens in approximately 2 to 7 percent of non-small cell lung cancers, of the adenocarcinoma type and mostly in non-smokers, and so it is considered to represent a certain subclass of lung cancers. ALK inhibitors such as crizotinib are effective for ALK-positive tumors, and there now exists an FDA companion test for ALK fusions, done with FISH, to accompany treatment with crizotinib. Here, for instance, you see this FISH staining, which is a type of staining called a break-apart probe. Because the fusion basically breaks apart the 3 prime and 5 prime parts of the gene, two probes are designed, one for the 5 prime end of the ALK gene and one for the 3 prime end. If you see that these dots are separate in space, it means there is a fusion there, so the two parts of the gene have been brought apart. If instead you see colocalization of these signals, it means that the gene is intact. And here you see the waterfall plot, which indicates the level of response of patients to crizotinib. These yellow bars are ALK fusion-positive tumors, so it means that most of the responding tumors are ALK positive. Another type of biomarker is gene expression based, and here I'm giving you an example, the Oncotype DX gene expression signature, which is composed of some 21 genes. The problem in breast cancer patients who are treated with hormonal therapy is that only a minority of them recur, which means that more than 85 percent of them may not need additional chemotherapy, and so there is a need to predict the patients' risk of recurrence. The development of this gene signature started from the accumulation of some 250 candidate genes from the literature, then evaluating them in three independent studies involving a total of 450 patients, and
development of a quantitative RT-PCR assay to measure expression levels from FFPE tissues, which are a special type of archival tissue. As a result, this test is now approved for evaluation of recurrence in node-negative, ER-positive breast tumors in patients who were treated with hormonal therapy. This is yet another type of marker, which is based on an epigenetic signature in colon cancer, and this signature is actually a whole-genome or transcriptome-wide pattern. One of the molecular aberrations observed in colon cancer is differential epigenetic events. This heat map shows that in colon cancer patients the tumors have different levels of methylation of CpG islands in promoter regions, which leads to transcriptional silencing of those regions. In normal tissues there is basically no hypermethylation of a certain subset of genes, but in cancer patients there are two subtypes of colon cancer. There are the ones that have a low level of methylation in promoter regions. These were called CIMP-low, and they alone were associated with poor outcome. There is also another subgroup, called CIMP-high, which had a very high frequency of hypermethylation of promoter regions. That subgroup was very controversial: it was not clear what its association with outcome was, and other factors were suspected to be contributing, including microsatellite instability and BRAF mutations. Then a paper in Clinical Cancer Research in 2010 clearly demonstrated that the combination of CIMP-high status with microsatellite stability is actually associated with the most adverse effect, right there. This is a hazard ratio, and I will describe to you later what it is. It's a metric of how strong the effect of a certain marker is on the outcome: a hazard ratio of one means that there is no effect, a hazard ratio greater than one means that there is an adverse effect, and the higher it is, the stronger the effect. Now this is an example of using individual
cells, rather than biological molecules, as potential biomarkers. CTCs, circulating tumor cells, are cells that are thought to be shed by the tumor: they detach from the primary tumor, enter the bloodstream, travel in the circulation, find another niche at a distant site, and are believed to be able to seed the metastatic clone there. CTCs have received much attention over the last few years, in particular as a potential biomarker, and it turns out that just a simple count of those cells in the plasma of cancer patients is indicative of the disease aggressiveness and can be used to predict the outcome. In breast cancer, for instance, right here, a count of more than five cells per mL means that the patient will be doing very poorly, as opposed to fewer than five cells. There was heterogeneity, but this cutoff of five cells seemed to be the best discriminator of the two outcome groups. Another example is in neuroendocrine tumors, where a single CTC is what actually mattered: patients who had at least one CTC did far worse than the ones who didn't have CTCs at all. For the neuroendocrine tumors, the comparison was done with a conventional biomarker, which is chromogranin A,
it's very specific to neuroendocrine tumors, and CTCs were shown to be an even better discriminating biomarker than that established biomarker. So the development of a biomarker includes two main steps: first identification, and then validation. Identification is normally done nowadays using high-resolution profiling such as microarrays, sequencing, or mass spectrometry, and biomarkers are identified through comparison of molecular profiles between subgroups of samples. It can be tumor versus normal, or it can be different subtypes of tumors. The important consideration there is that the study design should be very careful, to avoid bias in biomarker discovery, and I will give you some examples of why that's so important. For example, when it comes to developing tumor-specific biomarkers, it is ideal to have matched tumor and normal samples from the same patients. It's not always possible, but that's the ideal scenario. Once the biomarker is developed, it needs to be validated, and the whole process of validation is composed of different components, each of which is very important. First there needs to be analytical validity, which means that the assay developed for screening biomarker status needs to be reproducible, sensitive, and specific. Then clinical validity and clinical utility are very important aspects of validation. Clinical validity answers the question of how reliably the biomarker divides the populations of tumors, and a very important aspect is that validation always needs to be done on an independent cohort that was not touched at all at the stage of biomarker identification. Then clinical utility, which I already described to you, means what type of biomarker application we're dealing with: prognostic, diagnostic, or predictive. Then there is how strong the association of the biomarker is, so the statistical significance, and then there are other considerations such as the size of the effect. For instance, if you have a
specific mutation, how much does it affect the outcome? If it's only slightly, then it's not a good biomarker. If the presence of the mutation means that survival drops dramatically, then this is a good biomarker. An example of established clinical utility is KRAS mutations in colorectal cancer, so just to go through this: EGFR is a transmembrane receptor of the tyrosine kinase family. It's very important in signal transduction, and it affects a number of molecular mechanisms in the cell. In tumors there is frequent upregulation of EGFR, and a number of treatments have been developed against EGFR. However, not all patients who have upregulation of EGFR respond equally well to this treatment, and a number of mechanisms have been implicated in the resistance, either intrinsic or acquired. They include EGFR mutations, alternative pathways, and, importantly, independent activation of downstream signaling pathways such as PI3 kinase and the RAS-RAF axis. What's interesting about this RAS-RAF axis is that mutations in KRAS are very frequent in colorectal carcinoma, from 40 to 45 percent, and mutations in KRAS are associated with poor survival on their own. It was noted that, among patients who were treated with EGFR-targeted therapy, the ones who responded did not have any mutations in KRAS, whereas those that were resistant had some sort of activating mutation in that gene. In vitro studies then further demonstrated that KRAS mutations indeed contribute quite a lot to the resistance of tumors to EGFR-targeted treatment. Then prospective clinical trials were commenced to investigate the effect of KRAS mutations on the treatment, and they all gave consistent results: mutant KRAS confers resistance to EGFR-targeted treatment. There is now a companion test for KRAS mutations that goes together with the EGFR-targeted treatment. Here I'm showing you the summary of these clinical trials, and here you can see the response rate in KRAS wild
type patients: it's pretty significant, whereas there is virtually no response in KRAS-mutant tumors, and this was recapitulated well in other clinical trials, in this one for example. Now, an example of little or no clinical utility is genomic markers in prostate cancer. I already mentioned to you that nomograms for prostate cancer are quite popular, and one of the reasons is that prostate cancer is quite unique in this regard, because it has a very, very long natural history: patients usually live with the cancer for more than a decade. Another complication is that there is a single histological subtype that is predominant in prostate cancer, so we can't really even start with identification of histological-subtype-specific molecular signatures. And yet, even though there is one most common histological subtype, the level of response and the outcome may be very different. What I'm showing you here is a summary of the accuracy of a number of nomograms that have been developed by different groups, and you can see that the accuracy ranges from something like 0.7 to 0.9. For example, in this paper (and I'm just showing one example, this is not the only study that has been published) they were trying to derive a gene expression signature that would add to the nomograms and improve the accuracy of prediction. They do show that nomograms alone give less separation than nomograms in combination with gene expression data (this is the survival experience of the two groups), but still the accuracy of the combination marker is not significantly higher than that of the nomogram alone. This paper is just one example. There are many, many papers like it that have demonstrated basically a failure in the case of prostate cancer, unfortunately. Yes? A nomogram is a sort of scoring system that takes into account a number of clinical parameters of a patient's
disease, and then, based on that set of parameters, it calculates a score that is a risk assessment of cancer progression, recurrence, or poor outcome. Basically, nomograms are a set of statistical formulas that take in binary or continuous data from different clinical parameters, weigh the individual clinical parameters in some way, sum it all up, and then spit out a certain score. Isn't that the Gleason score? No, the Gleason score is completely different. The Gleason score is a grade of the tumor. It's a grading system. This slide gives you an example of a very rigorous and very good clinical validation of biomarkers, which unfortunately is a rarity these days: the majority of biomarkers that you see published in the medical literature are not validated in the proper way. This was a really great effort on the part of several research institutes. They collected more than 400 lung cancers at different sites, four institutions independently profiled these tumors with the same microarray platform, and the data processing was then unified across all of the centers. The centers had an opportunity to develop their own biomarkers. They needed to develop each biomarker on a completely independent cohort and then validate those biomarkers on two independent cohorts of reasonable size. That validation was blinded, which means that the researchers were not allowed to know the real outcome for the patients in the validation cohort, which is the ideal kind of validation. Here I'm just showing you the different biomarkers that they analyzed, which were mostly gene signatures. They tried them without clinical information and in combination with clinical information, they did the validation using Kaplan-Meier curves and then using ROC curves, and basically they selected the best discriminating biomarker, which
is shown right here. It's actually irrelevant at this point what kind of genes were in there. It's just to make the point that it's really very important to do a correct validation of a biomarker, so that it will perform as expected and will be clinically useful. Now, this paper is an example of a very poor, very biased biomarker. What the authors tried to do was to develop a serum peptide signature for discriminating cancers from healthy states. They compiled three cohorts for different cancer types, plus controls, and all of these samples were from completely unrelated patients. As a result, for prostate cancer, for instance, they had a cohort of some 30 patients, males of course, with an average age of 66, and the controls were 33 healthy individuals with an average age of 34, which is way younger, and most of those were females. So the biomarker that they claimed to have developed is related to age and sex, and not to prostate cancer at all. Now I would like to move on to the second part, so would you like to take a five-minute break? Okay, so let's resume. Now I will tell you about the statistical aspects of biomarker development. I will describe survival analysis to you, what it's for and how it's done, and then we will proceed with the lab, where you will try to do survival analysis by yourselves. The identification of biomarkers can be done through either supervised or unsupervised analysis. In supervised analysis we know the outcome subgroups, for example responders versus non-responders. We perform molecular profiling and identify biomarkers that are unique to each of the subgroups. An example of that would be the KRAS mutations in responders versus non-responders to EGFR-targeted therapy, as I described to you earlier. In unsupervised analysis there is an additional step that precedes this process, and that step is the unsupervised identification of the outcome subgroups. So we first
do the unsupervised identification of subtypes, for example of tumors, and then we proceed with marker identification. An example of the latter scenario is this METABRIC paper published in Nature in 2012, which was a fantastic paper involving 2000 breast tumors collected from both Canada and Europe. Here you see survival curves for the individual genomic subgroups that they identified based on the genomic profiling, and you can see that those subgroups were indeed very much associated with different outcomes. For those who are not familiar with Kaplan-Meier curves, and I should have said this earlier, the lower the curve, the poorer the survival, so patients die or recur in a shorter period of time. So they identified genomic subgroups, they associated those groups with outcome, and then they validated the signatures that they had identified for each individual group in independent cohorts. So how is a biomarker, or classifier as it is often called, developed? There are two distinct stages: discrimination and prediction. For discrimination, we start with molecular profiling, for example expression profiling, of two groups of patients, where one group, for example, responds to treatment and the other doesn't. We compare these two subgroups with the help of a so-called classification rule, and we get a classifier, which is a set of genes with a certain classification rule: for example, those genes need to be upregulated in responders and downregulated in resistant tumors. This part is called discrimination. Then, once the classifier is identified, when a new patient comes in we do molecular profiling of his or her tumor and try to predict the response to treatment based on the behavior of that classifier, and that part is called prediction. This slide summarizes all the common steps in the process of building and validating a classifier, so
basically we start with a learning or training set, and we develop a classifier as I just described to you. So this is a set of tumors again, right? We develop a classifier, which we need to test on an independent test set to evaluate its performance. Independent means that these tumor samples come from totally different patients who were never included in the training set, so these two sets are completely independent. However, one needs to make sure that the learning set and the test set are equally distributed with regard to a number of parameters: race, age, family history, and so on. So they need to be very similar. Yes? I will do that in just a second. Yes, the more the better. So if you have a choice, it's hard to say. There is no golden rule that you need to select the larger set as the learning set and set a smaller set aside. Sometimes people do that, because you want to build as robust a biomarker as possible, and then you can do a validation on the smaller cohort with the hope that you have built the strongest biomarker possible. That's actually how cross-validation works, which I'll describe to you in a second. I guess it also depends on how much data you have, if you have enough data? Exactly. That's what is very often done, because there is no luxury of having completely independent tumor cohorts: people split the only tumor cohort that they have in two, and then they use one part as a training set and the other part as an independent test set. My question was about the independent set: you said that for the test set you need it to be as varied as possible, like different races? No. What I meant by equally distributed, right,
that's your question. So it means that there should be a similar distribution of those parameters in the two data sets. For example, in the case of prostate cancer, you can't really take as your training set men who are older than 80 years and then have as your independent test set people who are younger than 60 years. You can't do that. (Audience question: So what's the procedure? Do you choose them by hand, or do you do a random sample, or do you do a stratification and then a random sample from that?) So these days it's actually more often done with manual curation. These sets are very carefully compiled from the available patients. It's a very important step, and people who do this right spend a substantial amount of time compiling appropriate training and test sets. Unfortunately there is no luxury of having hundreds and hundreds of different patients. Tumor banks are quite limited in numbers, because there are a number of other criteria that go into the collection of samples: the quality, the tumor content, the classification of a patient's disease. So a lot of criteria are used in order to select a particular sample and take it into consideration. So it's a manual curation, which is very careful and is usually done by biostatisticians. (Audience question: When you're using a training set and a validation set, a lot of papers are coming out where they're using multiple training sets from the literature, and there you're almost restricted to what is in the literature. So how do you go about that?) Well, as I mentioned, there is a multitude of bioinformatics papers that originate from bioinformatics groups who are not affiliated with hospitals and don't have access to tumor banks. So they leverage what's published out there, and they try to develop some unique classifier based on some machine learning techniques or some sophisticated algorithms,
and then they leverage multiple independent data sets to show that their classifier works, and then they may have their own small tumor bank to validate on. So people are trying to deal with whatever they have. That's why I showed you an example, which is one of the very few examples, in lung cancer, where the development and validation of a biomarker was done in a completely correct way. And that's why I'm trying to bring this up, so that when you read the medical literature you will be able to raise questions and evaluate the strength of a biomarker: whether the study was correctly designed, whether all the validations took place, and so on and so forth. So, yes, I will continue doing so.
So for a classifier that was developed on the training set, the best estimate of its validity will come from here, when it is tested on the independent test set, and that gives certain statistics. But it can also be validated, so to speak, on the same training set, and the way this is normally done is through cross-validation, right here. Instead of using the entire training set for the development of a classifier, the entire set is split at random into, for instance, five equal parts. One subset is set aside and the remaining four are used for the development of a classifier, and once the classifier is developed it is tested on the subset that was set aside. In that case it is five-fold cross-validation; it can be ten-fold cross-validation, depending on the size of the cohort. Then this procedure is repeated, five times for instance, or more times: you can make multiple random splits into five subsets from your given training set. A subtype of cross-validation is leave-one-out cross-validation, which means that you leave just one sample out, develop the classifier on all of the rest of the samples, test the classifier on the single remaining sample, and then you repeat it all
over for all of the samples, and then you average your performance across all those iterations. Anyone? Yes. (Audience question: Is there a standard for picking which cross-validation to use?) No, there is no standard. Again, there are no golden rules for that. Basically, the more iterations you do the better, but at the same time you may be limited by a very small cohort size. So for instance with 30 patients, you split them into five and each individual subset is very small, so even if you do multiple iterations with random subsampling you still end up with quite a wide range of biomarker performances, and that gives you a huge error on the estimate. So there are always trade-offs that one needs to consider when doing this kind of thing. All of those performance estimates are called different things, and all of them contribute to the final assessment of the performance of the biomarker. So different metrics are used for performance assessment here. How accurate the classifier is, how well it classifies the learning set, would be the resubstitution error rate. This is just for your reference. One thing that I will mention and show you how it works is the so-called ROC curves, receiver operating characteristic curves, which are used to compare different classifiers and pick the best one. First I will tell you about the confusion matrix, because I think it's very simple in itself, and if you know it then you will be able to understand the metrics that are normally found in the medical literature describing the performance of biomarkers. You have probably heard many times about sensitivity, specificity, positive predictive value and negative predictive value, and it's basically very simple. We construct this confusion matrix, this table, comparing for instance the performance of our novel biomarker relative to the gold standard, and the gold standard, for instance, will be the pathology
evaluation of whether the sample is benign or tumor. And here is our marker, which also tells us, on the molecular level, whether the sample is benign or tumor. We know which of the samples in our cohort are positive according to pathology, and this many of those are also predicted positive by the biomarker; these would be our true positives. Likewise, these would be the true negatives, which are predicted by both methods to be negative for cancer. And in the remaining two cells we have the false positives and false negatives. So basically these metrics, sensitivity, specificity and the predictive values, are fractions of those red components relative to the sum of the corresponding row or column, and that's it. And the accuracy, which you also often find in the medical literature, is a combination of these and can be formulated through sensitivity and specificity in the following formulas. So now you know where they come from. Just to give you one example here, with CA125 levels; this is a marker of disease progression in ovarian cancer, and endometrial cancer here as well. Here you see the ROC curve; I will describe how they are used and how they may be interpreted a couple of slides further on. And here you can see that very often these metrics of a biomarker are reported, in percentages, sensitivity and specificity, based on different cutoff values for calling the tissue cancerous or benign. So this is a high concentration, this is a low concentration, and they decided that this cutoff is the best: it discriminates well enough, with this specificity, this sensitivity, this accuracy.
So what are ROC curves? ROC curves are receiver operating characteristic curves, and a ROC curve is a graphical plot of sensitivity versus one minus specificity, or, formulated differently, of true positive rate versus false positive rate. This ROC space is divided in half by this diagonal line, which is the line of no discrimination. So if the ROC curve is on or below this
diagonal, then there is no useful discrimination by that marker. Something that is above the no-discrimination line is a working biomarker, and the closer it is to this upper left corner, which is perfect classification, the better. This is a real example here, where you can see multiple ROC curves for multiple methods, and you can see that this one is the best. So that's how ROC curves are used. The shape of the curve matters because we normally like to use the area under the curve, the AUC, and the larger the area under the curve, the better; if it approaches one, then this is perfect classification. Yeah, so this is just an abstract example, right; it has nothing to do with our topic today. But there are different methods to perform the same classification, to call, say, a molecular profile as coming from benign tissue or from tumor tissue. So you receive your molecular profile, and you have your biomarker, you have another gold-standard biomarker, you have a third biomarker that you have developed, you have five more biomarkers that you have developed, and now you want to select the best one. So you make the ROC curves and then you pick the best one based on them. (Audience question: If you have a gene expression profile biomarker, and you do a Kaplan-Meier and see there's a difference in a validation, what gets fed into a ROC curve? How do you transition from there to a ROC curve?) So once you find an association with outcome, it's good, right? But that tells you that the biomarker is associated with outcome; you don't really know how sensitive and how specific that biomarker is. The ROC curve gives you an opportunity to estimate that. So for instance, you can say that at this point here you will have this
much of a false positive rate and this much of a true positive rate. So you can estimate the true positive rate and false positive rate for the biomarker and then judge whether that is reasonable in the setting of a particular cancer. For instance: I am very happy with this true positive rate, and I can risk this false positive rate, which is fine, I will accept that, and so this biomarker will work well for me. I have an extra slide at the very end which describes how ROC curves are constructed. It's an extra slide; I think it's not included in here. Oh, so they are printed, so they are right at the end, and I would encourage you to go there after I'm finished. Yeah, I think it's third from the last. It describes how the ROC curve is constructed. So the ROC curve is constructed from the false positive rate and true positive rate as the threshold for the biomarker changes, and that's what that slide is about. It's actually quite nice, I love it, but last year when I was giving this module people weren't really interested, so I moved it out as a supplementary slide. But you're welcome to refer to it.
OK, so now let's come to survival data and survival analysis. I mentioned in the beginning that there is a special type of clinical data that has to do with the time for some event to occur. These are listed here, in the red frame. It is the time from one fixed event to another fixed event, for instance the time from surgery to the recurrence of the disease. If there was no surgery but there was some neoadjuvant therapy, then it is the time from the therapy, for example when there was a regression of the tumor, to the relapse. So this time from one fixed event to another fixed event is a sort of survival time, and this data is called survival times. Very broadly, it can be time to death or time to recurrence;
it's very broadly called survival times, and it requires very special analytical approaches, which are called survival analysis. Survival analysis has three main components that are very often used by researchers, and you can find reports of these types of analysis pretty much everywhere in the literature. One of the goals is to estimate the probability of an individual surviving for a given period of time, given the set of characteristics of his or her tumor. For that purpose we use Kaplan-Meier survival curves, or life tables, which are used to construct the Kaplan-Meier curves. Another question that might be addressed is comparing the survival experiences of two different groups of individuals. Basically it's just a different test; you can think of it as a comparison of the two Kaplan-Meier curves that gives you a certain metric, a p-value, which tells you whether they are indeed statistically significantly different. For that we use the log-rank test. And finally, if we want to identify multiple variables that collectively contribute to the risk of poor outcome, and evaluate the contribution of each individual variable, then for that purpose we use a multivariable Cox regression model.
So survival time, as I mentioned, is the time from a fixed starting point to a fixed end point, and here are the examples: from surgery to death, recurrence or relapse; from diagnosis to death, recurrence or relapse; and from time of treatment to death, recurrence or relapse. From this you can see that survival time is actually a very broad definition, and it's up to the researcher, the biostatistician and the clinicians to decide which survival time is most informative. For example, for prostate cancer it is biochemical recurrence: once the PSA rises, that is a surrogate for recurrence of the disease, and we normally use time to biochemical recurrence as our survival time. Death from the disease is another frequently used end point. So one
inherent feature of survival times that makes them so unique, and unsuitable for common statistical methods, is that we almost never observe the event of interest in all of the subjects. What I mean by that is that clinical studies are usually performed within a limited period of time, and then we collect the results and publish them. So for instance, patients have been followed up for five years or for ten years; say five years, and we had a hundred patients in the beginning. By the end of five years some of them may have recurred or died, and some of them may not. It means that our event of interest, recurrence or death, has not occurred for a fraction of the patients, and from the statistical point of view this is a problem for common statistical methods. That's why a distinct set of analytical approaches has been developed. This kind of data is normally called censored data. Censored observations arise whenever the dependent variable of interest represents the time to a terminal event and the duration of the study is limited in time. So again, as I said, we perform the study within a limited period of time, five years, and the event of interest has not happened yet. We just know that it will probably happen sometime in the future; we don't know when. The only fact that we know is that it has not happened by the end of five years. So from the statistical point of view it's considered to be an incomplete observation. These are examples of that: for death from the disease, still alive; in social studies, for survival of a marriage, still married; and for dropping out of school, still in school. So these are examples of censored observations. There are different types of censored observations, and I'm not going to get into that right now. What I will just say is that in the medical field we most frequently use type I censoring, and it can be either right or left
censoring. So this is a Kaplan-Meier curve, which, as I mentioned, is used to evaluate the survival experience of two groups, and can also be used to estimate the risk, or the potential outcome or survival, of any given patient who has certain characteristics of his or her tumor. The way it is constructed is that patients are followed for a certain period of time. For instance, we have two groups here, based on a certain status; it's irrelevant at this point what status. So we're comparing the two groups of patients, and for each of the groups, at any given point in time, we calculate the fraction who are still alive, who have not died yet. Once one patient dies, we need to recalculate that fraction, and that's why these functions are step functions. When you see a drop right there, it means that at this point one patient in that group died. So that explains why this is a step function. Basically, at each time point when we lose a patient, we compute the proportion of patients who have reached the event of interest, who have died for example, and this is called a life table. That life table is then visualized as a Kaplan-Meier curve, right here. So now, how do we interpret these curves? First of all, this function is called the survival probability function, and it is just the cumulative product of those fractions that I described to you here. These tick marks are censored observations; it means that these people have not reached the event of interest and have, for example, fallen out of the study. So what is the main usage of the Kaplan-Meier curve? Number one, to see how different the survival experiences are, visually but not numerically. Number two: if we have a patient who shares all other characteristics, molecular and clinical, with the red group, what is the probability of that patient surviving for
two and a half months? The probability is 50 percent. And what is the probability of a patient who shares characteristics with that other group surviving for two and a half months? The probability would be 90 percent. OK, so that's basically how these Kaplan-Meier curves are interpreted and used. But these Kaplan-Meier curves show you whether the survival experiences of the two groups differ; they don't tell you how significant the difference is, so we don't get any specific p-value. For that we use a log-rank test, which compares the survival experiences of the two groups and gives you a p-value. This is the formula, which is very similar to a chi-square test comparing fractions. Here I'm giving you an example that I already showed you before, the association of KRAS mutations with outcome. And yes, the separation of the two curves means that there is indeed different survival for patients with or without mutations in KRAS, and now the log-rank test gives us a p-value; it tells us how significantly different they are. So again, this type of censoring is usually type I, and if you want to know the details I would refer you to the statistics textbooks. I don't think that you really need to get into that level of detail unless you're a biostatistician. It's normally dictated by the type of study that you have and the way of recording your survival times, and by when the patient was accrued into the study. (Audience question: How do you assess censoring?) What do you mean, assess censoring? (Audience: Well, sometimes patients drop out for a reason, and it's not just at random.) Oh, is that right? Well, I normally like to include censored observations because they do contribute to the Kaplan-Meier curve, although only to the left part of it. The trick about these censored observations is that they have participated up to this point, so they do contribute on the left-hand side, to these proportions that were
calculated here, but once they drop out they stop contributing on the right-hand side. That's why you see a number of these censored observations here but no more drops; they don't really contribute there. So I tend to see censored observations as helping; they improve the power. (Audience: It's just that sometimes there's a pattern in the way patients drop out, and if you don't account for that pattern then you're biasing your results.) What kind of pattern? I'm curious. I've never experienced any specific pattern. Yeah, well, that would be lovely, I would love to know about that, because that would be something very interesting. Normally people just move to a different city, or, I mean, they may die, you know, from the disease, right, and they may be lost to follow-up for some other reasons, but I've never seen a pattern. I would love to read something about that; that would be interesting. I'm sorry, can you repeat that again? Here, I'll pick it up. Ah, so they may drop out because they... OK, well, yeah, that could be. So maybe they don't tolerate the treatment that well, right? Yeah, it could be. (Audience question: When you do something like this, is it important to make sure that you have the same number of patients in the two groups?) It's not necessary that you deal with the same number of patients in the two groups. (Audience: OK, but the step down is proportionate to how many patients there are.) Yeah, that's a very good point actually, exactly. Are you a mathematician? No? Yeah, so this is exactly the reason why we have such big steps in this curve and small steps in that curve: we have far fewer patients in here, and the fraction is very much influenced when you lose one patient. (Audience question: So is that taken into account when comparing the groups, if the number is not the same?) Yes, it is taken into account. It's not necessary to have the same number, no; it's taken into account. So it basically goes, at each and
every step, it goes and calculates the fraction, and then the significance of the difference of those fractions, and then it sums it up across all the steps. (Audience: And there is a variance term in the model.) Variance? What do you mean by variance? (Audience: The variance term is a function of the...) Yeah, that's right. OK, so now it tells you whether they're different. Yeah, sure. First of all, these Kaplan-Meier curves are not really tied to treatment that much, right? But one thing to do is to make sure that you have a good representation of the clinical parameters in your two groups. For instance, if the cancer is age-related, make sure that your cohort is not very biased in terms of the age of the patients. But then again, the outcome may be related to age without you knowing it in advance. Then when you start following up those patients, and you finally see that they fall into two very significantly different subgroups, and you identify age as a significant contributor to survival, then that's your finding: your poorly behaving subgroup of patients is the older patients, and you report that age is a very strong contributor to overall survival. I will talk very briefly about the Cox regression model, which is designed to evaluate the relative contribution of multiple parameters to overall survival; using that model you can assess age or other parameters in terms of their contribution.
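As a rough illustration of how the Kaplan-Meier curve described above is built from a life table, here is a minimal Python sketch. It is not taken from the lecture or the lab; the function name `kaplan_meier` and the toy cohort are hypothetical and chosen only for illustration. At each time where an event occurs, the running survival probability is multiplied by the fraction of at-risk patients who survive that time; censored patients count as at risk up to their own time and then simply leave the risk set without causing a drop.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # Step down: fraction of at-risk patients surviving this time point.
            survival *= (at_risk - d) / at_risk
            curve.append((t, survival))
        # Events and censored observations both shrink the risk set afterwards.
        at_risk -= leaving[t]
    return curve


# Hypothetical toy cohort: months of follow-up and event indicators.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_times, toy_events)
```

In this toy cohort the two censored patients (at months 2 and 4) never produce a step, but they do enlarge the denominator for the steps that happen while they are still being followed, which is the sense in which censored observations contribute to the left part of the curve. It also shows why a small group gives big steps: with few patients at risk, losing one changes the fraction a lot.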
(Audience: I think ideally everybody in the two groups is exactly the same except for one trait, and then the survival is the criterion.) Yeah. (Audience: But the more differences there are, the more they'll confound what you're finding.) Yeah, and that's actually almost always the case: there is a combination of factors, sometimes dependent, sometimes independent, that contribute to the overall survival. Oh, OK. Yeah, I think we'd better finish now.
OK, so as I said, the log-rank test gives you a p-value, so it gives you the significance of the difference, and another metric, called the hazard ratio, tells you how much different. It gives you one value, greater than one or less than one, which compares the risks in two groups, group one and group two. If the hazard ratio is less than one, it means that group one is less strongly associated with poor outcome than group two. Conversely, when it is greater than one, it means that the group that is in the numerator here is more strongly associated with poor outcome than the other group. Here, for instance, I'm giving you an example from the Oncotype DX gene signature for breast cancer. They report the hazard ratio, and they compare it in a multivariable Cox regression model with other parameters such as age and clinical tumor size, and the recurrence score based on the gene expression signature was the most powerful, the most informative. It means that the group that has a high recurrence score is three times more likely to recur sooner than the group with a lower recurrence score. Compared with that, the effect of clinical tumor size is much smaller, and age is really not contributing much at all. Yeah, well, this one is not significant, right, and neither is this. Yeah. OK, so to evaluate, again, the contribution of
multiple variables to overall survival we normally use the Cox proportional hazards regression model. Basically this is the formula here; it's quite a complex statistical approach. This is the hazard function, which is related to the survival function that I showed you with the Kaplan-Meier curves: the survival is the exponential of the negative of the cumulative hazard. So the higher the hazard function, the stronger the association with poor outcome. The model takes into account multiple variables, which may or may not be independent, and it will tell you whether they're independent or dependent. These x's are our variables, and the regression coefficients, the b's, b1 to bp, are the coefficients that will be estimated by the regression model. The magnitude of those coefficients tells you how much of a contribution each individual variable makes.
So how do we interpret the Cox model? This is a table that shows you some examples, and there are basically two main things to look at. Number one, of course, we look at the p-value. Number two, we look at the individual coefficients, b1, b2, b3; we look at the sign and the magnitude. The sign pertains to positive or negative association with poor outcome: if it's negative, the variable is associated with good outcome; if it's positive, it's associated with poor outcome. And the larger the coefficient, the stronger the association with poor outcome. So for example, here the regression coefficient is negative, and it is now translated into a percentage of relative risk. You see that if we increase serum albumin by one, the risk is not increasing but actually decreasing: it will be 95 percent of the risk at the lower level of serum albumin. As opposed to that, serum bilirubin here has a very large coefficient, and it translates to this much of an increase in relative risk if this is
increased by one. And the "by one" comes from this formula; I'm not going to go into that level of detail. So basically what you need to look at is the p-value, number one, and the sign and magnitude of the coefficient. And here, this is our last slide, I'm showing you this again from the lung cancer biomarker development example that I gave you earlier. Again, the higher the hazard ratio, the stronger the association with poor outcome. One thing that I wanted to emphasize here is that when you do a multivariable Cox regression it becomes more powerful than the univariate one. For instance, here you can see that the hazard ratio is quite low in the univariate analysis, which means that only one parameter, the gene expression signature, was taken into account; so it was assumed that only the status of the gene expression signature was contributing to overall survival. If we then take that gene signature together with multiple clinical parameters, which can be dependent on the gene signature and on which the gene signature may depend, you suddenly see a higher hazard ratio. Unfortunately also a wider confidence interval, but still, it is more powerful than the univariate Cox regression. So this is quite a complex statistical approach, and I would highly advise you not to do it yourself unless you're a biostatistician. It is absolutely a must to involve a biostatistician in running this. So this was my last slide, and we'll take a coffee break if you'd like to do so.
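To make the reading of Cox coefficients described above a little more concrete, here is a small Python sketch, assuming the usual proportional hazards form in which the hazard is a baseline hazard multiplied by exp(b1*x1 + ... + bp*xp). The function names and the coefficient values are hypothetical and chosen only for illustration; they are not the numbers from the slides. The sketch only converts already-estimated coefficients into relative risks; it does not fit a model, which, as said above, is a job to do together with a biostatistician.

```python
import math


def hazard_ratio(coefficient, delta=1.0):
    """Relative change in hazard when one covariate increases by `delta`."""
    return math.exp(coefficient * delta)


def relative_hazard(coefficients, covariates):
    """exp(b1*x1 + ... + bp*xp): hazard relative to the baseline hazard."""
    return math.exp(sum(b * x for b, x in zip(coefficients, covariates)))


# Hypothetical coefficients, for illustration only.
b_protective = -0.05  # negative sign: associated with good outcome
b_harmful = 0.90      # positive sign and larger magnitude: poor outcome

hr_protective = hazard_ratio(b_protective)  # below 1: risk decreases
hr_harmful = hazard_ratio(b_harmful)        # above 1: risk increases
```

With these made-up values, a one-unit increase in the first variable leaves the risk at roughly 95 percent of what it was, while a one-unit increase in the second more than doubles it. This is the sign-and-magnitude reading of the coefficient table: the sign gives the direction of the association, and the exponential of the coefficient gives the size of the effect per unit increase.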