Okay, so let's start the next part of Module 8, which will cover survival analysis and the way we handle clinical data. This slide shows you the overview. We will go through the clinical data and survival analysis theory. It'll take us about half an hour, maybe 40 minutes, depending on your questions, and then it will be followed by a lab on survival analysis using the files that you're supposed to download from the wiki. So I will start with this slide and emphasize that disease characterization involves both molecular profiling, such as whole genome or whole transcriptome profiling of tumor samples, and the clinical characteristics of the given tumors. Clinical data is an intrinsically different type of data, which has to be integrated somehow with the molecular profiling data. Clinical data includes a number of clinical characteristics, such as race, family history of cancer, involvement of lymph nodes, which is often referred to as node status, radiation the patient may have gone through, hormone therapy, chemotherapy, the level of a pertinent protein, stage and size, age at diagnosis, grade of the tumor, and so on and so forth. Another type of clinical data arises when interest is focused on the time taken for some event to occur. One of the most common sources of such data is when we record the time from some fixed point, such as surgery, to the death of the subject, for example, or to disease recurrence after surgery. These are referred to as survival times or survival data. You can see here, in the red frame, an actual example of survival data. They require a special set of analytical tools, and this is usually referred to as survival analysis. Survival analysis has three main components, or three common goals, that are often used by researchers and are commonly seen in the literature.
So one goal may be to estimate the probability of an individual surviving a given time period, for example, one year. Here comes a patient to the clinic with breast cancer. We evaluate the patient with regard to a number of characteristics, and we would like to make a choice between, say, quite serious chemotherapy versus just watchful waiting. If a patient has a relatively good prognosis, there is no need to apply really heavy-duty chemotherapy, which usually comes with a large number of side effects and adds additional morbidity for the patient. So this is a very common question, and for this we use Kaplan-Meier survival curves or a life table. Another common goal is to compare the survival experience of two different groups of individuals, for example, patients who are taking the drug or going through chemotherapy versus the ones who are not, or who are on placebo. For that goal, we use the log-rank test, which basically compares the Kaplan-Meier curves of the individual groups of patients. And the third common goal is to detect clinical, genomic, or epidemiologic variables which contribute to the risk of the patient. So which variables are associated with poor outcome? For that, we use the Cox regression model: univariate in the case of just a single variable, but usually Cox is applied to a multivariate question, where you have a number of parameters and you want to test whether they contribute to the patient outcome. This is called a multivariate Cox regression model. The Kaplan-Meier curve and the life table are essentially the same: Kaplan-Meier is a graphical representation of the data that can be represented in a life table. So in clinical studies, survival times are often defined as the time from a fixed starting point to an end point. And there may be different kinds of events that can make up a survival time.
So for instance, the starting point can be surgery and the end point can be death, recurrence, or relapse of the disease. Another combination is diagnosis as the starting point and, again, death, recurrence, or relapse as the end point. Treatment can also be taken as the starting point, and again the end point can be recurrence, relapse, or death. There is one inherent feature of survival times that makes them unique and unsuitable for the common statistical methods, and that is that we almost never observe the event of interest in all of the subjects under study. For example, in a study to compare the survival of patients having different types of surgery for breast cancer, although the patients have been followed up for, I don't know, 10 years, there will be many who are still alive by the end of the study. For these patients we do not know when they will die. We only know that they're still alive, and we do not know their survival time from surgery, because the event in question, death or relapse, has not occurred. We only know that it must be longer than the time the patient was followed in the study. So this is an incomplete observation, and we call such survival times censored, to indicate that the period of observation was cut off before the event of interest occurred. These are censored observations. So again, this kind of data requires special analytical techniques, which I will describe now. Before I do that, I will elaborate a bit more on censored observations so that it's clear what I'm talking about. They arise whenever the dependent variable of interest represents the time to a terminal event and the duration of the study is limited in time. Another intrinsic feature is that a censored observation is incomplete: the event of interest did not occur by the time of the analysis or, say, the end of the study. So what would be examples of censored observations?
For instance, the event of interest is death from the disease, so this is our end point. The censored observation will then be: still alive. In social studies, for instance, survival of a marriage is the event of interest, and the censored observation is: still married. And drop-out time from school is another example of an event of interest. Still in school would be... what are you smiling at? I knew that would be a big surprise, but still it happens, you know. Sometimes it does. Still in school would be our censored observation. We also need to understand that there are multiple types of censoring, depending on the time continuum, and I will briefly touch upon that. There is type one and type two censoring, and there is right and left censoring. Type one censoring describes the situation when a test is terminated at a particular point in time, so that the remaining items are only known not to have failed up to that time. In that case the censoring time is fixed, and we basically see how many subjects have reached the event of interest. In type two censoring, we instead fix the proportion of individuals that must reach the event of interest, and then we measure the time it takes for this to happen. Usually type one censoring is what is used in biomedical research. Now, what are left and right censoring? This slide shows you the situation for right censoring up here, left censoring, and another type of censoring which is called interval censoring. Here we have individual patients lined up along the y-axis, and time along the other axis. Consider an experiment where we start with n patients and terminate the experiment after a certain amount of time. In this experiment the censoring always occurs on the right side, because we do not know whether, or exactly when, the patients who are still alive will one day die of the disease.
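To make type one right censoring concrete, here is a minimal Python sketch of how study dates could be turned into survival data. The function name `to_survival_record`, the patient IDs, and all of the dates are made up for illustration and are not taken from the slides; the only idea borrowed from the lecture is that follow-up stops at a fixed study end, and anyone without an event by then is recorded as censored.

```python
from datetime import date

STUDY_END = date(2021, 6, 30)  # hypothetical fixed end of the study (type one censoring)

def to_survival_record(start, event_date, last_seen, study_end=STUDY_END):
    """Return (time_in_days, event_observed) for one patient.

    start      -- starting point, e.g. date of surgery
    event_date -- date of death/relapse, or None if none is known
    last_seen  -- last date the patient was known to be under observation
    """
    # Observation stops at drop-out or at the end of the study, whichever is first.
    cutoff = min(last_seen, study_end)
    if event_date is not None and event_date <= cutoff:
        return (event_date - start).days, True   # complete observation
    return (cutoff - start).days, False          # right-censored observation

# Hypothetical patients: (start, event_date, last_seen)
patients = {
    "P1": (date(2020, 1, 10), date(2020, 9, 10), date(2020, 9, 10)),  # died during the study
    "P2": (date(2020, 3, 1), None, date(2021, 6, 30)),                # alive at study end
    "P3": (date(2020, 5, 15), None, date(2020, 11, 15)),              # dropped out early
    "P4": (date(2020, 6, 1), date(2021, 8, 1), date(2021, 8, 1)),     # event after study end
}

records = {pid: to_survival_record(*dates) for pid, dates in patients.items()}
for pid, (days, observed) in records.items():
    print(pid, days, "event" if observed else "censored")
```

Note that P2, P3 and P4 all end up as censored observations: for the analysis it does not matter whether the patient dropped out or simply reached the end of the study without the event.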
So we're not certain about the event, which most likely will take place in the future, and this is right censoring. Now left censoring: for example, we start with a number of patients and follow them up for a number of months or years, and we may not always know when the cancer actually started in a particular patient. So when did the disease actually occur in that patient? In that case we're dealing with left censoring. Interval censoring applies when we know that the event of interest has occurred within a period of time but we do not know exactly when. For example, a patient in the study went, say, abroad for a year, and the disease relapsed sometime during his travel. The patient gets back and sees his doctor, and it's obvious that the disease has come back, but now it is impossible to say exactly when. We only know the interval of time in which the event occurred. And just to note that right censoring is the most common type of censoring in clinical research. So now, yes? So I guess intuitively you would think it would be left censoring, because you never know when the cancer actually occurs in a patient. Yeah, so that's why this is not usually used as a starting point. As I said, what we can do is use the date of admission to the hospital. Usually it is a surgery, or a certain treatment, or putting a patient on a certain regimen that serves as the starting point, because indeed it is very hard to say when the patient got the cancer, unless the patient has a family history, the family history resulted in frequent screening of the patient with regard to a certain parameter, and so we know exactly when. Okay, so, yeah? Sorry, so, further to that.
So if you're doing something like that, where you're setting the start at, say, the time of diagnosis, then in, say, prostate cancer you might have a wide range of Gleason scores for those patients. So would you, in a greater study, perhaps take the Gleason score into account? Obviously it might be difficult to compare someone with a very high Gleason score with someone with a very low one, since the treatment, say, is very different. So do you take that into account and have studies in three years? Well, I'm not sure about Gleason score, but I can imagine that PSA level may be one of those. But, as you know, PSA is not a perfect predictor: there are some aggressive cancers with low PSA, and there are the completely opposite situations, where there is a really high level of PSA but no evidence of any malignant disease in the patient. And Gleason score, no, I don't think that it would be used in this way. You know what I mean, in principle, but if you start choosing like that, it's kind of an artificial start. What do you mean? Is there some need to accommodate that?
No, I don't really see the point of using Gleason score in this. What I have shown you are the more common starting and end points that people use. Basically, Gleason score pertains to the phenotype of the tumor cells, how aggressive they are, with the motility index, the invasion index, et cetera. And yes, it's related to the stage, to how far the tumor has progressed and how aggressive it is, but at the same time it's not used in this kind of analysis. So what I want to do with this slide is to explain to you how the survival time is used and how we can construct the Kaplan-Meier curves. Let's see how patients usually proceed through a study. The diagram on the left shows you the time continuum and the patients in the study. You see there are 18 months in total. The accrual of patients usually happens within the first, say, six months, and then they are followed up for a further 12 months, so the total period of follow-up is somewhere from 12 to 18 months. The most recently accrued patients, such as this one at the very bottom, are followed up for the shortest time. Now we see that by the end of this study four patients died, and these are represented by filled circles: here, one, two, three and four. Two dropped out of the study, and these are shown with asterisks. So this one was still alive but it didn't reach the end of the study, it dropped out, as did this one. Thus we have four firm survival times, the black filled circles, and six censored times. Now what we usually do is ignore the starting point for each patient, and we can reorganize this graph and get the picture that you see on the right, and we can reorder it by survival time to understand further how the data is handled. So now this
explains to you which is which: these are the dropped-out patients, these are the dead, and these are the alive patients. Now, what is the survival probability for a given length of time? You can understand it from looking at this figure on the right. The survival probability for a given length of time can be calculated by considering time in intervals. The probability of surviving to month two is the probability of surviving month one multiplied by the probability of surviving month two. And at each given interval of time, the probability of survival is the proportion of patients still at risk who have not reached the event. So the patients that dropped out obviously are censored, but the patients that reached the end of the study are also censored? In the probability analysis, would those two different types of patients be equally weighted? Yes, they're both censored observations, whether they dropped out or whether they reached the end point of the study without reaching the event. So the patients that stayed alive and reached the end point of the study, wouldn't they be weighted more? No, they were just followed up for a longer time, but they are still incomplete observations and they are still handled as censored observations. Okay, and just to say that this probability of surviving month two is a conditional probability, because the patient has to survive month one in order to make it to month two. So the overall probability, after we have calculated the probability at each and every time interval, is just the product of all of these individual probabilities. In reality, the time intervals contain exactly one case: as soon as any single patient dies, this constitutes an individual time interval at which we compute the proportion of patients who have succumbed to the disease. So the proportions are calculated at each time interval, and the series of such
calculations make up a so-called life table, which can then be represented by a curve called the Kaplan-Meier curve, and that's what you see here. You see two groups of patients, and for each of them there is one Kaplan-Meier curve. It's drawn as a step function: here is the survival probability, from zero to one, and this is time in months. At each drop we have a patient death, or a patient who has reached the event of interest. Censored observations are not considered as time intervals, but they are represented as tick marks on the curve. And this formula shows you the notation for the survival probability. Now, what is the probability that a patient survives, say, two and a half months, if that patient belongs to the red group according to a number of other parameters, molecular profiling or staging or something else? By going here, you can say that the survival probability is 0.5. What was the survival probability at month one? It is definitely higher. So this is how you can use any single Kaplan-Meier curve: you can estimate, for a patient who belongs to this cohort, the probability of survival for a given length of time. Now, how can we compare different groups using Kaplan-Meier curves? For studies in which we want to compare the survival experiences of two groups of patients, in this case red and green, we can construct a Kaplan-Meier curve for each group and see how they behave. In theory, we can simply calculate the proportions of surviving subjects in each group at a given time point and compare them. What you can definitely say from these two curves is that the survival experience of treated patients is way better than that of untreated patients. But that's just a visual evaluation. Now, can we derive some statistic that will tell us whether the difference in survival experience between the two groups is indeed significant?
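The life-table calculation behind a single Kaplan-Meier curve, as described above, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `kaplan_meier` and `survival_at` are hypothetical, and the ten follow-up times are invented, chosen only so that there are four deaths and six censored observations like in the diagram. The lab itself uses R.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (event_time, survival_probability).

    times  -- follow-up time for each patient (e.g. months)
    events -- True/1 if the event was observed, False/0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths > 0:
            # Conditional probability of surviving this interval,
            # multiplied into the running product.
            survival *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without causing a step.
        n_at_risk -= leaving
    return curve

def survival_at(curve, t):
    """Read the step function: estimated probability of surviving past time t."""
    s = 1.0
    for event_time, prob in curve:
        if event_time > t:
            break
        s = prob
    return s

# Invented follow-up times in months: 4 deaths, 6 censored observations.
times = [3, 5, 6, 8, 9, 12, 13, 14, 16, 18]
events = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t}: S(t) = {s:.4f}")
print("S(10) =", round(survival_at(curve, 10), 4))
```

The curve only steps down at the four death times; the censored patients simply shrink the number at risk for the later intervals, which is why they appear as tick marks rather than drops.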
And for that we use the log-rank test. It's a non-parametric method to test the null hypothesis that the compared groups are samples from the same population with regard to survival experience. If the p-value is low, then, as with any hypothesis test, we can say that the chance of seeing such a difference when the survival experiences are actually equal is quite low, so we conclude they are different. The log-rank test tells you whether the survival experiences of two groups are different or not, but it does not tell you how different. There is another metric used for that, which I will tell you about in a second, but this is the test that gives you a p-value with regard to the differences in survival experience of two groups or multiple groups. The method is based on a simple idea which avoids an arbitrary decision as to which particular time point to pick in order to compare the proportions of surviving patients. The intuitively simple approach to see if the survival experience is different would be to take a certain time point and just compare the fractions. But we do not know what particular point in time to pick for that purpose, so if we do choose some time point it will be very arbitrary. The log-rank test allows you to avoid that arbitrary choice of time point. The principle of the log-rank test is to divide the survival time scale into intervals according to the distinct observed event times in both subgroups, ignoring censored observations. Then the proportions at each interval are calculated and compared in a similar way as in the chi-square test. So the principle is to compare proportions at every time interval and summarize them somehow. This slide shows you the notation for the chi-square test and, below, for the log-rank test. Here we have k time intervals, and similarly to the chi-square test we have our observed proportions and expected proportions. We just subtract one from the other, sum it up, and then here we have the variance
of observed minus expected. Then, similarly to the chi-square statistic, we compare it with the chi-square distribution, with degrees of freedom equal to the number of groups minus one, and get the p-value. So that's the statistical background of this test. Now, I told you the log-rank test does tell you whether the survival experiences of two groups are statistically different, but it doesn't tell you how different. To tell how different, we may use the hazard ratio. The hazard ratio compares two groups differing in treatment or diagnostic variables or something else, and it measures relative survival in the two groups based on the complete period studied, so based on the complete timeline here. Here's the notation for this one. R, the hazard ratio, gives the relative event rates in the two groups. For example, when R is equal to 0.43, the relative risk, or hazard, of poor outcome under the condition of group one is 43 percent of that of group two. When R is, say, something like two, the rate of failure in group one is twice the rate in group two, so group one would be at a higher risk. It needs to be noted, though, that the hazard ratio is computed for the entire period of study and may not be consistent across the time intervals. Kaplan-Meier curves are essential in this regard, to visually inspect the consistency of the differences in survival experience along the curves, and so we can compute R for different time points and see how consistent it is. Now I will move on to the last method that I'm covering before we move on to the lab. This is a way more complicated statistical method, and I will only briefly touch upon it. It's called the Cox proportional hazards model, and it is used to investigate the effect of several variables on survival experience. Let's read through this: it says that it's a multivariate proportional hazards regression model to model survival times. Again, it is also called a proportional hazards
model because it estimates the ratio of the risks, the hazard ratio or relative hazard. There are multiple predictor variables, such as prognostic markers, whose individual contribution to the outcome is assessed in the presence of the others, and there is the outcome variable. The hazard function is closely related to the survival curve and represents the risk of dying in a very short time interval after a given time. Here's the formula for this. We have the x's as independent variables of interest, such as tumor stage, tumor volume, the expression value of our biomarker, the level of estrogen or testosterone, or something else we want to include in the model. The b's are the regression coefficients that will be estimated by the model. There is an assumption made under this model that the effect of the variables is constant over time and additive on a particular scale. Then, similarly to Kaplan-Meier, the hazard function is the risk of dying after a given time assuming survival thus far, so it's conditional. Then it's a cumulative function, and H zero is the cumulative baseline, or underlying, hazard function. Further, the probability that we are interested in, the probability of surviving to time t, can be expressed through the hazard function: it is basically the exponential of the negative of the cumulative hazard function. The Cox model must be fitted using an appropriate computer program, and we will be practicing it in R today. The final model will yield an equation for the hazard as a function of several covariates. So how can we interpret the output from the Cox model? Here on this slide you see the results of a Cox regression analysis of a randomized trial comparing a drug versus placebo. The chosen model included six variables, all of them listed here in the first column. They were initially tested for statistical significance individually, and they were all significant at the five percent
level. Now, an important result of this model is the regression coefficient, which you can see in the second column of this table for each of the covariates. There are two important features of the coefficient: sign and magnitude. The sign can be positive or negative, and that tells you the direction of the association with poor survival. So this one has a negative association with poor survival, and this one a positive association. Magnitude is the other important feature: it refers to the increase in log hazard for an increase of one in the value of that particular covariate. The important thing here is the exponential of the coefficient, which is here, and I will tell you why in a second. It tells you that by increasing, say, the serum bilirubin by one, you get this much increase in the risk for that particular patient, who has higher levels of serum bilirubin, in terms of survival. And, for example, with regard to therapy, which is a binary covariate, either you give a patient the therapy or you don't: if you don't give the patient the therapy, there will be an increase in the risk of dying of something like one sixty percent. That's how we interpret it. The hazard function is again a step function, and we can express the survival function through the hazard function, as I showed you in that formula above, and then plot it as a survival curve, similarly to the Kaplan-Meier curve. This is just one example. Here is the survival function, and just to mention, this is called a prognostic index, which you may see in the literature. Now, the power of the analysis, I have to stress, depends on the number of terminal events, deaths for example, or relapses of the disease, and sometimes it takes many, many years of follow-up to compile data that give sufficient power. That's why
studies normally use other end points instead of death, ones which are more frequent, for example recurrence time. Of course, increasing the sample size is another way to achieve power. Unfortunately it is not very simple to estimate the required sample size, but there are ways to help you with that, such as nomograms that can help you, for instance for prostate cancer, to calculate the proper sample size. Now I will briefly summarize what we have learned here, and then we will move on to the lab, where we will apply the knowledge that you have just acquired. Clinical data is a highly important component and is intrinsically different from genomic and transcriptomic data. Survival data is a special type of data requiring special methodology. The main applications of survival analysis include: estimation of the survival probability of a patient for a given length of time under given circumstances, for which we use the Kaplan-Meier survival curve; comparison of the survival experiences of groups of patients, for example to ask whether the drug works or surgery is helpful, for which we use the log-rank test; and investigation of the risk factors that may contribute to the outcome of the patient, with the goal of making a prognosis for a given patient and choosing appropriate therapy, for which we use the Cox regression model. So if you have questions, you can certainly start asking them. We can move on to the lab, and while people are pulling out the necessary files you can ask me some questions.
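As a recap of the second goal, comparing two groups, here is a hedged Python sketch of the two-group log-rank test described earlier: at every distinct event time it compares the observed deaths in group one with the number expected if both groups shared the same survival experience, then refers the summed difference, scaled by its variance, to a chi-square distribution with one degree of freedom. The function name `logrank_two_groups` and both data sets are invented for illustration. The hazard ratio is estimated here as (O1/E1)/(O2/E2), which is an assumption about the notation on the slide and only a simple approximation; a Cox model fit in R, as in the lab, would give a somewhat different estimate.

```python
import math

def logrank_two_groups(times1, events1, times2, events2):
    """Return (chi_square, p_value, hazard_ratio) for two groups.

    Assumes at least one event in each group; otherwise the
    ratios below would divide by zero.
    """
    # Distinct times at which at least one event (not censoring) occurred.
    event_times = sorted(
        {t for t, e in zip(times1, events1) if e}
        | {t for t, e in zip(times2, events2) if e}
    )
    o1 = o2 = 0        # observed events per group
    e1 = 0.0           # expected events in group 1 under the null hypothesis
    var = 0.0          # variance of (observed - expected) for group 1
    for t in event_times:
        n1 = sum(1 for x in times1 if x >= t)   # at risk in group 1
        n2 = sum(1 for x in times2 if x >= t)   # at risk in group 2
        d1 = sum(1 for x, e in zip(times1, events1) if e and x == t)
        d2 = sum(1 for x, e in zip(times2, events2) if e and x == t)
        n, d = n1 + n2, d1 + d2
        o1 += d1
        o2 += d2
        e1 += d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    e2 = (o1 + o2) - e1
    chi_square = (o1 - e1) ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    hazard_ratio = (o1 / e1) / (o2 / e2)
    return chi_square, p_value, hazard_ratio

# Invented follow-up times in months (1 = event observed, 0 = censored).
treated_times, treated_events = [6, 9, 12, 15, 18, 18], [1, 0, 1, 0, 0, 0]
untreated_times, untreated_events = [2, 4, 5, 7, 10, 13], [1, 1, 1, 1, 0, 1]

chi_square, p_value, hazard_ratio = logrank_two_groups(
    treated_times, treated_events, untreated_times, untreated_events
)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.3f}, R = {hazard_ratio:.2f}")
```

With the treated patients as group one, a hazard ratio below one means the treated group has the lower event rate, which is the same reading as the R = 0.43 example in the lecture.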