Okay, so let's proceed with the analytical part of our day. We will be talking about survival analysis and about clinical data. Disease characterization nowadays includes both scanning of the whole genome and transcriptome, which you heard so much about yesterday, and the clinical data. These two types of data need to be analyzed together as a whole, but at the same time they are intrinsically different. So what is clinical data? I will be referring to these as clinical variables. This is a typical example of the clinical data that you can get out of published studies. This was a cancer study, the breast cancer paper that we've been using throughout the course. And these are the different clinical variables here. The ID of the tumor, race, family history, which is binary data, yes or no. Nodal status, again yes or no, and in some cases the number of nodes that are involved. Then radiation: was the patient treated with radiation, yes or no. Then the same for chemotherapy or hormonal therapy. Then there may be some histological characterization of a tumor sample, including some marker proteins measured by immunohistochemistry. Then the stage of a tumor and the size of a tumor. It can be the age at diagnosis, and it can be the level of certain markers. In this case it's a hormone level of progesterone and the estrogen receptor level. Then some other types of typing, the staging and grading of tumor samples. But there are also a number of clinical variables which have something to do with time. That is, for example, the overall outcome, whether the patient is dead or alive at this particular time point. What was the overall survival time? What was the disease-specific survival time? Others may include something like the recurrence status or the time to relapse, the distant recurrence status, and so on and so forth. So these variables have something to do with time.
That's called survival time. These are the times to a given endpoint. And these types of data require a special analysis, which is called survival analysis. Why is that so? I'm going to explain it to you. I just have to mention that we have allocated a big slot in our workshop for this part. The most important thing is for you to understand the basics of survival analysis: what it's for and what's behind the statistical approaches in this part of statistics. So that when you come back to your research, you are able to do these simple analyses yourself. But in order to do them, you really have to understand what's behind them. So you are welcome to jump in and ask me any questions if you don't understand, because it's really of paramount importance that you understand every step. So the main goals of a survival analysis are these ones. For example, can we estimate the probability of an individual surviving for a given time period, say for one year? Say you have a cohort of patients who are of a certain age and who have been treated with a certain drug. You have recorded the survival times and the clinical outcome. Then you analyze the data that you have, and using that data you can predict the survival probability of a new patient coming in. The method which is used for this purpose is a Kaplan-Meier survival curve or a life table. No, not necessarily. No, it works with any disease. Now, other goals might be, for instance: okay, we have two groups of patients, right? One group of patients is treated with a new drug, and then you have a group of control patients who were given some standard treatment or were not treated at all. And you want to answer the question: is this drug really working for these particular patients with this disease under these conditions? For that type of goal, we use the log-rank test, which compares different survival curves, Kaplan-Meier curves.
And now there may be other goals, such as: what are the risk factors that contribute to a poor prognosis of a patient? Say we have characterized samples using multiple technologies, multiple platforms, and we have a great amount of clinical data for these particular samples. So now what are the risk factors? This is a very important question when a new patient comes in and that patient exhibits a certain pattern of values of those variables: the expression level of this marker or of that marker, this particular mutation or that particular mutation, age, and the tumor is of this size and of this grade. So now what is the prognosis for this patient? How is the patient going to do in the future? That is a very important question in order to choose the appropriate therapy for a given patient. If the prognosis is poor, then the patient has to be exposed to an aggressive therapy. For that type of goal, we use a Cox regression model, and it may be a univariate or a multivariate Cox regression. Univariate means that you take one variable into your regression model and estimate its effect on survival. Or you can test many clinical variables at the same time in the Cox regression model, and that would be called a multivariate regression model. Those different variables would be the covariates: clinical parameters, and it could be genomic data as well. So if you read carefully through the Chin paper, if you have it handy maybe, I would prompt you to look into that paper and find a table where they have the list of the high-level amplicons which were associated with poor prognosis. Have you found it? What's the number of it? So this is table one, right? It says univariate and multivariate associations for individual amplicons and disease-specific survival and distant recurrence. Okay?
So within this analysis, what I have told you so far is that the variables in the multivariate Cox regression model would be your clinical variables, such as these ones: nodal status, radiation, chemotherapy, hormonal therapy, and so on and so forth. So these could be one type of covariate in that model. The other type of variable could be genomic aberrations, and that's what they've done and summarized in table one. Their covariates were all those 11 high-level amplifications. We will be practicing the Cox regression model using clinical data only, like this one, because what they've done there is that they basically defined those high-level amplicons manually, and that is a little hard to do within the time frame that we have. So I'm going to let you practice yourself using clinical variables, and this is the most widely used variant of this analysis in studies. Manually means that they just looked through the tumors themselves and defined the extent of those amplicons. It was back two years ago, when there was no automatic method. Yes. And then they tried to summarize the overall amplification level in some custom way. So we just don't have this data in order to put it into the Cox model here, and we will be using just the clinical data that is available for this data set. Yes, in R. Yeah, we'll be doing it in R. But I would like to have our lab today in a slightly different style than yesterday. There are just a few methods here, but they have to be used carefully, so you really have to understand what you're doing. What I will be doing during the lab today is that I will walk you through the whole process, so we will be doing it together. And once you understand what you're doing, you will be prompted to perform some simple exercises on your own. So this is the survival data. The survival data, or survival time, is the time from a fixed starting point to a certain endpoint. So what could be the endpoint?
The endpoint is the outcome in our case, and that is death or recurrence or relapse. What could be the starting point? The starting point could be the time of surgery, the time of diagnosis, or for example, the time of treatment. The intrinsic difficulty with survival data is that our observations are incomplete, meaning that we almost never observe the event of interest, such as death or recurrence or relapse, in all of our subjects. By the time you do the analysis, a portion of patients have reached the endpoint and another portion have not. Those who have not are incomplete observations, because the event of interest hasn't occurred. Those observations are called censored observations. Another difficulty with survival times is that they do not follow a normal distribution. So they require special analytical techniques, which we are going to cover now. I'm just going to explain a little bit more what the censored observations are. For those outcomes, you cannot analyze the three possible outcomes at once. Each one is yes or no: death, recurrence or relapse. You choose a particular endpoint and then you analyze it. So you can work with the relapse time, or with death, but not with the three of them together. You have to select your endpoint. Yeah, and very often people just do multiple analyses using different endpoints. Any other questions? So what are the censored observations? Again, they arise whenever the dependent variable of interest represents the time to a terminal event or to some event of interest and the duration of the study is limited in time. So again, the censored observations are incomplete observations: the event of interest did not occur by the time of analysis. This table shows you a few examples of censored observations from different areas. In our case, the event of interest is death from the disease, and the censored observation is going to be still alive.
Now, if our endpoint is the relapse time, what is the censored observation? The patient has not relapsed yet, yes. Or if our endpoint is the distant recurrence of a tumor, what is the censored observation? Okay, that's right. Now from other areas: survival of a marriage, for example. The marriage didn't survive, they divorced. That's the event of interest. The censored observation is going to be still married. Surprise. Another example is the dropout time from school. Has the child dropped out? No, still in school. So the event of interest has not occurred, and this is a censored observation. It is necessary to understand the different types of censoring. The most important ones for our understanding are type one and type two censoring, and right and left censoring. Type one and type two censoring refer to which quantity is fixed. For example, there is a follow-up of patients and we decide to terminate the follow-up in the year 2006, or 2009, whatever. So the time of the analysis is fixed, and you just explore what proportion of your patients have reached the endpoint and what proportion have not. So with type one censoring, the time of the analysis is the fixed thing, and your variable is the proportion of your patients. Type two censoring, which is used much less frequently in our studies, is when the study is conducted up until the point when, for example, 50% of your subjects have reached the endpoint. Any questions? How do you handle those two types when patients enter the study? With the tests, you just specify what type of censoring you have. So do you follow the patients until the end of the study, or do you start the study once a certain number of patients have been recruited into it? Yes, exactly. That's what usually happens in a trial. For example, the trial is planned for about five years, right?
And then during, say, the first one or two years, you're recruiting patients. After that, you follow them up until the end of the study. So what is right and left censoring? This type of censoring refers to the time continuum. With right censoring, there is a known starting point, but there is no certain endpoint. This is the example that I mentioned to you: these patients have reached the endpoint and these ones have not. It means that by the time we conduct the analysis, those patients are still alive and you don't know what the survival time is going to be, right? That's why it is called right censoring. Now, I want to mention another type of censoring before going to left censoring, because left censoring is a part of it. This is called interval censoring, and it means that the event of interest occurred within some interval of time, right? This would be an example of that. Say the death occurred within some period of time, and you don't really know when exactly. Or the relapse occurred within some period of time: when the patient comes in with a relapsed disease, you don't really know when exactly within this interval the relapse took place, right? So look here, for example. With left censoring, you know the endpoint, but you don't know the starting point. That is, for example, the start of the disease. When the patient comes into the clinic, the patient already manifests some disease, right? But when the disease started, you don't know. That is an example of left censoring. Yeah, this is what I'm going to talk about right now. We're now moving to the analysis of the survival time and survival probabilities, which is the basis of the Kaplan-Meier survival analysis. For instance, this is a trial with patient accrual during the first six months, and then following the patients up until 18 months in time.
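As a minimal sketch of how a trial like this produces right-censored data under a fixed study end (type one censoring), each patient can be reduced to a follow-up time and an event flag. The `follow_up` helper, the field names, and the patient values below are hypothetical and chosen only for illustration; the 18-month study end follows the example above.

```python
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    time: float   # follow-up time in months, counted from enrollment
    event: bool   # True = death observed, False = censored (still alive)


def follow_up(enrolled: float, died: Optional[float],
              study_end: float = 18.0) -> Observation:
    """Convert calendar months into a (time, event) pair for a fixed study end."""
    if died is not None and died <= study_end:
        # Complete observation: the event happened before the analysis.
        return Observation(died - enrolled, True)
    # Right-censored: we only know the patient survived at least this long.
    return Observation(study_end - enrolled, False)


# Hypothetical patients: (enrollment month, month of death or None if not observed).
patients = [(0.0, 7.0), (2.0, None), (5.0, 16.0), (6.0, None), (1.0, 20.0)]
observations = [follow_up(enrolled, died) for enrolled, died in patients]

for obs in observations:
    print(obs)
```

Note that the last hypothetical patient dies after the study end, so at the time of analysis that patient is still recorded as censored.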
So as you can see, the follow-up times for different patients are completely different, right? Moreover, by the end of the study, some of them are dead, shown in black, and some of them are still alive. So now, how do you infer the survival probability? The first step is that you order your patients according to the time of follow-up. Then you divide your time scale into intervals which contain exactly one case of a complete observation, where a patient has reached the event of interest, in this case death, and you ignore the censored observations when defining the intervals. Okay? So you have divided the time scale into these intervals. Then at every interval, you compute the proportion of those who die among those who are still at risk, who are still alive. That gives you a probability for every particular interval. For example, the probability of surviving to month two is the probability of surviving month one, multiplied by the probability of surviving month two, provided that the patient has survived month one. What does it mean? It means that when you compute the probability of a patient surviving month two, the patient had to survive month one. This is called a conditional probability, because it is conditional on surviving month one. And this is how it's done in the Kaplan-Meier survival estimator. The survival probability is just the product of all of those conditional probabilities over the time intervals. And every time interval, again, contains exactly one case of a complete observation. The survival probability can be represented with a table, and in that case it's going to be a life table analysis. Graphically, it can be represented by a Kaplan-Meier curve. So this is again your probability of survival, which comes from this formula. It is just more mathematical syntax: the product of these proportions, built from the patients who are still at risk.
The patients in F are the failures, the patients who have reached the endpoint. So this is your survival probability function. It is a step function, because the probability changes only when you have your event of interest. Once a patient dies, you have a drop. The next patient dies, you have a drop again. For example, you have two survival curves here, one for one group of patients and one for another group of patients. This is your time scale and this is your probability of survival. It tells you that over time the survival probability of your patients in both groups is declining, and it's declining more rapidly in this group than in this group. So now let me ask you this question. You have a patient coming in to the clinic, and you know that this patient is under the same conditions as the patients who were used to construct this Kaplan-Meier curve. What is the probability of that patient surviving two and a half months? 50%. That's right. So that's how you interpret those curves. Now let's say these are the patients who were not treated with a drug. What if we treat this patient with a drug? Do we increase the probability of his or her surviving two and a half months? Yes, the probability really increases. So that's how we interpret curves like this. I'm sorry. The censored ones? Yeah, I'm sorry. Thank you. I didn't mention that. You see the vertical tick marks? These indicate censored observations. If you look into this formula, this is the number of patients at a given time interval who are still at risk, who have not died, right? If you have censored observations which happened before that time interval, then you just remove them from this number. That's how the censored observations are accounted for, and that's how you get those vertical tick marks. Sure.
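The product of conditional probabilities described above can be sketched in a few lines. This is a minimal illustration and not the routine used in the lab; the function name `kaplan_meier` and the toy follow-up times are assumptions made for the example. Censored patients contribute no drop of their own, but they are removed from the number at risk for all later intervals.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (event time, survival probability) pairs of the step function."""
    data = sorted(zip(times, events))
    at_risk = len(data)      # patients still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0          # deaths plus censored observations at time t
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1])
            leaving += 1
            i += 1
        if deaths > 0:
            # Conditional probability of surviving this interval,
            # given survival up to it.
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= leaving   # censored patients leave the risk set here
    return curve


# Hypothetical follow-up times in months; False marks a censored observation.
times = [1, 2, 2, 3, 4, 5]
events = [True, True, False, True, False, True]
curve = kaplan_meier(times, events)

for t, s in curve:
    print(f"month {t}: S = {s:.3f}")
```

Each printed row corresponds to one drop of the step function; the censored patients at months 2 and 4 only shrink the number at risk.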
So for instance, you have these two groups, you plot the Kaplan-Meier curves, and you see that they separate. Now the question is, is it really worth putting this drug into the clinic? Is it really effective for these patients? Well, yeah, the two curves actually separate, so let's go ahead and do it. But the question is: is the difference between these survival experiences actually statistically significant? We use another test to answer this question. The Kaplan-Meier curve is more for estimating the probability of how the patient will do in time under given conditions, and for just the visual examination, the initial examination, of the survival experiences of different groups of patients. So now what is the test for that? Yeah, but in any case, this is an estimate. This is an estimator that has been fitted using the data from your cohort. If there is a larger number of patients, that's good, right? But you always want to refer to the next test, which is the log-rank test, in order to say whether the difference is really statistically significant. So you can estimate the number of patients in each group. Meaning, on the red line, you can add up every drop and every tick mark on the plot, and the same for the green line. So you can see that there are a bunch of patients on the green line, and fewer patients on the red line, like this. So I get the number of patients, right? Yeah. The number of vertical drops and the number of vertical tick marks. Yeah, you see, there is a multiple drop there. Yes, that's right. Okay, so now the log-rank test. This is a nonparametric method to test the null hypothesis that the compared groups are samples from the same population with regard to survival experience. The survival time does not follow a normal distribution, so we need a nonparametric method.
And you can think of doing some simple analysis, such as a comparison of median survival times in two groups. But it's not that adequate, because you are looking just at the average picture, whereas you have different time intervals, right? That's what the log-rank test is for. It tells you whether the difference between two survival experiences is statistically significant, but at the same time it doesn't tell you how different they are. There are other measures, which I will briefly mention, that serve that purpose. So now, how is this log-rank test computed? You have two groups of patients. Then, again, you divide your time scale into intervals. You test the significance of the difference for every individual interval separately, and then you sum up. That's going to be the overall statistic. Okay, so now how do you define your intervals? I'm going to tell you. The intervals are defined only by your events of interest. You ignore your censored observations, but you take all of the complete observations from both groups and you merge them. For instance, here, this is one time interval, and this is the next interval in this group. But there are many more intervals from the other group in between, so you include them. So you merge them, right? That's how you get your intervals for the log-rank test. Then for every interval, you just compare the proportions of patients in the two groups, which is similar to the chi-square statistic for two by two tables. And then you just summarize it across all the time intervals. This is the basic formula for the chi-square test, where you have your observed measurement and your expected value. The chi-square test gives you this estimate of how different two proportions are. This is the essence of the chi-square test.
And the log-rank test is similar to that, with one exception: here you have not the expected value in the denominator, but the variance of this observed-minus-expected expression. Similarly to the chi-square test, you get the statistic, and you compare it with the chi-square distribution with k minus one degrees of freedom, where k is the number of groups you compare. To what curves? So, oh, hello. What's wrong with this? Okay. All right, so this is the essence of the chi-square test. You compare two proportions, so you have a two by two table, where you have one group and another group, and you have a proportion of events in one group and a proportion of events in the other group. For example, in one column you have group one, and the other column is group two. Now you have your event of interest, death. Within group one, you have eight patients dead and 15 alive. Thank you. And in the second column you have, say, 15 patients dead and three alive. These actual numbers are your observed measurements in the chi-square test. What you do next is calculate the proportion of patients dead in each group. You just sum up the column, which is the number of patients in your group, and then you take the fraction of those who are dead. Then you compare these proportions. The expected values come from the sums of the columns and rows of this two by two table. This is the essence of the chi-square test, and it is similar for the log-rank test. When you get the statistic, you again compare it with the chi-square distribution with k minus one degrees of freedom, and you get your p-value. Now the log-rank test, as I told you, tells you whether the difference between two survival experiences is statistically significant. But at the same time, it doesn't really tell you how different they are.
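The interval-by-interval comparison described above can be sketched as follows for two groups. This is a minimal illustration under stated assumptions: the function name `logrank_two_groups` and the toy data are hypothetical, the variance term is the usual hypergeometric variance for a two by two table, and with two groups the statistic is referred to a chi-square distribution with one degree of freedom, whose tail probability can be written with the complementary error function.

```python
import math
from typing import Sequence, Tuple


def logrank_two_groups(times_a: Sequence[float], events_a: Sequence[bool],
                       times_b: Sequence[float], events_b: Sequence[bool]
                       ) -> Tuple[float, float]:
    """Return (chi-square statistic, p-value) of a two-group log-rank test."""
    # Intervals are defined by the merged event times of both groups;
    # censored observations do not define intervals.
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n                 # expected deaths under the null
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        raise ValueError("not enough information to compute the statistic")
    statistic = (observed_a - expected_a) ** 2 / variance
    # Tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical follow-up times in months; False marks a censored observation.
treated_times = [6, 7, 10, 15, 19, 25]
treated_events = [True, True, False, True, False, True]
control_times = [1, 2, 3, 4, 8, 9]
control_events = [True, True, True, True, True, True]

stat, p = logrank_two_groups(treated_times, treated_events,
                             control_times, control_events)
print(f"chi-square = {stat:.3f}, p = {p:.4f}")
```

A small p-value here only says that the two survival experiences differ; it does not say by how much.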
And here's a very simple measure that gives you the extent of the difference between two groups. This is called the hazard ratio, and it measures the relative survival in two groups based on the complete period studied. It is a very rough estimate, but at the same time it gives you the overall picture. For example, what does a hazard ratio of 0.43 mean when you compare group one to group two? It tells you that the relative risk, or hazard, of a poor outcome under the condition of group one is 43% of that of group two. So group two is definitely doing worse than group one. Yeah, so now we're moving to the Cox proportional hazards model, which is a slightly complicated topic. I will try to explain as much as I can so that you understand what it is and how to interpret the results of this test. It is used to investigate the effect of several variables on survival experience. It is a multivariate proportional hazards regression model with a number of different variables taken into account, which are called covariates. This is the formula for the hazard function as a function of time, which is some baseline hazard multiplied by the exponential of this term. So what is this term? It is a sum over all your independent variables of interest, such as the size of a tumor, the stage of a tumor, nodal status, chemotherapy, or a protein level, for that matter. And these are the regression coefficients for those variables, which are to be estimated by the model. So you have a number of covariates, clinical variables, and you want to explore their effect on the survival of a patient. Those variables come together here and, as a result, you get these coefficients that you will need to interpret. So this is a hazard function, right? The hazards regression model has one assumption: that the effect of your variables is constant over time and additive on a particular scale. That's the assumption. It is additive and it's constant over time.
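For reference, the hazard formula described above can be written out as follows. The slide itself is not part of the transcript, so the symbols are assumptions: $h_0(t)$ stands for the baseline hazard, $x_1, \dots, x_p$ for the covariates, and $\beta_1, \dots, \beta_p$ for the regression coefficients.

```latex
h(t \mid x_1, \dots, x_p) = h_0(t)\,
  \exp\!\left(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right)

% Ratio of hazards for two patients with covariate values x and x':
\frac{h(t \mid x)}{h(t \mid x')}
  = \exp\!\left(\sum_{i=1}^{p} \beta_i \,(x_i - x'_i)\right)
```

The baseline hazard cancels in the ratio, so the ratio does not depend on time. That is the sense in which the effects are constant over time, and the sum inside the exponential is the scale on which they are additive.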
Now, similarly to Kaplan-Meier, the hazard function gives you the risk of dying at a given time, assuming survival thus far, right? So far I've been talking about the survival probability. The survival probability would be this, the exponential of minus the cumulative hazard. So you can actually construct the same survival curves using the survival probability that comes out of this regression model, and we will be practicing that too in our lab. Just a couple more points that I wanted to emphasize. It is a cumulative function, which means that over time the cumulative risk of dying is increasing with this particular set of variables or that particular set of variables, right? And again, as with the Kaplan-Meier curve, the immediate outcome of this is that for an individual who comes in with a particular set of values of your variables, you should be able to estimate that patient's survival using this Cox model. Well, these should be censored observations. One of the examples of a censored observation is when the patient falls out of the study for some reason. For example, he moves away or he dies in an accident. That is a censored observation. So now it is very important to understand how we interpret the results of a Cox regression model. This is a very simple example that I took from a book that I personally really love, Practical Statistics for Medical Research by Douglas Altman. This is from the PBC trial, a comparison of patient outcomes using this drug versus placebo. This was the number of patients, and these were the clinical variables that were read out from the patients. And this is the result of a Cox regression model, the multivariate regression model, using all of these variables. As I said, the outcome of the regression model is the coefficients and the p-values of whether... Yeah, I'll mention it later.
So here what you see is that they would usually give you the regression coefficient itself, then the standard error of the coefficient, and then the exponential of the coefficient. How do we interpret it? Two important points. When you look at the coefficient, you look at the sign, whether it is positive or negative. A positive sign means that the higher the value of this variable, the more risk you have of dying. A negative sign means that the higher the value of this variable, the less risk you have. Now, can you estimate how much less or how much more from this table? Yes. You look at the magnitude of the coefficient, and more specifically at the magnitude of the exponential of the coefficient. If you take two values of the variable that differ by one, put them into the formula that I gave you before for the hazard, and take the ratio, then your baseline hazard is going to cancel, right? And you end up with this expression. So the exponential of the coefficient gives the change in your hazard if you increase your variable by one, expressed as a multiple of the original hazard. For instance, here you see that this one shows a negative coefficient, and the exponential translates to 0.95. This means that an increase in the level of serum albumin by one results in a hazard that is 95% of the original one. So this means a 5% decrease in risk, okay? Another example here is whether a patient receives a therapy. This is the coefficient, this is the exponential. Here a change of category is already an increase by one, because it's binary data. The patient either receives the therapy or does not, and the two cases are coded as zero and one. So for this dichotomous data, it's pretty simple.
If you don't have the therapy, then your hazard is 168% of the hazard when you do have the therapy, that is, your risk increases by 68%. Yes, yes. So a negative coefficient means that the variable is a favorable factor: it lowers the hazard of a poor outcome. The difference between what is the e and the x, e and what an option. Yeah, so this is a hazard of a poor outcome. Okay, and so now if you have this hazard function estimated, and you transform it to the survival probability, then you get a formula like this, where you have this term, which is frequently referred to as a prognostic index. It means that you can estimate the prognostic index for a given patient with given values of your variables. Then, having the survival probability, you can plot it, and it's going to be the same step function as in Kaplan-Meier and the same representation. These are the patients treated with the drug and these are the ones having a placebo. Okay, so now what have we learned from this part? Clinical data is a highly important component of our studies and is intrinsically different from genomic and transcriptomic data. Survival data is a special type of data which requires a special methodology, which we've covered. The main applications of survival analysis are the following. We estimate the survival probability of a patient for a given length of time, and we use Kaplan-Meier survival curves or life tables for that. Then we want to compare the survival experiences of different groups of patients, and we use log-rank tests there. This is commonly used to answer the question: is the drug working? And then we may investigate multiple risk factors that might contribute to the survival experience of a patient, in order to make a prognosis for a new patient with a given set of clinical variables and to choose an appropriate therapy. Okay, so now if you have any questions, I'll be happy to take them.
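As a recap of how the coefficient table is read, here is a minimal sketch. The two hazard ratios, 0.95 for serum albumin and 1.68 for not receiving the therapy, are the ones quoted in the talk. Everything else is an assumption made for illustration: the variable names, the patient's values, the baseline survival of 0.80, and the convention that the baseline corresponds to all covariates being zero.

```python
import math

# Exponentials of the coefficients (hazard ratios) as quoted in the talk.
hazard_ratios = {"serum_albumin": 0.95, "no_therapy": 1.68}

# The regression coefficients are the natural logs of the hazard ratios.
coefficients = {name: math.log(hr) for name, hr in hazard_ratios.items()}


def percent_change_in_hazard(hazard_ratio: float) -> float:
    """Percent change in hazard for a one-unit increase of the variable."""
    return (hazard_ratio - 1.0) * 100.0


def prognostic_index(coefs: dict, patient: dict) -> float:
    """Sum of coefficient times covariate value for one patient."""
    return sum(coefs[name] * patient[name] for name in coefs)


def survival_probability(baseline_survival: float, index: float) -> float:
    """S(t) = S0(t) ** exp(prognostic index) under proportional hazards."""
    return baseline_survival ** math.exp(index)


for name, hr in hazard_ratios.items():
    print(f"{name}: hazard changes by {percent_change_in_hazard(hr):+.0f}% per unit")

# Hypothetical patient and hypothetical baseline survival at some time point.
patient = {"serum_albumin": 3.0, "no_therapy": 1}
index = prognostic_index(coefficients, patient)
estimate = survival_probability(0.80, index)
print(f"prognostic index = {index:.3f}, estimated survival = {estimate:.3f}")
```

A negative coefficient gives a hazard ratio below one and a negative percent change, which is the favorable case; a positive coefficient gives the opposite.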