Okay, so let's start the next module. It is devoted to survival analysis, and I will start with a relatively short introduction to the methods. Running those methods is actually very simple, just a matter of running a single command in R, but it's really important to understand what's behind those commands and how to interpret the data that comes out of them. So in the beginning I will show you what it means to do survival analysis.

Disease characterization contains two major components: molecular profiling of the whole genome or whole transcriptome, and clinical data. These are two complementary kinds of data that should be used together in order to do reasonable research and get knowledge out of what you do. Clinical data is intrinsically different from the data you've been seeing so far in the course, the high-resolution molecular profiling data.

This slide shows you a typical example of the clinical data available for tumor patients. The clinical data would be all of these things: the ID of a patient, race, family history, nodal status (yes or no, or number of nodes involved), radiation or chemotherapy, hormone therapy, protein IHC, then staging, size of the tumor, age at diagnosis, hormone levels, et cetera. All of this is clinical information.

Now there's a special type of clinical variables, variables which pertain to time, and these are called survival times. For instance: overall outcome (dead or alive), which is a status related to the survival time; overall survival time of the patient; disease-specific outcome (dead or alive); disease-specific survival time; recurrence status; time to recurrence; time to distant recurrence; and status of distant recurrence. These are all clinical variables that are related to survival, to times, and so they require a special type of analysis, which is called survival analysis.

What is survival
analysis? We typically have three major goals in survival analysis. The first is to estimate the probability of an individual surviving for a given time period, say one year, given the disease characteristics that the patient presents. For this type of analysis we use a Kaplan-Meier survival curve or a life table, which are just different representations of the same data.

Another goal can be to compare the survival experiences of two different groups of individuals, say treated with a drug and with placebo. For that purpose you construct Kaplan-Meier curves, but they do not really tell you whether the survival experiences are statistically different. For that we use the log-rank test, which compares different Kaplan-Meier curves and gives you a p-value.

Another goal can be to detect clinical, genomic, or epidemiologic variables which contribute to the risk, that is, are associated with poor prognosis. There may be multiple variables that you want to take into account, and for that kind of analysis we use the multivariate Cox regression model, which tries to estimate the survival, or say the relative risk of poor outcome, taking into consideration multiple variables.

So, survival data: what is survival data? Survival time is the time from a fixed starting point to an end point. There may be different starting points and different end points; these are just examples. Say the starting point was surgery and the end point was death, recurrence, or relapse of the disease; the time between these two points is considered to be the survival time. Another example of a pair of starting and end points is the time of diagnosis to death, recurrence, or relapse. Another may be treatment as a starting point, with the same kind of end point: death, recurrence, or relapse.

The specific feature of survival data is that we almost never observe the event of interest in all subjects. At any given time in a clinical study, not all of the patients have reached the outcome; our outcome here is that they succumb to the disease and are dead.
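As a small illustration of the definition above, survival time can be stored as a duration plus a flag saying whether the end point was actually observed. This is only a sketch: the patient IDs, dates, study end date, and the helper `survival_time_days` are all hypothetical and are not taken from the course data.

```python
from datetime import date

# Hypothetical study end date and follow-up records (not from the course data).
# Each record: (patient id, starting point = surgery, end point = death, event observed?)
STUDY_END = date(2010, 12, 31)
records = [
    ("P1", date(2009, 1, 15), date(2009, 11, 15), True),   # died during the study
    ("P2", date(2009, 3, 1),  None,               False),  # still alive at study end
    ("P3", date(2009, 6, 10), date(2010, 2, 10),  True),   # died during the study
]

def survival_time_days(start, end, study_end=STUDY_END):
    """Days from the starting point to the end point, or to study end if no event."""
    return ((end or study_end) - start).days

# (id, survival time in days, status: 1 = event observed, 0 = not observed)
survival = [(pid, survival_time_days(s, e), int(ev)) for pid, s, e, ev in records]
for row in survival:
    print(row)
```

The time column alone is not enough: the status column is what tells the later methods that a patient's time is only a lower bound on the true survival time.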
So not all of them are dead; we have not observed the event of research interest (not clinical interest) in all of the patients. These incomplete observations are called censored observations, and certainly this type of data requires special analytical techniques.

So, a bit more on censored observations. They arise whenever the dependent variable of interest represents the time to a terminal event and the duration of the study is limited in time. This is an incomplete observation: the event of interest had not occurred at the time of the analysis. This table shows you a few examples of censored observations. Say the event of interest is death from the disease; a censored observation would be "still alive". If the event of interest is survival of a marriage, the censored observation is, well, "still married", big surprise. Another example: the event of interest is drop-out time from school, and the censored observation is that the student is still in school.

There are two types of censoring, type I and type II. Type I censoring is more frequently used. It pertains to fixing the time and then calculating the fraction of subjects that have reached the end point, and the rest who did not reach the end point. So the time is fixed, and this is type I censoring. Type II censoring happens when the proportion of subjects is fixed. Say we end the study whenever we reach 50% of events: once 50% of subjects have reached the end point, we stop the study, and that would be censoring of type II.

There is also right and left censoring, which pertains to the time continuum. Left censoring, for instance, is when you do not know when the starting point was. For instance, a patient comes into the clinic and is admitted on the basis of a diagnosis of a certain disease, but we do not really know how long the patient has had this disease. So this would be an example of left censoring. Now, data with right censoring
happens, for instance, when a patient drops out of the study. The patient was followed up for a number of years for recurrence of the disease, and suddenly the patient leaves the area, moves to another town, and just drops out of the study, so we do not know when this patient will relapse. This is right censoring. There can also be interval censoring, when the event of interest can be estimated to have happened within a certain time interval but not at an exact point in time. An example of that would be a patient who has come back to the clinic, and it's obvious that the patient has relapsed, because there is already substantial disease going on. But when exactly the relapse happened we do not know: maybe a month ago, maybe half a year ago, and the patient was asymptomatic and just didn't come to the clinic.

So how do we estimate the probability of a patient surviving for a certain period of time, and how are the Kaplan-Meier curves constructed? The graph on the left represents how the patients progress through the study. For the first six months patient accrual is taking place, and for the next 12 months they are followed up. You see that patients can join the study at different time points, so they will be followed up for different lengths of time. The clear circles are censored observations, and the filled circles are the patients who reached the end point, so they are dead. Now if you order these patients according to follow-up time, you end up with a graph like that on the right. Then each and every time the event of interest occurs, you can compute the fraction of patients who are still alive, and this will give you these probabilities. So the survival probability for a given length of time
can be calculated by considering time in intervals, where each interval corresponds to a single event of interest, a single death in our case. The probability of surviving to month 2 is the probability of surviving month 1 multiplied by the probability of surviving month 2, provided that the patient has survived month 1. So this is a conditional probability, and the final probability is the product of these interval probabilities.

The Kaplan-Meier curve is represented in this way: it's a survival probability, which is basically the fraction of patients still alive, over time. This declining curve tells you that over time fewer and fewer patients are still alive. It is a step function, because it is a fraction that changes only when an event occurs, so you can't draw a slope. And so the survival probability is the product of those fractions of still-alive patients across the time intervals.

So these are the step functions; they look like this, and these vertical tick marks are censored observations. Immediately, looking at these two curves, you can say that the poorer-outcome group is a much smaller group, because you can see big steps like that, and you can clearly see that there are far fewer patients in this group compared to this group.

Now, how do you use these Kaplan-Meier curves? The question is: what is the probability of a patient surviving 2.5 months, given that the patient presents symptoms characteristic of the group which was used to construct this Kaplan-Meier curve? The probability is 0.5. Now suppose you are dealing with two subgroups of patients, treated and untreated, and you are looking at their survival experience. Are these survival experiences significantly different? Can you tell this from how well the Kaplan-Meier curves separate? No, you can't: you need a p-value, and the log-rank test
gives you a way to estimate the significance of differences in survival experience. It is a non-parametric method to test the null hypothesis that the compared groups are samples from the same population with regard to survival experience. In other words, it tests the null hypothesis that there are no differences in survival experience. It tells you whether the survival experiences are different, but it doesn't tell you how different.

The way the log-rank test works is that the time scale is divided into time intervals corresponding to each and every event of interest. We compare the proportions of still-alive subjects in every time interval, and then we summarize this across all of the intervals. It's similar to the chi-square test, which is designed to compare proportions, and it's even expressed similarly. This is the chi-square test, with observed and expected proportions when you compare two groups, and the expression for the log-rank test looks like this, where you also have observed and expected proportions. Then you compare the chi-square statistic to the chi-square distribution with k minus one degrees of freedom, where k is the number of groups, and you get your p-value.

Now, the log-rank test does not tell you how different the survival experiences are, but the hazard ratio does. The hazard ratio is a simple measure: it measures relative survival in two groups based on the complete period studied. By the end of the study you have two groups with observed and expected proportions of people who are still alive, and then you just take a ratio, and this will be your hazard ratio. A hazard ratio of 0.43, for example, would mean that the relative risk, or hazard, of poor outcome under the conditions of group one is 43% of that of group two. So group one is doing better: the hazard for group one is only 43% of that for group two, so group one is doing
better.

Now, the downside of the hazard ratio is that it gives you a relative risk for the entire period of the study, but it is not necessarily consistent throughout all of the time points along the study. So it makes sense to check for consistency: compute the hazard ratio for a number of time points and then see how consistent it is across time intervals.

None of these methods can comprehensively take into account the effect of multiple variables on survival experience. For that purpose we have the Cox proportional hazards model, which is statistically quite a complicated model in itself. I will not give you a lot of details, but I will explain what it is, and we will be practicing its application on real clinical data. The goal of this model is to investigate the effect of several covariates on survival experience. If you have multiple variables, that will be a multivariate proportional hazards regression model; however, you can have a univariate one, when you simplify your analysis by using only one covariate.

Now, how does this method work? The hazard function behind the Cox regression model is estimated in the following manner. This is the hazard function right there, and x1 to xp are the independent variables of interest, such as tumor size, stage, hormonal status, perhaps amplification status of a certain region in the genome, or expression level of a certain gene of interest. The b's, or betas, 1 through p are regression coefficients that are estimated by the model. The model has the assumption that the effect of the variables is constant over time and additive on a particular scale. This h0(t) is the baseline hazard, and it plays a role similar to the Kaplan-Meier survival probability. The hazard function is the risk of dying after a given time, assuming survival thus far. It is a
cumulative function, and as I mentioned, h0(t) is the cumulative baseline, or underlying, function. The probability of surviving, similar to the Kaplan-Meier survival estimator, can be expressed through this hazard function, as in the following formula. So for every individual with given values of the covariates, the multiple variables in the model, we can estimate the probability of survival for a given length of time. These are the actual values of the clinical variables. A variable can be binary, say nodal status, zero or one. It can be categorical, like staging: one, two, three, four. Or it can be continuous, say an expression level of some sort. So it can be different types of data; if it's numerical, then it's continuous and quantitative.

The coefficients are exactly the things that we want to get out of the model. They are estimated by the model, and they are used to interpret the contribution of the corresponding variable to the risk of poor outcome for a patient. The expression within this bracket is often called a prognostic index, which some of you have probably heard of.

So how do we interpret the results of the Cox regression model fitting process? This is an example where you see a number of clinical variables that were used for a multivariate regression model, and columns two, three, and four contain the output of the regression model. The first of these columns is the coefficient itself, coefficient b; the second is the standard error of this coefficient; and the last is the exponential of the regression coefficient. Immediately, there are two main things we should be looking at: the sign of a coefficient and its magnitude. The sign means positive or negative association with poor survival. Whenever you see a negative sign, the variable has a negative association with poor outcome, which makes it a good prognostic factor.
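For reference, the slide formulas mentioned in this part are not reproduced in the transcript. What follows is the standard way of writing the Cox model, which is assumed (not confirmed) to match the slides; the symbols H_0 and PI are labels chosen for this note.

```latex
\begin{align*}
  % Hazard for an individual with covariate values x_1, ..., x_p
  h(t \mid x) &= h_0(t)\,\exp(\beta_1 x_1 + \dots + \beta_p x_p) \\
  % Prognostic index: the expression inside the exponent
  \mathrm{PI} &= \beta_1 x_1 + \dots + \beta_p x_p \\
  % Survival probability through the cumulative baseline hazard H_0
  S(t \mid x) &= \exp\!\bigl(-H_0(t)\,e^{\mathrm{PI}}\bigr),
  \qquad H_0(t) = \int_0^t h_0(u)\,du \\
  % Effect of raising one covariate x_j by one unit, all others held fixed
  \frac{h(t \mid x_j + 1)}{h(t \mid x_j)} &= e^{\beta_j}
\end{align*}
```

The last line is why the table reports the exponential of each coefficient: a negative coefficient gives a ratio below 1 (lower hazard), and a positive coefficient gives a ratio above 1 (higher hazard).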
Whenever you see a positive sign, the variable has a positive association with poor outcome. So the sign refers to the association with poor outcome.

So what is this exponential of b? It is used to assess the magnitude of the contribution of any given variable, and it is done as follows. The coefficient b is the increase in log hazard for an increase of one in the value of the covariate, so the exponential of b is the factor by which the hazard changes, and that's obvious from this formula down there. In other words, if you increase, say, the age by one year, you get a change in risk by this factor. It is a fold change, and it is relative to the baseline, which is the value before increasing by one. It's a bit confusing, so let me illustrate using these numbers. An increase of the value of this variable by one will result in this change relative to baseline. If it's 100%, then it's basically no change: it's 100% of what it used to be before the increase by one. Now, 95% means that if you increase by one you get a reduction of 5%, so you get 95% of what it used to be before you added one. Is that clear? So if you increase the serum bilirubin by one, in whatever scale it was measured, you get this relative increase: it's going to be 1,231% of the control, the control being the level of serum bilirubin less by one. That's how you can interpret this.

Now, as I mentioned, the hazard can be used for expressing the survival probability, and the estimated survival probability can be plotted very similarly to Kaplan-Meier curves. This is just an example for patients on the drug and on placebo who were analyzed using the Cox regression model.

Can you estimate average survival gains from treatment by modeling them as a shift to the right of the curve? Because it looks here like you're getting an average
gain in survival of approximately a year. Taking, for example, any particular point on the drug curve and going back to the placebo curve, it looks like that distance is approximately a year.

Now, whether it is statistically significant, the log-rank test would tell you. Whether it is clinically significant is another question. As with any statistical test, you always consider both the significance and the magnitude of your change. Even if it is statistically significant with a p-value of 10 to the minus 10, you may get a really tiny difference which is not clinically important. The same applies here.

Sure, but if I look at that curve, it looks like after about 48 months the curves are almost identical; it's just that they've already been shifted. Do you see what I mean? If you shifted the drug curve down, it would be almost exactly parallel to the placebo curve.

Correct.

So the log-rank test, is that measuring the area under the curves, similar to...

No, no. And that's what I said with regard to the log-rank test: it tells you whether they are statistically significantly different. The hazard ratio tells you how different they are, but based on the overall study length. And as I mentioned, it makes sense to see what the hazard ratios are at given time intervals and then see how consistent the hazard ratio is over time.

But is there a test that you could do that would answer... I'm not sure that I asked exactly that question. What I was asking is: eyeballing these curves, it looks like you could do a different statistical test, asking what the average time difference between the two is. There's a probability of surviving, but it also looks as though the average person on the drug curve is living about a year longer than the average person on placebo.

Well, this is used sometimes, but
it is thought not to be a completely proper thing to do, to take, say, the median or mean survival of a group and then bluntly compare it. There are other methods, which I'm not covering in this course, certain summarization methods that summarize the differences in survival experience. Yes. And the R functions that we will be practicing with later produce those pointwise confidence intervals for Kaplan-Meier curves.

It should be noted that the power of the analysis depends on the number of terminal events, which are usually deaths. Higher power requires longer follow-up times, because death may not be happening that often; it is not a frequent outcome. The alternative, to overcome this problem of long follow-up times, is to use a more frequent end point, and that would be time to recurrence; we will be using that as an end point in our practice. Estimating the sample size needed to achieve the required power is a really hard task, as I mentioned to you, and there are nomograms that can help you estimate the approximate sample size to achieve a desirable power.

So I'm wrapping up the introduction, and we will be moving on to the practical work. What have we learned from this part? That clinical data is a highly important component and is intrinsically different from genomic and transcriptomic data. Survival data is a special type of data requiring special methodology. The main applications of survival analysis were summarized in the slide that I showed you at the very beginning, and they involve three fundamental statistical methods: Kaplan-Meier survival estimation; the log-rank test for differences in survival experience; and the contribution of multiple risk factors to overall survival, which is assessed using the Cox regression model. These are useful references for you in case you want to look further into the topics, and we can take a
very short break if you like, really just a few minutes, and then we will start the lab.
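Before the lab, it may help to see the first two methods from the summary written out by hand. The sketch below is plain Python rather than the R commands used in the lab, and the follow-up times are hypothetical. The log-rank function uses the standard hypergeometric variance rather than the simpler chi-square-like form shown in the lecture, so treat it as an illustration of the idea and not as the course's implementation.

```python
import math

def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event time, survival probability).

    times  -- follow-up time of each patient
    events -- 1 if the event was observed at that time, 0 if censored
    """
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)   # still followed at time t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - deaths / at_risk                # conditional survival step
        curve.append((t, surv))
    return curve

def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test: returns (chi-square statistic, p-value), 1 df."""
    pooled = [(t, e, "a") for t, e in zip(times_a, events_a)]
    pooled += [(t, e, "b") for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in pooled if e}):
        n = sum(1 for u, _, _ in pooled if u >= t)               # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == "a")
        d = sum(1 for u, e, _ in pooled if u == t and e)         # events at time t
        d_a = sum(1 for u, e, g in pooled if u == t and e and g == "a")
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of d_a at this event time
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value

# Hypothetical follow-up times in months (1 = died, 0 = censored)
treated_t, treated_e = [6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1]
placebo_t, placebo_e = [1, 2, 3, 4, 5, 8], [1, 1, 1, 0, 1, 1]

for t, s in kaplan_meier(treated_t, treated_e):
    print(f"treated: S({t}) = {s:.3f}")
print("log-rank chi2 = %.3f, p = %.4f"
      % logrank(treated_t, treated_e, placebo_t, placebo_e))
```

Censored patients never trigger a step in the curve, but they still count in the at-risk denominator up to the time they leave, which is how the incomplete observations contribute information.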