All right, so the rest of the talks in this session are about healthcare data, but this one is specifically about a healthcare analysis tool and how to use it for things that have nothing to do with healthcare. So hopefully this will be interesting and useful to some of you. It was actually one of the biggest projects I did at work last year. We're going to talk a little bit about what right-censored data is, then about what a survival function is and how to estimate it. I have a couple of real-world examples I'll show you, but they're kind of boring paying-customer industry examples. And then we'll take a little detour into SQL to actually compute these curves.

Right-censored data, on a left-to-right timeline, is any data where at some point you stop being able to observe it and you don't know how far it goes on from there. If we wanted to, for example, know the average length of the lines in this figure, we'd know it was more than what we can see, because we know some of them go on longer than what is visible here, but we don't know exactly how much further they go. So it's a little hard to answer questions like how long these things go on, and other questions like that. And not all right-censored data is beautifully lined up at the start of your experiment. Some of it comes from the real world, so it's things like: when did the customer start paying you? Or when did somebody graduate from their MLIS? What you know about each observation depends on how long you've seen it for, and it can be censored at any time, either because the subject drops out of the study or because you hit the censoring time: it's now today, so we don't yet know whether people will hit the event we're interested in.

So how do you find a midpoint when so many points are unknown? Well, they're not entirely unknown; we do have some data about them, and there is a method, so we can do this. But I'm also going to argue that maybe this is not the best question. A lot of the time I get questions like: how long does it take someone to use such-and-such a feature? How long does it take someone to send an invite? What people really want to know is what happens in the first few days of us seeing something. So answering "how many people send one during the first day?" or "what proportion of people do something during the first seven days?" is just as useful as, or more useful than, "what's the average time?" The average time could be much longer. But we can still use survival analysis to get that average time, and other information about the process.

We're going to look at survival functions. They are downward-sloping curves with time on the x-axis and the proportion that have not yet hit the event on the y-axis. The event is canonically death, survival analysis being what it is, so the y-axis is often called the proportion surviving, and that's why it's called the survival function. We're going to estimate them with Kaplan-Meier curves: a non-parametric estimator that approaches the true survival function as we add more data, and that uses all the data available at every time point. It treats observations that started at different times as if they were lined up, by saying: I've seen this one for its first few days, so I can count it for day one, regardless of when its day one actually happened.
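In symbols, as a standard textbook definition (this wasn't spelled out on the slide): if T is the time from the start of observation until the event, the survival function is

```latex
% The survival function: the probability that the event time T exceeds t,
% i.e. the proportion still event-free ("surviving") at time t.
S(t) = \Pr(T > t)
```

It starts at S(0) = 1 and steps downward toward 0 as t grows.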
Then we can actually estimate it, and the estimator we use is pretty simple. It's a step function over time: at each event time, you multiply in one minus the number of observations that saw the event at that time, divided by the number of observations you still know about at that time. As the censoring goes on, more and more observations drop out of that bottom number. Some of my paying customers I know about for four years, some for three weeks, or whatever; but as I get further out and start looking at four years, I have fewer and fewer customers that I've known about for that long, so the denominator gets smaller and smaller.

One other wrinkle: this estimator will often not go all the way to zero. To reach zero you'd need a factor of 1 - 1 = 0 in the product, and that would mean every single observation you still knew about at some time saw the event at that time. But we know we have censored data, so that's not likely to happen.

Once you have your estimate, it's really easy to find the median. It's not the mean that people are actually asking for, but the median is probably close enough to what they really want. You find when 50% of observations have met the event, that is, the first time the curve drops to 0.5, and read straight down to the time axis. But that's not all you get: you get this entire curve. This is a real data set from my real work, where we look at how long a customer stays a subscribing, paying customer. So I can answer questions like: what's the probability that somebody stays for the first seven days after they sign up? What proportion stays an entire year? That's a much richer picture than just an average.

And that's still not all. You can take different groups of data, compare their curves on the same chart, and get a feel for which ones are more likely to hit the event sooner or later. This is our favorite example, because it gave us some really good business objectives to change things around. When we bill people annually, they're more likely to drop out in the first 30 days than when we bill them monthly; but after the first 30 days, annual customers are more likely to stick around for one year than monthly customers are to stick around for just two months. We really want people to stick around and keep paying us, so this incentivizes us to give discounts for annual subscriptions and try to sell more of them.

Now I'm going to talk a little about how to do this in SQL. You probably don't need to do this in SQL for what you're doing. You might; it's cool. You can definitely do this in R or Python instead, and it's a lot simpler and more straightforward there. But I work in dashboard-land, and I need to show this data and update it periodically with all the real data behind it. I don't want some kind of crazy job that runs Python every night and uploads a file with an image in it every time. So I just do it in SQL, and the dashboard gets that data right away for people to see.
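Concretely, the quantity the SQL needs to compute is the estimator described above, in standard Kaplan-Meier notation (a textbook formula, not something shown on a slide):

```latex
% Kaplan-Meier estimator: a step function over the event times t_i, where
% d_i = number of events at time t_i, and
% n_i = number of observations still at risk (still known about) at t_i.
\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```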
There are a couple of tricks I'm going to play. The first: I'm going to generate a list of numbers. The second: I'm going to sum the results of a CASE statement. And the third, my most mathematical trick, believe me on this one, or look it up: the log of a product is equal to the sum of the logs. So we can rewrite that product by taking the log of each factor, changing the product into a sum of logs, and taking the exponential of the whole thing to get back to the right scale: the product of (1 - d_i/n_i) equals exp of the sum of ln(1 - d_i/n_i). That lets us use tools that are built into SQL. SQL has no notion of taking a product over many rows, but it's pretty cool with taking a sum over many rows.

All you have to have on your main table is the event time and the time seen. The event time is going to be NULL for all the censored observations, and you'll have a lot of those: observations that just never hit the event. They might hit it eventually, but you don't have that data yet. You always know how long you've seen them for, though, whether that's one week or four years or whatever it is. So you get all of that into the same table and put it together.

Then, as a framework for the rest of the query, we generate the list of time periods we want to step over, and we do that with ROW_NUMBER. We create a little table of day numbers, I'll call it daynums, by selecting ROW_NUMBER over anything from some big table and limiting it to 3000 rows, because I only want 3,000 days. That gives me a list of 1 through 3000 that I can use as a set of numbers.

I apologize for there being a slide that is 100% SQL, but I promise you, this is just calculating 1 - d_i/n_i for each of those time periods. Let's look at it. For each time period, from daynums, our list of 1 through 3000, I select the day number, and then one minus a ratio. The numerator is the people who actually saw the event at that time: a SUM of a CASE over the rows we've seen for exactly that amount of time and whose event time is not null. The denominator is how many rows we joined in from the table with our data: all the ones whose time seen is greater than or equal to this day number. So now I've got a table that's just the day number and 1 - d_i/n_i.

Then I want to pull that together into the product, and I get to use summation as a window function. I take the exponential of the sum of the log of the inside, the 1 - d_i/n_i, over a window frame of UNBOUNDED PRECEDING to CURRENT ROW. That takes all of the previous rows and adds them up through the current row, so for every t_i you have the factor at t_1 by itself, then the factor at t_1 times the factor at t_2, then t_1 times t_2 times t_3, and so on. That gives you a table with just the day number and the proportion surviving at each time.

But I told you you could have more than one curve, and you can have them in your SQL without computing each one separately. You just need some categorical variable included in your original data set. Then, when you calculate d_i/n_i for each of the days, you also group by your categorical variable: here I'd select daynum and billing period, and group by daynum and billing period. And when I get to the end, I just tell the window function to partition by billing period, so it calculates the curves separately for the two different billing periods, or however many categories you have.
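The slides with the actual SQL aren't captured in this transcript, so here is a sketch of the kind of query being described. Table and column names (subs, event_time, time_seen, billing_period, some_big_table) are hypothetical placeholders, and the syntax is Postgres-flavored; adjust for your own database.

```sql
-- Kaplan-Meier in SQL: a sketch, assuming a table "subs" with
--   event_time: days from signup to the event (NULL if censored)
--   time_seen:  days we have been able to observe the row
WITH daynums AS (
    -- Trick 1: a list of day numbers 1..3000, made by numbering
    -- the rows of any sufficiently large table.
    SELECT ROW_NUMBER() OVER () AS daynum
    FROM some_big_table
    LIMIT 3000
),
per_day AS (
    -- Trick 2: 1 - d_i/n_i for each day, via the SUM of a CASE.
    -- Numerator: rows seen for exactly this long whose event happened.
    -- Denominator: rows still known about (at risk) on this day.
    SELECT d.daynum,
           1 - SUM(CASE WHEN s.time_seen = d.daynum
                         AND s.event_time IS NOT NULL
                        THEN 1 ELSE 0 END)::float
               / COUNT(*) AS factor
    FROM daynums d
    JOIN subs s ON s.time_seen >= d.daynum
    GROUP BY d.daynum
)
-- Trick 3: a running product via exp(sum(ln(...))).
-- (LN would fail on a factor of exactly 0, i.e. a day where every
-- at-risk row hit the event; as noted above, censoring makes that
-- unlikely.)
SELECT daynum,
       EXP(SUM(LN(factor)) OVER (ORDER BY daynum
                                 ROWS BETWEEN UNBOUNDED PRECEDING
                                          AND CURRENT ROW)) AS surviving
FROM per_day
ORDER BY daynum;
```

And the multi-curve version just described, carrying a hypothetical billing_period column through the GROUP BY and partitioning the window by it:

```sql
WITH daynums AS (
    SELECT ROW_NUMBER() OVER () AS daynum
    FROM some_big_table
    LIMIT 3000
),
per_day AS (
    -- Same 1 - d_i/n_i, but computed separately per billing period.
    SELECT d.daynum,
           s.billing_period,
           1 - SUM(CASE WHEN s.time_seen = d.daynum
                         AND s.event_time IS NOT NULL
                        THEN 1 ELSE 0 END)::float
               / COUNT(*) AS factor
    FROM daynums d
    JOIN subs s ON s.time_seen >= d.daynum
    GROUP BY d.daynum, s.billing_period
)
-- One running product per billing period.
SELECT daynum,
       billing_period,
       EXP(SUM(LN(factor)) OVER (PARTITION BY billing_period
                                 ORDER BY daynum
                                 ROWS BETWEEN UNBOUNDED PRECEDING
                                          AND CURRENT ROW)) AS surviving
FROM per_day
ORDER BY billing_period, daynum;
```

Reading off the median from earlier is then just a matter of finding the first daynum where surviving drops to 0.5 or below.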
And then you still have your UNBOUNDED PRECEDING to CURRENT ROW frame to get your multiplication over all the past values of t_i. You'll get something like this, where you have the day number and then the different values for the billing period, or whatever your categorical variable is.

So this kind of analysis is pretty cool, and it can help you out. You can find the average time to an event, but that's not all: you can actually understand the process, get some data about how long different portions of time last, and, perhaps most excitingly, compare different groups and make business decisions based on which of them are more likely to hit those events and how quickly. So thanks for listening to me. You can find me on the survival analysis channel on the csv,conf Slack. I'll still be in the Q&A channel like everybody else, and I'm Ansid on Twitter and Ansid on Mastodon.

Awesome, thank you so much, Melissa. I'm gonna clap. My cat is determined to be part of this. Hello, kitty. It's fine. Okay, we have a question, if you don't mind answering. Yeah. How much (and what kind of) dialogue did it take to switch to this method versus using averages or rates?

Oh, what a good question. It was actually an idea proposed by my CEO, so we had plenty of buy-in right away. They had heard about survival functions and knew they could get that comparison across groups, and that was really what they were going for. But they still ask me questions in that midpoint way, and I just don't answer them that way; I answer them for the first seven days, or the first day, or something.

Cool, we still have a bit of time, so if anyone has any other questions, I'm sure Melissa will gladly try to answer as much as she can. Oh, she totally will; I'll try my best. Let me just turn my camera back on. Okay, not alone now. If everyone's happy, Melissa will be around; she does have a Slack channel on the csv,conf Slack, which hopefully she'll visit after the conference as well. And oh, that's a good question: any suggestions for background reading?

Beyond the slides with the SQL on them: the book I've been using, and I hope this isn't flipped for you, is Survival Analysis by Kleinbaum and Klein. And this talk is just chapter one. There's so much more: statistical tests to tell you whether those curves are different, ways to put covariates in and add them to the model. There's so much more to survival analysis; this is just the tiniest bit you can do that's already useful.

Great, all these great questions. Someone has asked: can you tell us a little more about your own background? I'm a data analyst slash data engineer, and I've been doing this for a long time. I work at a startup, and I'm the one data person there, so I do all of the product analytics and financial analytics and things like that. But I'm very interested in government and open data, and so I'm always really happy to be here at csv,conf.

Amazing. And another question: have you used survival analysis for forecasting? I have not; I think that would be interesting. And this has me thinking of looking at outcomes for program evaluations. Any warnings for people trying to apply it? I think you should totally try it, and that would be an easy thing to have separate curves for: different programs, what the outcomes are, and when people reached them.
And do people from certain programs reach them faster than others? Cool. Well then, if there are no other questions (and I feel like when I asked whether anyone had questions, everyone came out with them), a big thank you from all of us at csv,conf, because Mel, you've been one of our biggest supporters for many years now. So thank you. You keep accepting my talks, I keep coming. And they're always so much fun. Thanks so much. Thanks so much. Bye.