So I'm from Canada. I'm from Guam originally. I spent some time in Waterloo and Toronto, and now I'm in Ottawa doing data analysis for Shopify. My biggest project has been the Bayesian Methods for Hackers project. It's an open source textbook on learning Bayesian methods, and it's all in Python. If you want to learn cool Bayesian stats, I think this is the way to do it. It's abstracted away from the mathematics of Bayesian stats, so it's computationally motivated rather than mathematically motivated. But that's one of my projects. The project I want to talk about tonight is lifelines, and lifelines is survival analysis in Python. Now, survival analysis, if you read the description of this talk, is one of those tools that didn't quite migrate to the now more mainstream, more popular field of machine learning. You have topics like linear regression and logistic regression that were primarily stats topics, and they migrated into machine learning. Survival analysis unfortunately didn't. It's the leftover kid inside stats that wasn't brought over, which is disappointing. Personally, I used to feel the same way: survival analysis, that sounds like a really boring topic. But in fact, survival analysis is really, really cool. So lifelines is a Python implementation of survival analysis, and it was developed here at Shopify and later open sourced. It's used to measure durations. Now, durations of what? Historically, survival analysis was first developed by actuaries and medical professionals, and they were interested in measuring how long it takes someone to die. That is what an actuary does. That's what they measure. They look at life tables. They look at probabilities. They want to know, given your risk, given how likely you are to die in 10 years, how much money they can make from you.
Medical professionals wanted to know: if I give this population drug A and that population drug B, who is going to survive longer? And they want to know that as soon as possible. They don't want to wait until everyone's dead. And your typical demographic question is, what is the life expectancy of a baby born in Canada today? So survival analysis answers these types of questions. To abstract a little bit, these researchers wanted to know how long it is between the birth event and the death event. In the actuary's case, the birth event might be when you're born, and the death event is when you actually die. In the medical professional's case, the birth event is when you're diagnosed, and the death event is when you die. And then there's an additional problem on top of this, which makes survival analysis really, really interesting and really useful. That's the idea of censorship, and this is when I don't actually see when the individual dies. So consider, in the medical profession, I give a miracle drug to one cohort and a control drug to another cohort. The miracle drug means that no one dies, and the other drug means that some people die. I want to still be able to make an estimate, even though I haven't seen all the deaths. In this case, I've been censored from seeing the death event in one of the cohorts. So that's what I mean by censorship. That's how survival analysis interprets censorship: I haven't seen all the death events. I know that up to some point the person hasn't died yet. I know they're going to die in the future, possibly, but I don't know when. I just haven't seen their death event. Often it's the current time that censors us from seeing the death event. Obviously no one here has died, so all your death events have been censored from me.
That's not true for grandparents and so on, but all you people, your death has been censored. So if I were to measure the average lifetime of all of you, I would have to incorporate that censorship. And I have a typical question here. You're pointing at me? Okay. He was doing this. How can I measure population lifetimes when most of my population hasn't died yet? That's a very traditional problem. So this is a graphical example. Suppose everyone's born at time zero. Can everyone see that? Okay. The contrast is good. Everyone's born at time zero. The red lines are people who have died. The blue lines are people who haven't died. So at time 10, I want to know: what's the average lifetime of an individual from this population? If I were to naively take the mean of what I've observed, it would underestimate the average lifetime, because I haven't seen the blue guys die yet. I'm underestimating how long they're going to live in the future. So that's censorship. It's time 10 right now. I haven't seen these people die, but I want to know what the average life expectancy is. So this is the counterfactual. This is what would have happened had I seen all of history. These blue guys clearly live a lot longer than the red guys. So if I, again, naively took that mean back at time 10, I would be severely, severely underestimating the true population average. Does that make sense to everyone? That's the censorship problem. I don't see all the data, but I still want to make inferences without being biased by censorship. And censorship is everywhere. Once you have this tool of survival analysis, you keep seeing censorship. I keep seeing it. Survival analysis is one of my most frequent go-to tools. And it's a great way to catch people's stats mistakes. You go, oh wait, you're not considering the censored people. If you gave people this problem, a lot of them would do one of two things. Option A would be to just take the average of everything observed so far, and that would underestimate.
Option B, which is even worse, would be, okay, let's just look at the dead individuals, see their average lifetime, and use that to generalize. So both are wrong. Modern day survival analysis is a lot cooler than its actuarial history. I'm not saying actuaries are boring, but you can make that judgment. We use it at Shopify in this sense. A customer joins Shopify. That's their birth event. The death event is when the customer leaves Shopify. So that's their lifetime. Censorship occurs when the current time, today, censors an individual from their death. So all of our current shops, the 120,000 shops, I haven't seen them die, but I still want to make inferences about that population. Another example: when a leader forms a government, that's the birth event. The death event is when that government dissolves. Again, censorship is when the current time prevents me from seeing how long they last. Say I want to know how long, on average, presidents stay in office. That's a bad example. I want to know how long monarchies stay in power, how long they reign, but there exist monarchies today that I haven't seen dissolve yet. Another example: the birth is when a couple starts dating. The death is when the couple breaks up. Censorship in this case could occur when some couples never break up. They eventually get married, and in fact a partner's death comes before they divorce or break up. So in that case, essentially, they never break up, and their lifetime is infinity. Death comes first, and they can't actually break up or divorce. And finally, another example: a senator enters office. That's birth. Death is when they leave office. Censorship could be that they die before retiring. So in this case, again, I want to measure how long they're in office for, how long it takes them to retire. Death can actually be a censorship event. And these last two examples actually turn the idea of death on its head.
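To make the two mistakes concrete, here is a minimal sketch of "option A" (average everything seen so far) and "option B" (average only the observed deaths). The lifetimes are made up for illustration and are not from the talk; in a real problem the true lifetimes of the censored individuals would be unknown.

```python
# Hypothetical true lifetimes in years. In practice we never see the ones
# that extend past the observation time.
true_lifetimes = [2, 3, 5, 7, 12, 15, 20, 30]
observation_time = 10  # "today": anyone still alive at this point is censored


def mean(values):
    return sum(values) / len(values)


# What we can actually record at the observation time.
durations = [min(t, observation_time) for t in true_lifetimes]
observed = [t <= observation_time for t in true_lifetimes]  # True = death seen

true_mean = mean(true_lifetimes)

# Option A: average every recorded duration, treating censored ones as deaths.
naive_mean = mean(durations)

# Option B: average only the individuals whose death was actually seen.
dead_only_mean = mean([d for d, seen in zip(durations, observed) if seen])

print(f"true mean:       {true_mean}")
print(f"option A (all):  {naive_mean}")
print(f"option B (dead): {dead_only_mean}")
```

With these made-up numbers both shortcuts land below the true mean, and option B lands lowest, because it keeps only the shortest-lived individuals.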
When I say birth and death, I'm going to abstract once more. Birth could be anything. Death could be anything as well. You could be interested in the time between the first pregnancy and the second pregnancy. In that case, birth is the first pregnancy and death is the second pregnancy. So you can throw these terms, birth and death, around however you like. The main application of survival analysis is constructing the survival curve. And I'll show an example here. Again, can we all see that? Cool. I'll read the text if you can't. So these are two lines. The title is "Lifespans of different global regimes". The blue line is the survival curve of democratic regimes, and the red line is the survival curve of non-democratic regimes. To interpret this, the y-axis is the probability of surviving past x years. I'll give you some examples. Let's look at 10 years. So this is 10 years. The probability of a democratic regime surviving past 10 years is about 10%. For a non-democratic regime, it's about 37%. So we can see these two lines diverge very quickly. And that makes sense. Democratic regimes are high turnover. Non-democratic regimes have other influences that keep them in power. It is real data. We'll do some more of this later. So this is pretty cool. These are survival curves. I used just simple data to get these curves, and lifelines generated this. So we'll do some math. It's not very complicated, and it's really cool. The survival curve is defined as follows. T is the lifetime of some random individual in the population. That's capital T. Lowercase t denotes time: 0, 1, 2, 3, up to whatever. S of t is defined as the survival curve at time t, and it's the probability that capital T is greater than lowercase t. In words, what is the probability that the individual lives longer than lowercase t? So if I know the survival curve, I know everything about the population.
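For reference, the definition just given can be written out as below. The notation (capital T for the random lifetime, lowercase t for time) follows the talk. The second line is not from the slide: it lists standard identities for non-negative lifetimes, added here only to show how summaries such as the mean and median can be read off the curve, and S(0) = 1 assumes every individual lives for a strictly positive time.

```latex
\[
S(t) = P(T > t), \qquad S(0) = 1, \qquad S \text{ is non-increasing in } t
\]
\[
\mathbb{E}[T] = \int_0^\infty S(t)\,dt, \qquad
\operatorname{median}(T) = \inf\{\, t : S(t) \le \tfrac{1}{2} \,\}
\]
```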
I know how long they live for on average. I know the median life expectancy. I know when they're most likely to die. I know the maximum they could live for. I know stuff like that, statistically, of course. But that's the heart of it. If I know the survival curve, I know everything about my population's lifetimes. I can answer any question, essentially. Okay, but I don't know the survival curve. That's the problem. I just have data. So let's do some Python. Okay, cool. So this is the Python notebook. We're going to use some pandas and lifelines magic. First, a look at the data set. The DD data set is a socio-economic data set. It's really cool. It's basically every political leadership from 1948 till now, from around the world. I'll show you an example. So these are all the Afghan leaderships. You have the head of state. You have what kind of democracy they were. You have the monarchy, or the type of regime. You have the start year, the duration, how long they were in power for, and so on. Let's look at another example. I'll show you more. So you've got more countries: Vietnam, Yemen. It's a really, really cool data set. It's a famous data set called the DD data set. They used it for something else, not survival analysis, but I want to apply it to survival analysis. It's got the region. Yeah, it's one of my favorite data sets. So that's df. So from the lifelines library, I'm going to import the core estimator of survival curves. That's called the Kaplan-Meier curve, and inside lifelines it's called the KaplanMeierFitter. So who knows scikit-learn? Who's used scikit-learn before? A handful. I've tried to model the lifelines API after the scikit-learn API, because I really like the scikit-learn API. So KaplanMeierFitter is an object. It's instantiated like this. And it exposes a fit method, which is what scikit-learn also uses.
It's a fit method, and all I have to do is give it two arguments. Let me go back a step. I give it the durations. That's how long they're in power for. Call that T. So that's just the number of years. Yeah, there's like 1800 points here. And I need censorship. And that's, I think... What do I call that? Event observed? Or just observed. Cool, yeah, thanks. And that's just a true-false. So it's one if I saw their death, zero if I haven't. So zero means they're censored. I haven't seen their death. Call that C. Okay, so kmf just accepts T and C like that. Yep, thank you. And it returns, you know, so we have 1800 observations, of which 340 have been censored. Actually, to be specific, censorship in this case means you either died in office or I haven't seen your government end yet, i.e. you're still in power at the time of this data set's creation. So I'm only interested in how long it takes from when a government is born, when it enters power, to when it dissolves naturally... Did my computer just do that? Yeah. Did it just change color on you guys? Yeah. Cool. It's f.lux. Okay, so that's cool. That's great. Let's do some plotting, because that's fun. Lifelines has some nice wrappers around plotting. So this is the survival curve of all political regimes. We can see that after 10 years there's only a 20% chance you'll still be in power. Now, I haven't done any group-bys or any partitioning. This is just across all global regimes. So that's pretty cool. Let's do some partitioning to get a better view of what's going on. So let's see. In this data set, let's look at... How about regime? The different types of regimes. So there's military dictatorship, presidential, I think democratic. Well, we'll take a look. Did I call it df? Yeah. So monarchy, civilian dictatorship, mixed democratic. I'm not sure what that is, but... So let's go for r in... Let's go ix is equal to... I'm using a lot of pandas stuff here.
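To show what the fit step is estimating from those two arguments, here is a minimal pure-Python sketch of the Kaplan-Meier estimator. It is an illustration under stated assumptions, not the lifelines implementation: the function name `kaplan_meier` and the toy data are hypothetical, while `T` and `C` follow the talk's convention (durations, and 1 if the death was observed, 0 if censored).

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at every time t where at least one death was observed.

    S(t) is the product, over death times t_i <= t, of (1 - d_i / n_i), where
    d_i is the number of deaths at t_i and n_i the number still at risk then.
    """
    deaths = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # deaths and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if deaths[t]:
            survival *= 1 - deaths[t] / at_risk
            curve.append((t, survival))
        # Individuals censored at t still counted as at risk for deaths at t.
        at_risk -= exits[t]
    return curve


# Toy data (made up): six durations, two of them censored.
T = [1, 2, 2, 3, 4, 5]
C = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(T, C)
for t, s in curve:
    print(f"S({t}) = {s:.4f}")
```

The censored individuals never cause a drop in the curve, but they do stay in the denominator until the time they were censored, which is how the estimator avoids the bias of simply ignoring them.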
If you don't know pandas, I highly recommend it. It's pretty much the best thing ever. So what I'm doing is I'm taking the entire data set and only looking at the rows where the regime equals r. So it's equal to civilian dictatorship, or monarchy. I want to take that data, fit it with my kmf, and then plot it. Cool. And I'll do a plot. And one more thing I want to do. There's a label param. I could just call it r. That makes sense. And that's maybe too many. Let me just reduce the number of things. Okay. So we're looking at four different regimes here. There's monarchy, which is this large blue one. Civilian dictatorship is this red one. Military dictatorship is this purple one. And this green one is parliamentary democracy. So we can see that if you want to rule, be a monarch, and make sure your dad or your mother is already in power. Otherwise, I would choose between civilian dictatorship and military dictatorship. Finally, parliamentary democracies. They don't last very long. And that makes sense. Canada has a parliamentary democracy, and we have a pretty quick turnover. I think Stephen Harper has been in power for 10 years now, maybe. Too long. Possibly. And we'll just do the last few. I'm not sure what they were. So this is presidential democracy and mixed democracy. Presidential democracy is interesting. That's like the U.S., where you almost always have this time limit of two terms, or eight years. So you see this cool drop-off at four years, which makes sense. Those are the really quick presidents. And then very few make it past eight years. So that's pretty cool. What else can I show you in this data set? Do you guys want to see it partitioned by, what is this one, the continent? The region? Sure, continent? So, yeah, let's do them all.
So in Africa you have, not the most stability in your government, but governments that stay in power the longest, which I guess makes sense. And then what's this blue one? That's Asia. Yeah. So that's pretty interesting. You can measure the lifetimes of these political regimes. Sorry? Yes, of course. Yeah. So there are more monarchies in Africa. That's true. So these curves are influenced by the types of governments inside those continents. One thing that I'll quickly skim over, if I can pop back to my slides: there are hazard curves and other cool mathematics. But what if you have more data? What if I had data about the age of the president or the person in power, or the gender, or the country, stuff like that? I can do some really cool things. I can do survival regression, where I start to incorporate that additional data on a per-individual basis. And then I can create a survival curve given that you have such-and-such attributes. So it can get really specific. I can create individual survival curves for people. I don't have enough time now, but in one of my other talks, I created the survival curve of Stephen Harper. He had a good chance of making it past six years. But he had a 12-year maximum: the probability that he was going to make it past 12 years was zero. So that makes some people happy, maybe. So what else does lifelines have? It's got statistical tests, p-values and that stuff. It's got cross-validation for model selection. You need that for your regression models. You need cross-validation to vet your models, to make sure they're doing what you think they're doing and they're not overfitting. There are some utils for transforming life tables into durations and back. And there's an artificial data generation module. That was more of a leftover from when I was testing it locally.
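As an illustration of what converting between durations and a life table involves, here is a small pure-Python sketch. It does not use or reproduce the lifelines utilities: the function names `life_table` and `durations_from_table`, the row layout, and the toy data are all hypothetical.

```python
from collections import Counter


def life_table(durations, observed):
    """Summarize individual records as one row per distinct time."""
    deaths = Counter(t for t, seen in zip(durations, observed) if seen)
    censored = Counter(t for t, seen in zip(durations, observed) if not seen)
    at_risk = len(durations)
    rows = []
    for t in sorted(set(durations)):
        rows.append({"time": t, "at_risk": at_risk,
                     "deaths": deaths[t], "censored": censored[t]})
        at_risk -= deaths[t] + censored[t]
    return rows


def durations_from_table(rows):
    """Expand a life table back into one (duration, observed) pair per individual."""
    durations, observed = [], []
    for row in rows:
        durations += [row["time"]] * (row["deaths"] + row["censored"])
        observed += [True] * row["deaths"] + [False] * row["censored"]
    return durations, observed


# Toy data (made up): six individuals, two of them censored.
T = [1, 2, 2, 3, 4, 5]
C = [True, True, False, True, False, True]

table = life_table(T, C)
for row in table:
    print(row)

T_back, C_back = durations_from_table(table)
```

The round trip preserves every (duration, observed) pair up to ordering, which is all a Kaplan-Meier style estimate depends on.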
I just wanted to make sure that if I created an artificial data set, my model would fit that artificial data set perfectly. I won't get to that. I've got two minutes, so I'm going to end quickly. There's one thing I want to show you guys. My latest project is Data Origami. Data Origami is a free way to learn data science. We go through things like PCA, some stats models, some probability models, and lots of pandas. We use lots of pandas. I do some Bayesian modeling. What else? Survival analysis, of course. We look at really cool libraries like Patsy. Patsy is a way to create feature matrices from existing data. Survival analysis, yeah, yeah. Lots of A/B testing, like Bayesian A/B testing. That's really cool. So I encourage you guys to check this out. It's my latest project. It's like a little startup too, so I'm pretty excited about it. And there's a really great blog too that I highly recommend you guys check out. So if you're interested in data science or what I was talking about tonight, Bayesian methods, survival analysis, that sort of thing, check out dataorigami.net. And just ping me at Cameron.Davidson.Pilon at Gmail. I can hook you up with a sweet discount too, since you came out tonight. 45 seconds, yeah. Perfect. Okay, thanks everyone.