All right, good afternoon everyone. I hope you've got some nice coffee with you; I've got mine. So we're kickstarting the data science track again, and we're going to be joined by Alon, who's going to be talking about causal inference. Before Alon gets himself set up, just a quick announcement: if you haven't already, do look out for the sprints that are going to be happening tomorrow and the day after. We have some really nice projects, and you can also propose your own projects there, so do check them out in the schedule. And if you have any issues, let us know. Now, moving over to the talk. A quick thing about Alon: he is a senior data scientist at Spotify, and he's going to be talking today about sliding into causal inference with Python. With that, the stage is all yours.

Hello, I'm Alon Nir, and today I'll talk about sliding into causal inference with Python. On the agenda today we have three parts. We'll start with an introduction: what do I mean when I say sliding, and what is the gold standard for causal inference? Then we'll spend the majority of the talk discussing what happens when you can't actually apply that gold standard; we'll talk about some of the common reasons and some of the common solutions. And finally we'll recap and discuss the next steps you can take in your journey into causal inference.

Before we dive in, allow me to say just a couple of words about myself. I am a senior data scientist at Spotify, although for obvious reasons I won't be discussing any of the work I do at Spotify. I am based in the UK. You can find me on Twitter and on LinkedIn. And I will admit that I still quite often read causal inference as casual inference. And casual inference, as you can tell by the name, is a very chill and laid-back variety of statistical inference.

Due to time constraints, we will have to resort to lots of simplifications and exuberant hand-waving. Causal inference is a huge topic and we only have 30 minutes, so that's a sacrifice we'll need to make; I hope you'll forgive me. I will also skim through the slides that have code in them, because the code is very simple, very well documented, and available on GitHub, so you can review it at your own leisure.

So with that, let's get started. When I was a kid in the 90s, there was this show I really, really liked. It was called Sliders, and it told the story of four unwitting explorers. This guy in the orange-yellow jacket, Quinn Mallory, super smart, invented a sliding device that allowed him and his friends to hop between parallel universes. The intro to the show always had him say: what if you found a portal to a parallel universe? What if you could slide into a thousand different worlds, where it's the same year and you're the same person, but everything else is different? So they could slide, for example, into a world where penicillin was never discovered, or where the atomic bomb was never finished, see how that played out and how it affected the world during the 90s when they slid in and out, and explore alternative histories and all sorts of cool stuff like that.

In academia and in industry, we already use simulation quite heavily. And the question I have for you today is: can we simulate parallel universes too? Can we build a sliding device of our own with Python? That would be really, really cool. But before we answer that, we need to start from the basics.
So in order to start from the basics, I'd like to introduce an example that will follow along with us throughout the slides today, and it has to do with my girlfriend, Emma. This is an artist's depiction of Emma, of course; in real life she is much more lovely, and most of the time she has three dimensions. Emma is very passionate about, and very good at, knitting and crocheting. She's so good that she actually decided (not really) that she would start selling wool online. And she decided (again, not really) to call that shop, wait for it, Amazon. Now the question is, can we use data science to make sure that business is booming, the customers are happy, and so on and so forth?

A very simple and common example of something we can do is to run an A/B test, or more formally a randomized controlled trial. It's something that happens a lot with marketing, and emails in particular. Let's assume there's an email that Emma sends out to her customer base every month with the latest and greatest things in the world of wool and yarn, and by default she uses a blue font in her emails. Now we want to see what would happen if she changed that font to orange. The way you do a proper A/B test, or randomized controlled trial, is you randomly assign 50% of the customers to one group and 50% of the customers to the other group. Then you measure how they perform, you see which number is bigger, and you award that variant the win. Obviously that was very hand-wavy; if you want to know more, I strongly recommend Jake VanderPlas's talk, Statistics for Hackers. Even though it's from five years ago, in my opinion it's one of the best PyCon talks of all time, and it's still very relevant nowadays.

A randomized controlled trial is called the gold standard in causal inference. The reason it's the gold standard is that observed and unobserved covariates are balanced through the magic of randomization, so they don't come into effect when we measure the impact of the treatment (or the manipulation, or the change, and so on). An example of an observed covariate would be gender, the city you live in, and so on. An example of an unobserved covariate would be color blindness. If you run an experiment about the color of the font you're using in your email, you would care about color blindness: imagine that the randomization was off, and one group had a greater proportion of color-blind people than the other group; that would enter into the effect and into the results you'll see. Obviously we don't want that, and the beauty of randomization is that it solves all sorts of issues like that, where there are unobserved covariates. We may not know the value of that property for each given individual, or we may not even know it's something we should care about, and randomization solves that: it balances those covariates between the two groups, so they don't enter into it when we measure the impact of the treatment compared to the control.

That's all fine and dandy, but what happens when you can't experiment, when you can't run a randomized controlled trial? Let's quickly discuss a few of the main reasons why you wouldn't be able to. One of them is network effects; you may also see this called a SUTVA violation, or spillovers. Basically it means that the allocation of users to one group affects the behavior of people in the other group.
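As a quick illustration, here is a minimal sketch of what that random 50/50 split could look like in Python. This is my own reconstruction, not the talk's GitHub code, and the conversion rates are made-up numbers purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical customer base: 10,000 email recipients.
n_customers = 10_000

# Randomly assign each customer to control (blue font) or variant (orange font).
# Randomization balances observed AND unobserved covariates (e.g. color blindness).
is_variant = rng.random(n_customers) < 0.5

# Simulate whether each customer orders; these rates are invented for illustration.
base_rate, variant_rate = 0.10, 0.12
ordered = np.where(
    is_variant,
    rng.random(n_customers) < variant_rate,  # orange-font group
    rng.random(n_customers) < base_rate,     # blue-font group
)

# Compare group means and award the larger one the win.
print(f"blue conversion:   {ordered[~is_variant].mean():.3f}")
print(f"orange conversion: {ordered[is_variant].mean():.3f}")
```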
For example, in our email campaign, imagine that the variant, the orange campaign, was so successful that everyone who received it immediately rushed to Amazon and completely bought out the entire stock. What happens then is that the people in the control group, the blue group, don't have any more wool to order. The fact that people in one group affected the results of another group is a problem, because we artificially lowered the order volume of people in the control group, which obviously we didn't mean to do. Any comparison we draw is unfair, because the people in the blue group wanted to order wool as they always do, but they just couldn't, because there was none left. So that's network effects in a nutshell.

Other reasons can have to do with morality or legality. For example, if you wanted to see what the effect of drugs is on kids between 12 and 15, and how that affects their grades, you probably can't run a randomized controlled trial, and you certainly shouldn't, because giving drugs to kids is not advisable in any situation. Similarly, there are experiments that are too expensive, or infeasible, and so on; I discuss these more in the appendix.

So having said that, what can we do when we can't run a randomized controlled trial? What we're going to do is try to get as close to that gold standard as possible, and we'll see two applications. First we'll see a case where the assignment to control and variant wasn't random, and what we'll do is fix that allocation to make the comparison between the groups fairer, which is pretty cool. Then we'll look at a case where everybody is exposed to the treatment, so you literally don't have a control group to compare to, and we'll see what we do there.

So if we go back to Amazon: let's assume, just for simplicity, that every order on Amazon contains exactly one ball of wool, the delivery fee is two pounds flat per order, and things are at a steady state, which is a fancy economics term for saying nothing changes. If you ordered five balls of wool last month, you will order five balls of wool this month, and five balls of wool next month, and so on and so forth. Obviously not very realistic, but again, I'm oversimplifying and waving my hands in the interest of conveying the information.

Those are the baseline conditions. Now Emma comes up with a brilliant idea: she wants to introduce a new service called Amazon Sublime, which offers unlimited free delivery on every order if you pay an 11-pounds-a-month subscription fee. The question is, how will that affect the business? So let's do what we do very well. Let's take the group that got the treatment, the people who registered to become Amazon Sublime subscribers, and we'll see that on average they place 13 orders a month. Then we'll look at people who were untreated, and we'll see they place 4.7 orders a month on average. The difference is 8.3 orders a month, about 180% more than the non-Sublime subscribers: basically a smashing success. Let's roll it out, let's get as many Sublime subscribers as possible. Or should we?

Let's dive in a bit more deeply. Here's the distribution of monthly orders before Sublime was introduced. We can see that, very conveniently, we have exactly 10,000 customers: 1,000 customers order once a month, another 1,000 order twice a month, another 1,000 order three times a month, and so on and so forth. Perfectly uniformly distributed; couldn't be simpler than that, right?
So that's what they did before the introduction of Sublime; now let's see what happens after. In this chart, the horizontal axis is the number of orders before, and the vertical axis is the number of orders after the introduction of Sublime. It's easy to see, first of all, that only users who ordered six or more times a month got a Sublime subscription, which makes sense: if they ordered six or more times a month, they would have spent 12 pounds per month or more on delivery fees, whereas here they can pay just an 11-pound flat fee and get free delivery. For the same reason it makes sense that people who ordered five or fewer times a month didn't get the subscription.

If we zoom in on the people with a high order volume before the introduction of Sublime, we'll see that 50% of people among those cohorts got the subscription and 50% didn't; a big dot represents 1,000 people, a small dot represents 500. So everything is really neat and nice. And if we take our rulers out and measure the gap, we'll see that the difference between the number of orders before and after is exactly five, regardless of how many orders you made prior, as long as it's six or above.

So what we did when we compared the treated to the untreated is this: we included all of the untreated users, including the people with low order volume, who were unaffected and would never convert. We made an unfair comparison. What we should have done is compare against the people who would consider getting Sublime, because they have six or more orders a month, and use them as the control group. When we used everyone who wasn't treated, we introduced a selection bias: people with high order volume self-selected themselves into the treatment group. There was no random assignment; they selected themselves. So what we should do is find an appropriate control group, and the way we do that is with a very nice technique called propensity score matching.

Here are the steps for propensity score matching. First, we learn a model that predicts the propensity of a user to get the treatment; in our case, people who ordered five or fewer times would get zero propensity, for example. Then we match every treated user to an untreated user that has a similar propensity to get the treatment. Then we measure the delta in how much they order, average that, and we get the effect size.

In practice (again, I'm going to skim through these slides) we generate some dummy data: we arrange it so that about 50% of the population gets Sublime if they have six orders or more, and the uplift is five orders each. We create that random data and we get a summary table showing that indeed, if you ordered between one and five times a month, there's a zero conversion rate, and for six to ten, roughly 50%. Next, the important part, we learn a model. We'll use a simple logistic regression, just following along the basic steps you've probably seen many, many times. If we plot the predicted propensity to convert by the number of orders made prior to the introduction of Sublime, we get this. It's not great, but it's fine, because the model did learn that the more orders you make, the more likely you are to become a Sublime subscriber.
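To make those steps concrete, here is a minimal end-to-end sketch of the idea. It is my own reconstruction rather than the talk's GitHub code, it assumes scikit-learn, and the matching is the naive one-nearest-neighbor variety the talk mentions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# 10,000 customers; prior monthly orders uniform on 1..10, as in the talk.
df = pd.DataFrame({"orders_before": np.repeat(np.arange(1, 11), 1000)})

# Only customers with 6+ orders ever subscribe, and ~50% of them do.
# The uplift for subscribers is +5 orders a month.
eligible = df.orders_before >= 6
df["treated"] = eligible & (rng.random(len(df)) < 0.5)
df["orders_after"] = df.orders_before + np.where(df.treated, 5, 0)

# Naive comparison of treated vs. everyone untreated: the biased ~8.3 number.
naive = df[df.treated].orders_after.mean() - df[~df.treated].orders_after.mean()

# Step 1: learn a propensity model (a simple logistic regression).
model = LogisticRegression().fit(df[["orders_before"]], df["treated"])
df["propensity"] = model.predict_proba(df[["orders_before"]])[:, 1]

# Step 2: match each treated user to the untreated user with the
# closest propensity score.
untreated = df[~df.treated]
nn = NearestNeighbors(n_neighbors=1).fit(untreated[["propensity"]])
_, idx = nn.kneighbors(df.loc[df.treated, ["propensity"]])
matched = untreated.iloc[idx.ravel()]

# Step 3: average the treated-minus-matched deltas; this recovers ~5
# orders instead of the inflated ~8.3.
att = (df.loc[df.treated, "orders_after"].values
       - matched["orders_after"].values).mean()
print(f"naive estimate: {naive:.1f}, matched estimate: {att:.1f}")
```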
In reality we would expect better, but since we're using a very simplistic example, we can make do with the model the way it is. So we take the propensity scores and we match people by them. This is one very, very naive and simple way to match; in reality there are many different ways, and we could spend an hour just discussing how to match, so I will spare you. Then we do some magic, and you will have to trust me that we indeed get a five-order gap, and not 8.3.

So, to summarize: we saw that the split into treated and untreated wasn't random, it was skewed, and it inflated our effect size. Then we used PSM to fix that problem with the assignment to treated and untreated. We saw a very simplistic example where we only had one covariate, so it was very simple; we didn't even need to learn a model, we could have matched them by hand. But in the real world we have many covariates, and the propensity score is a very easy way to narrow all of that down to one number per user, or per experimental unit, whatever you want to call it. If you think about adding more and more dimensions, and the curse of dimensionality, matching people by hand would be really, really hard; but if you have only one score per person, that becomes easy.

That was really nice. Now let's jump into the next example, where I'm rolling back the clock: Sublime was never introduced. Let's say that now Emma wants to see what would happen if she just gave free delivery to everyone, on every order, and everybody knows about it; it makes the rounds on social media and so on. What would the effect of that be? Well, we have a problem: we just don't have a control group, because everyone is exposed to the treatment, everybody gets free delivery on every order.

In an ideal world, we would have a data table that shows how many orders each user made with free delivery, and then another column with how many orders they would have made if they hadn't had free delivery. You calculate the difference, you get the average treatment effect, or the effect size, and that's great. But in reality you only get to experience one world, one outcome. You only know how many orders a month people place with the free delivery; you have no idea how many they would place if they didn't have free delivery, because we can only observe one outcome per person per time unit. This is actually called the fundamental problem of causal inference: we only observe one outcome per person, and we don't know what the outcome would have been had the person been under different circumstances.

So what do we do? There are various fairly simple things you can do, like pre-post analysis and diff-in-diff, which are covered at length in the appendix of this talk, again available on GitHub. What we will focus on is synthetic control, which is actually our way to build a sliding device. What if we used Python to generate this parallel world where the treated weren't treated, which is basically what we want to find out?

In the context of Emma, let's say there are three product lines: green wool, blue wool and yellow wool, and the people who buy each color are their own customer base. There is no overlap: they don't know each other, and they don't affect each other at all; the three segments are completely independent.
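Before going further, here is a tiny made-up illustration of the fundamental problem mentioned a moment ago: the ideal table with both potential outcomes per user, versus the single column we actually get to observe. All the numbers are invented for this sketch:

```python
import numpy as np
import pandas as pd

# The ideal (unobservable) table: BOTH potential outcomes for each user.
ideal = pd.DataFrame({
    "orders_with_free_delivery":    [9, 4, 12, 7],
    "orders_without_free_delivery": [6, 3, 8, 5],
})
ideal["effect"] = (ideal.orders_with_free_delivery
                   - ideal.orders_without_free_delivery)
print("average treatment effect, if we could see both worlds:",
      ideal.effect.mean())

# What we actually observe once everyone gets free delivery: the
# counterfactual column is simply missing.
observed = ideal[["orders_with_free_delivery"]].assign(
    orders_without_free_delivery=np.nan
)
print(observed)
```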
Now, people who order blue wool have a low ordering frequency, people who order yellow wool have a high ordering frequency, and people who buy green wool have a medium one. Let's say, for the sake of the example, that Emma offers the unlimited free delivery on every order only to people who buy green wool. That is something we can work with, because if you remember your days in nursery or kindergarten, you'll know that if you mix 50% blue with 50% yellow, you get green.

At a very, very high level, this is what synthetic control does. It generates a control group by weighting different groups: by themselves they're not a good comparison, but weighted they do make sense. You can see here that the numbers work out according to what we established earlier about ordering frequency. In code (again, I'm not going to cover this in too much detail), we're basically generating some dummy data, and we can see that things look the way we wanted them to: green customers order on average five times a month, blue three, and yellow seven.

And now the magic happens. We're going to use a library called CausalImpact, from the fine people at Google. It basically does what we just saw: it weights the different potential control groups to create a counterfactual for the treatment group, by looking at a time series of what happened in the past.

This is the output of that code; let's review what we're seeing. Before the dotted line, we see the average green orders before the introduction of free delivery, and here we see the actual average green orders after. What we also see is the counterfactual average of green orders after: this purple shaded band is the counterfactual, the orders those users would have made had no free shipping been introduced. That's fairly easy to believe, because you can see that for a while nothing changed; everything was at a steady state, so it's easy to believe that this is what would have happened. You can also see the effect size here. That is a super, super simplistic way to convey the idea of synthetic controls; it's so simplistic that it's even wrong, and I'll leave it to you as an exercise to understand why, but here's a hint. Again, ping me on Twitter or LinkedIn if you want to discuss it more.

So that was a very simplistic example; let me share a more interesting one. One of the most famous studies into the use of synthetic controls wanted to see the effect of a law that was passed in California (Abadie, Diamond and Hainmueller's study of Proposition 99, a tobacco-control law). What the authors did, basically, is generate a synthetic California by weighting different states; on the left you can see a table with the different states and the weights they got. Diving in, the dotted line is where the treatment was introduced, and the dashed line is the counterfactual. Before the treatment, the actual performance of California is the black line, and all of those gray lines are the very noisy behavior of the individual states. But weighted together, you can see that the counterfactual that was created is fairly robust; it mimics very well what California looked like.
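For reference, here is a hedged sketch of what driving CausalImpact on this example might look like. It assumes the Python pycausalimpact package (`pip install pycausalimpact`); the daily series, the 50/50 blue-yellow mix behind green, and the effect size are all simulated stand-ins for the talk's dummy data:

```python
import numpy as np
import pandas as pd
from causalimpact import CausalImpact  # e.g. the pycausalimpact package

rng = np.random.default_rng(7)

# 100 days of average daily orders per segment, simulated around the
# made-up means from the example: blue ~3, green ~5, yellow ~7.
n_days = 100
blue = rng.normal(3, 0.3, n_days)
yellow = rng.normal(7, 0.3, n_days)
green = 0.5 * blue + 0.5 * yellow + rng.normal(0, 0.1, n_days)

# Free delivery for green-wool customers starts on day 70: add the lift.
treatment_start = 70
lift = 2.0  # made-up effect size
green[treatment_start:] += lift

# CausalImpact expects the treated series in the first column and the
# control candidates (here blue and yellow) in the remaining columns.
data = pd.DataFrame({"green": green, "blue": blue, "yellow": yellow})
pre_period = [0, treatment_start - 1]
post_period = [treatment_start, n_days - 1]

ci = CausalImpact(data, pre_period, post_period)
print(ci.summary())
ci.plot()  # actual vs. counterfactual, like the chart in the talk
```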
So what we can see is that if we trust that the match is good, then we can say: well, here's the impact, because this is what would have happened, per our counterfactual.

That was a lot of information to convey, so let's do a quick recap. We saw a couple of methods, PSM and synthetic controls; before-and-after and diff-in-diff are also fine-ish options, and you can read more about them in the appendix. Now let's talk about what to do next. Basically, when you take the plunge and take the next steps, there are two things to cover: breadth and depth. There are so many things we didn't cover; some of them are in the appendix, others are listed here, and there are so many more. And then there's also depth. For example, we saw a very simplistic implementation of PSM, but I didn't cover how we know that the balancing worked nicely, or whether it's good enough; how we should match, and how close is close enough when we match by propensity scores, and so on and so forth. There are all sorts of edge cases, considerations and aspects you can spend a whole lot of time learning when you actually want to implement propensity score matching, and the same applies to synthetic controls, respectively.

The final thought I want to leave you with today is this: we saw the fundamental problem of causal inference, and we also saw that we have fancy, shiny tools, backed by mathematics, to address those issues. But what I want you to keep in mind, what we must not forget, is that we only ever live in one world. So whatever result we get from tools like PSM or CausalImpact, we will never truly know if the effect size we found is real, if it's accurate, if it's precise. And I don't want to get too philosophical, but if you think about chaos theory, the outcomes of a system can vary massively even if there's a slight difference in the starting conditions. So I'd like to invite you to reflect on that for a bit, without getting too philosophical. Here's where you can find the code, and here's how you can find me. Thank you very much for listening today; I enjoyed giving this talk and I hope you did too.

All right, fantastic. I loved your talk, and especially the Amazon example; I was laughing to bits, I have to say. So we might not be able to take a lot of questions, but I just have one for you right now. In this entire realm of causal inference, what do you think are the key limitations? What's that one limitation that sort of prohibits us from going further in this particular field, if that makes sense?

So, what I like about causal inference is that it's still very much alive; we haven't figured out all the answers. We are still working on things: there's a lot of vibrant research, a lot of papers coming out, and a lot of people in different places, in academia and in the private sector, suggesting new methods and criticizing old ones. And we haven't discussed, for example, how to validate that what you're doing actually makes sense, or how to refute your models. So I think that is the limitation, but also kind of the most exciting thing. We don't have all the answers right now, but we're working on them, and we're doing cool stuff, again both in academia and in the private sector. So we're getting there.
And obviously, I know that we are short on time, but one thing I like to remind myself when I go down rabbit holes about these things is that in Sliders, for example, they would have one thing change in history and everything else stayed the same. For example, the Golden Gate Bridge was colored blue instead of orange, and that's the only difference between the two worlds. In reality, we have chaos theory, which says the outcomes of a system could be very, very different if there is a slight difference in the initial conditions. I think that's both really, really cool and really, really frustrating at times, but it's what makes the work on causal inference this fun treasure-hunting exercise. I hope that makes sense.

That does, that absolutely does. Well, I know that I'm going to find you after I'm done chairing all the sessions. I hope you'll be in the breakout room for any other questions, so everyone else, feel free to find Alon, and thank you so much again. Thank you for having me.