This video is sponsored in part by Kite. They provide a code completion service for machine learning code. It integrates really well with your editor and even Jupyter Notebooks, so click the link in the description to try Kite for free. Now back to the video. As a data scientist, you come across companies with different goals. Two goals of every company are to get more customers and to not lose the customers they have. We can gain customers with good products, name recognition, and good service. But it's hard to please everyone. Losing customers is a reality, and in the industry it's called customer churn. Today, let's take a look at customer churn and see if we can use machine learning and data science to better understand and predict it. The goal of this video is to get you thinking like a data scientist. Let's not just throw stuff into models. Let's understand how to tackle this problem end to end. I'm taking customer churn as the problem since it's a very common problem across industries. Hopefully you'll be able to apply this strategy to other problems as well. So let's do this. Before we move on, just a quick favor. Can you destroy that like button for the YouTube algorithm gods to pick up? That would be lovely. The more likes videos like this get, the more the algorithm will be like, hey, this video is pretty sick, and send it out to people like yourself. By just hitting that like button, you're helping us all out as a community. That is so much appreciated, so thank you. So first, what's the business? You're not a random Kaggler messing around with a dataset. You are a data scientist at a company. You have a business. You have an objective. To begin, let's talk about the company you're working for, so that it's easier to understand and formulate the problem of customer churn. Oh, and congratulations, by the way. You got the job. You work for a laptop repair company that your grandma founded. It's called Grandma Fixes.
A customer has a broken laptop. They place a work order online and ship the laptop to the warehouse. Grandma and her workers fix the laptop and send it back to the customer. So now you have a business in mind, which makes it easier to think about customer churn. So, what is customer churn here? Customers churn when we lose them. This is true, but we can come up with a more precise definition than that. Maybe something like: customers used to place work orders, they stopped placing work orders, and that's when we know they have churned. Okay, this is getting better, but as a data scientist, I want something more concrete and actionable. Let's start by introducing a timeframe. Maybe something like this could work: customers who haven't placed an order in the last X days are said to have churned. Now this is more like it, but what is this X? Let's say you have an entire store of data that you can query. Can you think of a potential way to figure out X? Think about this for a second. If it isn't defined by the business, one way we can do this is with SQL. Take all the customers in the last year with at least two orders, and find the average time between work orders for each of them. For every customer, we then have a number: their average time between orders. Then we can take something like the 90th percentile of these numbers. Let's just say that, from the data, this number was three months. This would imply that last year, 90% of customers placed a work order within three months of their previous order. So we may be able to use this in our definition of customer churn: if a customer doesn't place a work order within three months of their last order, we will consider them to have churned, or to have left us. Remember, this is just one of many ways you can devise a definition for customer churn. But it's a solid example.
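The 90th-percentile idea above can be sketched with SQL. Here's a minimal, self-contained version using SQLite from Python; the table name, column names, and toy data are all invented for illustration, and the percentile is taken with a simple nearest-rank rule:

```python
import sqlite3

# Toy work-order data; the schema and values here are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work_orders (customer_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO work_orders VALUES (?, ?)", [
    (1, "2023-01-01"), (1, "2023-02-15"), (1, "2023-04-01"),
    (2, "2023-01-10"), (2, "2023-06-10"),
    (3, "2023-03-01"), (3, "2023-03-20"), (3, "2023-05-05"),
])

# Average gap (in days) between consecutive orders, per customer,
# restricted to customers with at least two orders.
rows = conn.execute("""
    SELECT customer_id,
           AVG(julianday(order_date) - julianday(prev_date)) AS avg_gap
    FROM (
        SELECT customer_id, order_date,
               LAG(order_date) OVER (
                   PARTITION BY customer_id ORDER BY order_date
               ) AS prev_date
        FROM work_orders
    )
    WHERE prev_date IS NOT NULL
    GROUP BY customer_id
""").fetchall()

# Nearest-rank 90th percentile of the per-customer average gaps.
avg_gaps = sorted(gap for _, gap in rows)
idx = min(len(avg_gaps) - 1, int(0.9 * len(avg_gaps)))
x_days = avg_gaps[idx]
print(x_days)  # → 151.0 (days), for this toy data
```

In a real warehouse the same window-function query would run directly against the orders table, with an added filter to the last year of data.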
We've converted vague business terminology into a queryable, actionable, concrete definition. That's the goal here. Good job. Now we can get into some machine learning bits. So let's say we have a perfect model. We first need to answer a few questions. What are the inputs? What are the outputs? What do we do with the predictions? And when are we making these predictions? While you're watching this video, it might help to open a Google Doc and write down your answers to these questions. Let's go through them together. So the inputs, what are they? The inputs here are the features for a user. Which features exactly? We will brainstorm those in the next steps. Next question: what are the outputs? Given the features of a user, the model should output the probability that this person will churn, that is, not place a work order within three months of their last order. We are treating this as a binary classification problem. There are definitely other ways to frame this problem. For example, we could predict the number of months until a customer churns, which makes it a regression problem. But I chose the binary classification approach for two main reasons. The first is that regression can give noisier results than binary classification. And the second is: do we really have a use for a regression output in this context? To know that, we need to answer what we are going to do with the predictions. And that's the next question. So what are we actually going to do with these predictions? What are we using them for? One way to think about this is a marketing campaign. We can run the model once a week or once a month and send emails to users that the model thinks will churn in the next three months. This probably includes a promotion in the email: hey, we miss you, don't leave, here's a discount. Just something to prompt the user to stay on the platform, you know?
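As a sketch of how the predictions would feed that campaign: assuming the model outputs a churn probability per user, the periodic job just picks everyone above a chosen threshold. The names, scores, and threshold below are all invented:

```python
# Hypothetical per-user churn probabilities from the model (made-up values).
churn_prob = {"sam": 0.85, "lisbeth": 0.72, "alex": 0.10}

# Email everyone whose predicted churn probability exceeds a chosen threshold.
THRESHOLD = 0.5
campaign_list = [user for user, p in churn_prob.items() if p > THRESHOLD]
print(campaign_list)  # → ['sam', 'lisbeth']
```

The threshold itself is a business decision: lowering it emails more users (catching more true churners, but also more who would have stayed anyway).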
And I think I've answered the question of when we are making the predictions: probably once a week or once a month. Now, coming back to my reasoning for the classification approach: I talked to marketing, and apparently they have one email they send to everyone who churns. So even if we have a perfect regression model that says Sam will churn in three months and Lisbeth will churn in five months, it really doesn't matter. They're both getting the same email anyway. So I could have just said Sam is going to churn and Lisbeth is going to churn, and it would have the same business impact. This is another reason why data science problems are more than just machine learning. You need to think about the business impact and the business context when making decisions about your models too. Now we have a concrete definition of the problem to solve. Good job so far. Next, let's think about building the features. We know the model takes some user-related information as input to determine whether a user will churn or not. But what exactly does it take in? I think it's best to open a Google Sheet with three columns. The first column is a potential feature. The second column is your hunch on how this feature affects the label, which is customer churn. And the third column is you getting your hands dirty with SQL to confirm your hunch. Think with me about some potential features of a user that could be indicative of customer churn. I'll give you a sec. One feature could be the number of days since the last work order. The hunch here is that users who haven't placed an order in a while are more likely to churn. A second feature could be the number of work orders the user has placed in, say, the past six months. If a user has placed more work orders in the recent past, chances are they will continue to do so in the near future, and hence they are less likely to churn. Like this, you can conjure up more features depending on your data.
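The two features above can be computed directly from a user's order history. Here's a minimal sketch in plain Python; the function name and the 182-day approximation of six months are my own choices:

```python
from datetime import date

def compute_features(order_dates, today):
    """Two illustrative churn features for one user.

    order_dates: dates of the user's past work orders (at least one).
    """
    order_dates = sorted(order_dates)
    # Feature 1: days since the last work order.
    days_since_last = (today - order_dates[-1]).days
    # Feature 2: number of orders in the last six months (~182 days).
    recent_orders = sum(1 for d in order_dates if (today - d).days <= 182)
    return days_since_last, recent_orders

feats = compute_features(
    [date(2023, 1, 5), date(2023, 5, 20), date(2023, 8, 1)],
    today=date(2023, 9, 1),
)
print(feats)  # → (31, 2)
```

In production these would come out of the same SQL queries you used to verify the hunches, but the logic is the same.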
And once you have an assortment of features, you'd want to verify with some SQL whether your hunches were right. And we are good here. So great, we know the features. We now want to train this model and make predictions once a week or once a month. Let's say we have the two features and the churn label. How do we build the dataset? For training, you need to think historically. For testing, you need to think in the present. It's easier to think in the present, so let's start with that. Let's say we already have this trained model ready to go. If I wanted to make a prediction for Sam, I could compute the days since his last work order and the number of his recent work orders and feed them to the model as features. The model would then spit out a probability of churn. Like Sam, I would do this for all active users on the platform. So yeah, the present situation is simple enough. Now let's think about this historically. For every day in the past, I could get the users active at that time, compute their features, and also compute the labels by checking whether each user actually created a work order in the following three months. It's historical data, so we know this information. But doing this for every day creates a lot of redundant data. On January 1st, Sam was likely to churn in the next three months. On January 2nd, Sam was again likely to churn in the next three months. And the 3rd of January, and so on. Lots of redundancy. As an alternative, we can sample 100 days in the year and, for each day, determine the active users and compute their features and labels at that time. So why did I sample 100 days instead of just taking, say, every Monday? Well, if I did take Mondays, for example, I might insert a weekly bias into the data, and I really don't want that. And in this way, we get the training set. Now, for training the model, there are a couple of gotchas that you have to be a little wary of.
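The sampling and labeling steps above can be sketched in a few lines. This is a toy version under my own assumptions: the year, the 90-day churn horizon, and the labeling rule (1 = churned) are stand-ins, and in practice each sampled day would also be joined against the active-user set:

```python
import random
from datetime import date, timedelta

random.seed(0)

# Sample 100 days from the year uniformly at random, rather than (say)
# every Monday, to avoid baking a day-of-week bias into the training set.
year_start = date(2023, 1, 1)
all_days = [year_start + timedelta(days=i) for i in range(365)]
snapshot_days = random.sample(all_days, 100)

def churn_label(order_dates, snapshot, horizon_days=90):
    """1 if the user placed no work order within `horizon_days` after `snapshot`.

    Assumes `order_dates` holds all of the user's orders and that the
    snapshot is at least `horizon_days` in the past, so the label is known.
    """
    horizon_end = snapshot + timedelta(days=horizon_days)
    placed_order = any(snapshot < d <= horizon_end for d in order_dates)
    return 0 if placed_order else 1

label = churn_label([date(2023, 2, 1), date(2023, 7, 15)], snapshot=date(2023, 3, 1))
print(label)  # → 1 (no order between March 1 and late May)
```

Each (user, snapshot day) pair then contributes one training row: the features computed as of that day, plus this label.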
So we can't shuffle the data across the train/test split. It's typical in machine learning problems to shuffle data, but if we were to do that here, we would potentially be training on data from the future and predicting the past. This is a common problem in data science, and it's called data leakage. We want to avoid it at all costs to ensure that the model does well not only in evaluation, but also in the real world. Another gotcha is that we can only train on data up to three months in the past, since for anything more recent than that, we don't yet know whether the customer actually churned or not. We don't have those labels. Now, when evaluating the model, we can throw terms like precision and recall around. But remember, in the real world you're talking to people who don't know exactly what these terms mean, so let's break them down. Precision is: of the users we said would churn, how many of them actually churned? And recall is: of the users who churned, how many of them did we say would churn? With these definitions concretely in mind, we can now report the stats of our current model. The model is by no means perfect, and you shouldn't expect it to be, but acting on its output at least gets us closer to the overarching goal of not losing customers. Hope you all enjoyed this data science case study, and until next time, bye-bye.