Hey guys! In our last video, I talked about 5 tips for data scientists, but let's actually get our hands dirty by analyzing some data related to finance. There are many facets of data related to finance, but I'll be analyzing one part in detail, and hopefully you'll be able to apply the same principles anywhere else, so let's get to it. First, let's define a problem. We're going to analyze data for a purpose. So what is this purpose? In this case, I want to analyze customer churn. Customers churn when they terminate services from a company. My purpose is to determine who is churning, why they are churning, what we can do to reduce it, and whether we can predict when a customer will churn based on transaction history. So that's great, we have our purpose, so what do we do next? Second, explore our data and understand exactly what it is and what it represents. At this stage, we should not worry too much about the purpose of the churn analysis. Let's just try to understand the data we have in general, and I do this by plotting some big stats. For this analysis, we're going to use a simple Kaggle data set. Here's my understanding of it. Every row represents a customer who has ever done business with us, Telco. These customers could still be doing business with us, in which case they're active users, or they're not, in which case they are churned users. Telco offers two services, phone and internet. I'll run you through this notebook of my analysis and then explain the details through a presentation. So don't worry, you'll see how it all comes together in the end. To get an idea of how many users use our phone services, our internet services, or both, I plot pie charts for some ballpark numbers. Internet users can have one of two types of connections, a fiber optic connection or DSL. We can plot the distribution of users for each service. Now we have some understanding of our data. So what's next?
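To make the exploration step a bit more concrete, here is a minimal sketch in Python with pandas. The rows are made up for illustration, and the column names (PhoneService, InternetService, Churn) are assumptions; the video does not show the actual schema of the Kaggle data set.

```python
import pandas as pd

# Toy rows, made up for illustration. The column names are assumptions,
# not taken from the video.
df = pd.DataFrame({
    "PhoneService": ["Yes", "Yes", "No", "Yes", "Yes", "No"],
    "InternetService": ["Fiber optic", "DSL", "DSL", "No", "Fiber optic", "DSL"],
    "Churn": ["No", "Yes", "No", "No", "Yes", "Yes"],
})


def service_mix(row: pd.Series) -> str:
    """Label a customer by which services they have."""
    has_phone = row["PhoneService"] == "Yes"
    has_internet = row["InternetService"] != "No"
    if has_phone and has_internet:
        return "phone + internet"
    if has_phone:
        return "phone only"
    if has_internet:
        return "internet only"
    return "no service"


df["services"] = df.apply(service_mix, axis=1)

# Ballpark numbers: how many users fall into each service mix
service_counts = df["services"].value_counts()

# Among internet users only: fiber optic versus DSL
has_internet = df["InternetService"] != "No"
internet_type_counts = df.loc[has_internet, "InternetService"].value_counts()

# Share of each service mix, computed separately for active and churned users
mix_by_status = pd.crosstab(df["Churn"], df["services"], normalize="index")

print(service_counts)
print(internet_type_counts)
print(mix_by_status)
```

With matplotlib installed, something like `service_counts.plot.pie(autopct="%1.0f%%")` would turn the first table into a pie chart.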
Third, get back to the question of analyzing customer churn and don't get sidetracked. To achieve our goal, I establish comparisons between different sets of active and churned users. I want to see if there's a difference in how long they spend with us. So I compare active internet users and churned internet users and do the same for phone users too. So we can see here that these box plots have the median lifetime of active users larger than that of churned users. But we cannot simply look at these box plots and just say that the lifetime of an active user is greater than that of a churned user. We need to first see if this difference is statistically significant. This is done with hypothesis testing. Gotta love those p-values. I'll prepare a separate video on this, but here's a short explanation. The type of test you conduct depends on the data you have. For comparing these two groups, you'd think of performing something like a t-test, but that test assumes that the data is normally distributed and that both distributions have equal variances. In most of these cases, the test for normality fails. So I have the option of transforming my data toward a normal distribution with techniques like Box-Cox. But even that fails in my case. So I use the Mann-Whitney U test, which only requires that the data is IID, that is, independent and identically distributed. Since the p-value obtained from the comparison is less than 0.05, we can reject the null hypothesis of the U test, which states that both groups belong to the same distribution. In other words, we have strong evidence that active user lifetime and churned user lifetime come from separate distributions. And we can make the claim that the median active user has a longer lifetime with us than the median churned user. Similarly, I make the comparison between internet versus non-internet users and phone versus non-phone users.
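Here is a rough sketch of that testing workflow using SciPy. The tenures are synthetic numbers drawn from skewed distributions, not the Kaggle data, and the variable names are hypothetical. The Shapiro-Wilk test stands in for the test for normality, which is my assumption since the video does not say which normality test was used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic tenures in months (made up, skewed on purpose)
active = rng.exponential(scale=30, size=300).round() + 1
churned = rng.exponential(scale=10, size=300).round() + 1

# Normality check (Shapiro-Wilk is an assumption; the video does not name the test).
# A small p-value suggests the sample is not normally distributed.
_, p_normal_active = stats.shapiro(active)
_, p_normal_churned = stats.shapiro(churned)

# Optional: try a Box-Cox transform (needs strictly positive data) and re-check
active_bc, _ = stats.boxcox(active)
_, p_normal_active_bc = stats.shapiro(active_bc)

# Mann-Whitney U test: does not assume normality
u_stat, p_value = stats.mannwhitneyu(active, churned, alternative="two-sided")

print(f"normality p-values: active={p_normal_active:.3g}, churned={p_normal_churned:.3g}")
print(f"after Box-Cox: active={p_normal_active_bc:.3g}")
print(f"median tenure: active={np.median(active)}, churned={np.median(churned)}")
print(f"Mann-Whitney U={u_stat:.0f}, p={p_value:.3g}")
```

If the Mann-Whitney p-value is below 0.05, we reject the null hypothesis that both samples come from the same distribution.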
And finally, a three-way comparison between phone users, internet users and internet plus phone users. Now, multiple groups can be compared with tests like ANOVA, analysis of variance. But I do a pairwise Mann-Whitney U test here because of the normality constraint violations. I could also use the Kruskal-Wallis H test, which is a generalization of the Mann-Whitney U test for multiple groups. But the hypothesis is too weak for me to tell anything useful. The null hypothesis states something like the median of all three groups is the same. So even if I reject this hypothesis, it just means that at least two of the groups have statistically different medians, but we still won't know which groups from the test alone. So this leads me to use the pairwise Mann-Whitney U test. Anywho, say that you've done your analysis. Now what? Well, if you show this notebook to other people, it's gonna be pretty difficult to read. In all this code and explanation, we can't see which details are important and which are not. So you need to dig through the entire thing once again, point out which parts stand out, and capitalize on that. We'll do this by compiling it into a set of presentation slides, which is step four. I ran through everything before, but let me emphasize my top findings in this presentation. I'll first kick things off with some big stats. As Telco, we have customers coming for two types of services, internet and phone services. Here are two pie charts showing the number of internet users, phone users or users of both services, each plotted for current active users and churned users. Each slice has a number along with a proportion of users. The big takeaway: 85% of our users who left us had both internet and phone services. And currently, such users contribute to over 60% of our business. Now note, just because 85% of our users who churned had both internet and phone services doesn't mean the combination internet and phone is bad.
They still constitute a large 63% of our active user base after all. Similarly, just because only 6% of churned users had phone services alone doesn't mean that the phone-only policy is good. That group just doesn't have as many users to begin with. I plotted something similar for active and churned phone users. These phone users may or may not have an additional internet plan. It could be fiber optic, DSL, or just no internet. Interesting to note that 76% of our churned phone users also had fiber optic internet services. Now we compare the tenure, or lifetime, of active and churned users. By lifetime, I mean the number of months for which they stayed with us at Telco. The plots on the left are active and churned internet users. And the plots on the right are active and churned phone users. I wrote the median lifetime in months for each plot in red. And under each plot, I wrote the p-value from the test of statistical significance. In this case, it's the Mann-Whitney U test. Note here, I don't put the actual p-value. I just state that it is less than 0.001. This is how you should report p-values. If it's greater than 0.05, just write the actual value, which denotes no statistical significance. If it's between 0.001 and 0.05, then write the actual value again, denoting statistical significance. And for anything less than 0.001, just write less than 0.001 to show strong statistical significance. There's no need to write out 2.6e-10 or whatever the actual p-value is. Since the p-value from the comparison of active and churned internet users is statistically significant, we can reject the null hypothesis of the U test, which states that the distributions of both populations are equal. Hence active internet user lifetime and churned internet user lifetime come from different distributions. The same can be said for active phone users and churned phone users. They belong to different distributions.
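The p-value reporting rule described above can be sketched as a small helper. The function name and the exact output strings are hypothetical; only the thresholds (0.001 and 0.05) come from the video.

```python
def report_p_value(p: float, alpha: float = 0.05) -> str:
    """Format a p-value for a slide, following the rule described above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        # Tiny values such as 2.6e-10 are reported as a bound, not spelled out
        return "p < 0.001 (significant)"
    label = "significant" if p < alpha else "not significant"
    return f"p = {p:.3f} ({label})"


print(report_p_value(2.6e-10))
print(report_p_value(0.0312))
print(report_p_value(0.27))
```

Keeping the rule in one function means every plot in the deck reports p-values the same way.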
So our current internet and phone users have stayed with us longer than our churned internet and phone users. This slide looks like the last, but this time on the left, we're comparing active internet users to active non-internet users. Since the difference is statistically significant, indicated by the low p-value, these two distributions are different. And hence we can say that our internet users have been with us longer than our non-internet users. We see a similar case in churned users too. So the internet users also stayed with us longer than the users who didn't have an internet subscription. Once again, we have a similar slide. But instead of comparing internet and non-internet users, we compare phone and non-phone users. The graph on the left shows the comparison of lifetime of current active phone users and current active non-phone users. Since the p-value after performing the U test is not significant, we cannot reject the null hypothesis. So we really can't establish a difference in tenure with Telco for these phone and non-phone active users. For churned phone and non-phone users, however, the p-value is significant. And hence we can say that our churned phone users stayed slightly longer with us than our churned non-phone users. Now we can make the comparison between three box plots. They denote the lifetime of users who had both internet and phone services, just internet services, or just phone services. We performed the pairwise Mann-Whitney U test and determined that all p-values are significant. Thus we can say that the users who take both internet and phone services from us stay longer with us than those who just take internet services from us. These users in turn stay much longer than those who just take phone services from us. Interestingly, among our churned phone-only users, more than half of them stayed with us for only a month after activation. Like I mentioned before, our internet users have two types of services, fiber optic and DSL.
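For the three-way comparison, here is a minimal sketch of a Kruskal-Wallis H test followed by pairwise Mann-Whitney U tests with SciPy. The group tenures are synthetic and the names are hypothetical. The Bonferroni correction at the end is my own addition as a common precaution when running several tests; the video does not mention any correction.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic tenures in months for the three groups (made up for illustration)
groups = {
    "internet + phone": rng.exponential(scale=30, size=200),
    "internet only": rng.exponential(scale=15, size=200),
    "phone only": rng.exponential(scale=5, size=200),
}

# Omnibus test: tells us that some group differs, but not which one
h_stat, p_omnibus = stats.kruskal(*groups.values())

# Pairwise Mann-Whitney U tests: tell us which pairs differ
pairwise = {}
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    _, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    pairwise[(name_a, name_b)] = p

# Bonferroni-adjusted threshold (an assumption, not from the video)
alpha = 0.05 / len(pairwise)

print(f"Kruskal-Wallis H={h_stat:.1f}, p={p_omnibus:.3g}")
for pair, p in pairwise.items():
    verdict = "different" if p < alpha else "no evidence of a difference"
    print(f"{pair[0]} vs {pair[1]}: p={p:.3g} -> {verdict}")
```

The omnibus test alone only says that at least one group differs, which is why the pairwise tests are the ones that end up on the slide.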
The goal of this slide is to compare the lifetime of users with different types of internet services. Perhaps users of a specific type of internet service tend to churn sooner than the rest. Well, in this case, we find that our users with the fiber optic internet service stay with us longer than our DSL users. For our internet users, we give them an option to opt in or opt out of our technical support. I thought this might have an impact on customer tenure, so I compared the lifetime of users with tech support and those without tech support. And I find that users with tech support stay much longer than those without tech support. Now I've only mentioned a few of these facts in the presentation so far, but the list of facts that we can extract is endless. Facts are amazing, but how does knowing these facts help us solve or at least mitigate the problem of customer churn? Well, this is another thing to take care of. Point number five, coming up with solutions. And so I take some of these facts and try to think of what we can do as Telco to decrease customer churn. I mentioned before that users with internet and phone services stay with us longer than users with just one of the two services. What we could do is, when a user signs up with us just for our phone service, entice them to sign up for an internet subscription as well, selling both as a package deal. Perhaps more users would sign on and stay longer than they do now. Another fact I pointed out earlier: over half of our past exclusive phone users churned within their first month of activation. Knowing this, what can we do to mitigate customer churn? What we could do here is improve our phone service policy, perhaps by including new features. But since our current exclusive phone users have been with us for over two years on average, I don't think this is the problem anymore. Then again, new services are always something to think about.
We know that our fiber optic internet users stay longer than our DSL users. Both services have their advantages. The fiber optic connection is fast, while DSL is reliable and affordable. Customers require internet services that suit their needs, whether it's bandwidth, location, usage, or price. So for our larger businesses or larger bandwidth-consuming individuals, we can offer discounted fiber optic package plans. This would better tailor internet services to make them more customer centered. And thus we could expect less churn. Another fact that we found out was that users with technical support churn later than those without. So to increase customer time with us, we can offer a 12-month free tech support subscription. This way they'll be more satisfied with Telco, stick around longer, and even willingly subscribe to this technical support after a year of good service. So yeah, that's my mini analysis and presentation with this Kaggle data. It's always fun playing around and seeing what insights you can come up with. The link to everything is down in the description below, so check that out. Now this is great, but can we use machine learning in this? Well, a useful application I can come up with is building a churn predictor. Given customer information, predict how likely he or she is to churn in the next month. However, I would need the user's behavior over time. Where were they before Telco? When did they join us? Now given this information, we could extract features and throw that into our model. But since we only have access to one tuple per customer, we cannot build this classifier just given this data. So I'm not going to force a model out of this. Now here are the general points to keep in mind while conducting this type of analysis. First, define your problem. Have a specific goal in mind. What do you want to get out of this analysis? Number two, don't be afraid to explore your data.
Getting to know your data is very important, so don't let the goal define how you understand your data. Perhaps something you thought was unrelated may actually help in the analysis. Three, when starting the analysis, don't lose sight of your goal. After you know your data, conduct the analysis by always keeping your goal in mind. There's no point in doing a bunch of random shallow analyses. Number four, presentation matters. You may have done this amazing complex analysis on data, but if no one understands your work, then it's like you didn't even perform the analysis in the first place. Number five, come up with solutions. As a data scientist, you should try to come up with some business solutions based on the insights. Domain knowledge always helps and it does take some time to research. Number six, model if you need to, not if you want to. Machine learning isn't always the answer. If you feel that a model will help address your goal, though, then it is something to consider. Customer churn is one of the many facets of what you could be dealing with as a data scientist in finance. I'm hoping that going deep into one aspect can help you analyze data similarly in any field of finance. And that's all I have for you now. So if you like what you saw, hit that like button. If you're new here, welcome and hit that subscribe button. I got some cool links in the description, so check them out. Still looking for your daily dose of AI? Then click or tap one of the videos right here for an awesome video and I will see you in the next one. Bye!