Hey guys, in this video we're going to take a look at a quintessential topic in statistics that every data scientist should know: hypothesis testing. We'll start with a description of a hypothesis test and then move on to its real applications. So, stay tuned.

Regardless of application, a hypothesis test involves the same set of steps. First, make an initial assumption. Then, collect data. And after that, gather evidence to reject or not reject the initial assumption. This initial assumption is the null hypothesis, which we always assume to be true. In a court of law, for example, the null hypothesis could be that the defendant is innocent. In stats, we also determine a competing alternative hypothesis. In essence, the union of these two statements should cover everything that could be true. In our court scenario, the alternative would be that the defendant is guilty.

We then collect evidence: things like fingerprints from the crime scene, luminol tests, or shoe prints. If the jury finds sufficient evidence to refute the initial assumption, then we reject the null hypothesis and proceed as though the defendant is guilty. However, if the jury does not find sufficient evidence to refute the initial assumption, then we cannot reject the null hypothesis, and the defendant is treated as though he or she is innocent. Note that I word this carefully. Just because we cannot reject the null hypothesis does not mean the null hypothesis is true. It's just that the evidence provided is not sufficient to reject it. So our decision all rides on the evidence provided. The stronger the evidence, the more reliable the outcome.

Of course, since the evidence never encompasses everything about every situation, there are times when the outcome of the test is just wrong. There are two ways to make errors: we may reject the null hypothesis when it is actually true, or we may fail to reject the null hypothesis when it is actually false. These are called type I and type II errors, respectively.

Your question might be: well, this is cool, but which one's worse, type I or type II? And the answer is, it depends on the context. In our court example, it's better to treat a guilty person as innocent than to lock an innocent person behind bars. So the type I error of rejecting the null hypothesis when it is actually true can be catastrophic. On the other hand, say you're a real estate investor trying to predict whether house prices will crash in the next month. The null hypothesis would be that the market will not crash; this is the initial assumption and is assumed to be true. The competing alternative hypothesis would be that the market crashes in the next month. We gather evidence like the mean loan amount taken to purchase property, the loan-to-value ratio for different transactions, the number of cash transactions over time, and so on. But say we fail to reject the null hypothesis when it is actually false. In other words, we fail to predict the crash. Such a type II error is much more catastrophic than the false-positive type I error. So yeah, type I error versus type II error, which is worse? It depends on the context.
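If you want to see both kinds of errors in action, here's a minimal simulation sketch, my own illustration rather than anything from the video. The effect size, sample size, spread, and number of trials are all assumptions chosen for demonstration, and it leans on a two-sample t-test and a conventional 5% cutoff, both of which we'll properly meet in a moment.

```python
# A minimal sketch: estimate type I and type II error rates by simulation.
# All numbers (effect size, sample size, cutoff) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05            # significance level (introduced properly below)
n, trials = 50, 10_000  # samples per group, simulated experiments

type1 = 0  # null hypothesis true, but we reject it (false positive)
type2 = 0  # null hypothesis false, but we fail to reject it (false negative)

for _ in range(trials):
    # Case 1: null is true -- both groups share the same mean.
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        type1 += 1

    # Case 2: null is false -- group b has a slightly higher mean.
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.3, 1.0, n)
    if stats.ttest_ind(a, b).pvalue >= alpha:
        type2 += 1

print(f"Type I error rate:  {type1 / trials:.3f} (should land near alpha = {alpha})")
print(f"Type II error rate: {type2 / trials:.3f} (depends on effect and sample size)")
```

Notice the asymmetry: the type I rate is pinned near whatever cutoff you choose, while the type II rate depends on how big the real effect is and how much data you have.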
So I said that we can either reject or not reject the null hypothesis. But how do we decide this? This is where your good old p-value comes in. So what is a p-value? Say the null hypothesis is true, and we compute some test statistic t. The p-value is the probability of observing a test statistic at least as extreme as t in the direction of the alternative hypothesis.

In our court example, assume that the defendant is innocent. Then we gather some evidence. The p-value represents how likely we would be to see evidence this extreme if the defendant really were innocent. So if the p-value is small, that evidence is very unlikely under innocence, and we reject the null hypothesis and treat the defendant as though he or she is guilty. If the p-value is large, we cannot reject the null hypothesis.

So how do we know what counts as a small or a large p-value? We set a threshold, commonly denoted by alpha and called the significance level. It's the probability of making a type I error, that false-positive error we were talking about. We typically set it to 5%, or 0.05.

So we've seen what exactly a hypothesis test is and even how to conduct it yourself. But can this only be used in a court of law? Well, no. A typical application is comparing different groups. Let's say Netflix wants to conduct an experiment. They want to increase the average daily user watch time, and they have this brilliant idea of introducing a new feature. But they don't know if this new feature will have a positive or a negative impact. So instead of introducing the feature to everyone, they release it to only a subset of a thousand users. In statistics, the users who have the new feature form the experiment group, while the other users, who don't have the feature, form the control group. To evaluate the effect of the new feature, we compare the average daily watch time of the control users with that of the experiment users, and typically we compare their mean values.

Now that we've painted a picture, let's state our problem: does this new feature affect the average daily user watch time? We can determine this by following the steps for carrying out a hypothesis test.

First, come up with the initial assumption. Our initial assumption, the null hypothesis, would be that the new feature has no effect. Since we compare mean values, it is stated as: the mean of the control users is equal to the mean of the experiment users. And the alternative hypothesis is just that the means are unequal. Note that I could have also set this up as a one-sided test, checking only whether the mean of the experiment group is greater than that of the control group. However, that would hide any negative effect of the new feature.

Now that we've formulated the null hypothesis, we gather data. This means determining the daily watch time for every user over the course of the experiment and averaging these watch times. Every user then constitutes one data point in either the control group or the experiment group, and these points are used to generate the distributions.

Now that we've gathered data, we need to extract evidence to either reject or not reject our initial assumption. We do this by computing a test statistic and a p-value. This type of test, comparing the means of two groups, is the t-test. So the test statistic is t, and we use it along with the degrees of freedom to determine the p-value. In this histogram, I plot the average daily watch time for 1000 control users and 1000 experiment users. The mean is about one hour for control users and one and a half hours for experiment users. But we need to make sure this difference of 30 minutes is statistically significant before we say one is greater than the other. We can check this using the t-test.
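Here's roughly what that check could look like in Python, on simulated data. The group size of 1000 and the means of one hour and one and a half hours match the example; the standard deviation of half an hour is my own assumption, made up purely for illustration.

```python
# A sketch of the Netflix-style two-sample t-test on simulated watch times.
# Means and group sizes follow the example; the spread is an assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control    = rng.normal(1.0, 0.5, 1000)  # avg daily watch time in hours
experiment = rng.normal(1.5, 0.5, 1000)  # group with the new feature

t_stat, p_value = stats.ttest_ind(control, experiment)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")

alpha = 0.05  # 5% significance level
if p_value < alpha:
    print("Reject the null hypothesis: the means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
```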
Since the p-value is less than 0.001, at a 5% significance level we can reject the null hypothesis that the means are equal. And so the experiment users are, on average, on Netflix about 30 minutes longer than the control users. The new feature thus has some effect. Now, will they launch this feature? Well, that depends on other factors that may be influencing watch time.

This entire procedure of hypothesis testing while comparing groups, particularly a control group and an experiment group, is quite common. In my last video, while demonstrating the use of data science in finance, I conducted an analysis of customer churn for a company. The groups I compared were active users, users who currently do business with us, and churned users, users who don't. This is also a very common example.

I hope that the detailed example illustrated the steps to carry out a hypothesis test. The t-test is the most common, but there are many other types. For example, when comparing the means of three or more groups, we can use ANOVA; it's basically the t-test with multiple groups. The t-test has assumptions: the data should be normal, the variances of the groups should be similar, and the data points should be IID, independently and identically distributed. The first two aren't too important, but the third one is mandatory. If the data doesn't have the same variance and the sample sizes are uneven, then we can think about using Welch's t-test, which is a modification of the t-test for groups with unequal variances and sample sizes.

Tests like ANOVA and the t-test are parametric tests. They have strong assumptions, but when those assumptions hold, the conclusions they support are also quite strong. They also have non-parametric counterparts. For example, the Mann-Whitney U test determines whether the samples from two groups come from the same distribution. If we have multiple groups, the more generalized test is the Kruskal-Wallis H test. So the t-test is to the Mann-Whitney U test as ANOVA is to the Kruskal-Wallis H test. Hope that makes sense. There are endless such hypothesis tests out there, and I'm sure that just by reading the null hypothesis and knowing how hypothesis testing works, you'll be able to handle any situation coming your way. Perhaps in a future video, I'll go through some of these in detail. You can see all of these tests side by side in the sketch below.
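To make those analogies concrete, here's a minimal sketch showing each pair in scipy. The data is made up (three normal groups with slightly shifted means, my own assumption), so it's just a map of which function matches which test, not an analysis.

```python
# A side-by-side sketch of the tests mentioned above, on made-up data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
g1 = rng.normal(1.0, 0.5, 200)
g2 = rng.normal(1.2, 0.5, 200)
g3 = rng.normal(1.4, 0.5, 200)

# Two groups: t-test (parametric) vs. Mann-Whitney U (non-parametric).
print("t-test:         ", stats.ttest_ind(g1, g2).pvalue)
print("Mann-Whitney U: ", stats.mannwhitneyu(g1, g2).pvalue)

# Unequal variances / sample sizes: Welch's t-test.
print("Welch's t-test: ", stats.ttest_ind(g1, g2, equal_var=False).pvalue)

# Three or more groups: ANOVA (parametric) vs. Kruskal-Wallis H (non-parametric).
print("ANOVA:          ", stats.f_oneway(g1, g2, g3).pvalue)
print("Kruskal-Wallis: ", stats.kruskal(g1, g2, g3).pvalue)
```

But that's all I have for you now. So if you liked the video, hit that like button. If you're new here, then welcome, and hit that subscribe button. Ring that little bell for notifications when I upload. There are some cool links down below, so check them out. Still looking for your daily dose of AI? Then click on one of the videos right here for an awesome video, and I'll see you in the next one. Bye.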