This episode was inspired by a comment I stumbled upon in the comment section of a YouTube video about Scott Walsh's woodworking joint strength comparison. The commenter raised an interesting point about statistical analysis. They argued that comparing averages alone might not be the best way to determine the effectiveness of different methods; instead, we should consider the spread of the results and whether they follow a normal distribution. They also questioned whether removing the maximum and minimum values before averaging was a good idea. I found this comment fascinating because it opens up an opportunity to explore the intricacies of data analysis. Can we take the data at face value, or should we discard it because it doesn't adhere to a normal distribution? Was the use of a trimmed mean correct? These questions will be the focus of this video.

Before we delve into statistical analysis, let me provide some background on the test. Scott Walsh conducted various joint strength tests using a machine to pull apart joints. The metric was simple: how much force does it take to break a joint? He used five samples per joint and calculated the average after removing the minimum and maximum values.

I want to cover three points in this video. First, was excluding the maximum and minimum values from the average a bad idea, or incorrect? Second, what can we test with this data? Since there are only five samples per joint, can we come to any conclusions, or is five samples really not enough to know whether any joint is stronger than another? Third, how many samples should be collected if a follow-up test were done to test a hypothesis Scott Walsh raised in his video, which he called the glue-starved hypothesis?

Now, I understand that statistics can sometimes be overwhelming, but this is not your fault. I believe that the current approach to teaching statistics is flawed. I won't go into depth in this video, but if you want to reteach yourself what a hypothesis test and a p-value are, I suggest you check out my video called "p-values, beer, and mosquitoes." For now, let's take a look at this YouTuber's data and see what we can learn from it. Also, the audience for this video isn't necessarily statisticians; this video is for anyone on YouTube who's interested in these questions.

All right, was excluding the max and min values from the average bad? Let's address this question first. The max and min values were excluded from the average. Is this incorrect or a bad idea? No. Before I go into the details of why removing the min and max values isn't incorrect or bad, let's first take a look at the data. I've compiled all the data reported in the video and will share it on GitHub in the form of a CSV for those who want to explore it. There are eight joints and five samples per joint, giving us a data set with 40 rows. I can thumb through the rows to get a general sense of the data, but I might want to know what the general takeaway is. Some people might ask, what is the data saying? But this is the wrong question. Data doesn't say anything; it's up to the analyst to tell the story.

So what's the story here? This is where the trimmed mean comes into play. Some argue that trimmed means are bad because they discard information about the spread of the data. However, if that were true, then we should never trim the mean at all, which would mean we should also discard the median. Why? Because a median is just a trimmed mean where 50% of the data is trimmed: 50% from above and 50% from below, leaving just a single point, the median. You can think of it as a spectrum, where the mean is on one end, with no trimming going on, and the median is on the other; the trimmed mean is somewhere in the middle, and you can control how much trimming you want to apply.

If I had to choose between a mean, a median, or a trimmed mean, I would definitely pick the trimmed mean in this application. This is because the median is just too trimmed, the mean is not trimmed enough, and there were some outliers that needed to be addressed. So let's give Scott Walsh a gold star there. As a side note, a trimmed mean isn't the only option. We could have also gone with a winsorized mean. This is where we replace the extreme values with the next most extreme values: the maximum with the second largest, the minimum with the second smallest. This is a common practice in data analysis as well. You may be thinking, how are you allowed to just change data like that? Well, we are not so much changing the data as applying a transformation to it, one that is more robust to outliers while still preserving the sample size and taming the extreme values. Trimming and winsorizing are both valid options, and in this case I think either would be fine.
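To see that spectrum, and the two robust options, in action, here is a small sketch in Python. The five forces are made-up numbers with one deliberate outlier, not values from the video.

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

# Made-up breaking forces in pounds, with one deliberate outlier.
forces = np.array([410.0, 425.0, 430.0, 445.0, 890.0])

print("mean:           ", np.mean(forces))               # no trimming; dragged up by the outlier
print("median:         ", np.median(forces))             # maximum trimming; only the middle value survives
print("trimmed mean:   ", stats.trim_mean(forces, 0.2))  # drops 1 of 5 from each end, as in the video
print("winsorized mean:", winsorize(forces, limits=(0.2, 0.2)).mean())  # clamps extremes instead of dropping them
```

With one wild value out of five, the mean gets pulled far from the bulk of the data, while the trimmed and winsorized means land close together; that robustness is the whole point.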
Now you may be wondering: if I can just remove data or change values, why do I even need to collect data in the first place? Good point. Let's talk about that. Scott Walsh did one thing that made all of this okay. He showed all of the data. Showing all of the data keeps the dialogue open. Other people can take the data and do their own analysis, check his work, or push back if they think something was done incorrectly. This is kind of like a first commandment of data analysis: share all of your data, not just the average, not just the median or the trimmed mean. Show all of the data, always.

So that was a lot of talk about trimmed means and winsorized means as summary statistics. Now let's talk about what we can actually test, or know, from this data. Again, there were only five samples per joint. Can we really say that one joint was stronger than another with this data? Well, to answer this, we have to go to p-values. So given this data, let's calculate some p-values. Disclaimer: in no way was Scott Walsh trying to do a hypothesis test. He was just showing that all joints were strong enough for the hobbyist woodworker. But I'm going to put my neck on the line a little bit and make some risky arguments with this data. Specifically, I'm going to make an inductive argument. If you don't know what inductive arguments are and what makes them special, then I suggest you check out my video "p-values, beer, and mosquitoes." Please.

So maybe you've heard of p-values in the past, maybe you haven't. Maybe you had a unit on t-tests in a statistics class. In that course, you probably learned about p-values and the need for data to be independent and identically distributed, sometimes shortened to iid. You may have even used a complex equation to calculate the t-statistic and degrees of freedom, then compared against t-critical values to decide whether to reject or fail to reject the null hypothesis. This is a lot of work, and it's not necessary in the computer age. We can use permutation tests instead. Hopefully this example will show you how easy it is to do statistical analysis on your own data without having to remember the exact formulas and their associated assumptions.
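Before unpacking the intuition, here is roughly what that computer-age heavy lifting looks like. This is a minimal sketch, assuming plain NumPy; the function name and defaults are mine, not something from Scott Walsh's video or a standard library.

```python
import numpy as np

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided permutation test: the fraction of label shuffles that
    produce a mean difference at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    combined = np.concatenate([group_a, group_b])
    observed = np.mean(group_b) - np.mean(group_a)
    n_a = len(group_a)

    count = 0
    for _ in range(n_permutations):
        rng.shuffle(combined)  # pretend the group labels are arbitrary
        diff = combined[n_a:].mean() - combined[:n_a].mean()
        if diff >= observed:
            count += 1
    return count / n_permutations  # this fraction is the p-value
```

The shuffling trick is the entire test; what follows explains why it answers the question.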
So how do we calculate a p-value? Well, the way I like to think about p-values is in the context of a hypothesis test. You may have heard about the null hypothesis and the alternative hypothesis. I'm going to throw those terms out the window, and we're going to use different words. We're going to call one argument the skeptic's argument; the skeptic says you just can't really determine anything from this data. And we're going to have another argument, the optimist's, who thinks there is something you can say from this data.

So let's take a look at what the skeptic's argument would look like in this case. The skeptic says that there's no difference between the force needed to break two joints; let's take a butt joint and a dovetail joint. The optimist says that it takes more force to break the dovetail. So who is right, how can we tell from the data, and is comparing the means really sufficient? Well, we're going to have to do something more than just compare the means. There is some truth to the comment on Scott Walsh's YouTube video: there is something to be said for the spread of the data. So how do we take that into account? Usually the t-test would help you take this into account, but we are going to use computers to do the heavy lifting.

If the skeptic is correct, then the labels we put on the data are arbitrary. So we can relabel some of the butt joints as dovetails and some of the dovetails as butt joints, and recalculate the difference in the force it took to break them. If the labels really are arbitrary, as the skeptic believes, then we should see that the original mean difference is very common, not really an extreme value. If the optimist is right, then the labels are not arbitrary, and we should see that the original mean difference from the raw data is in fact pretty rare. Either we just observed some very rare phenomenon, or the skeptic has to admit that there is in fact a difference between the strengths of the joints. This probability of observing a mean difference at least as large as the original difference is called the p-value.

Now you may say, stop: you can't calculate a p-value on data that isn't normally distributed. Also, in my stats 101 class, I learned as a rule of thumb that you need at least 30 samples to calculate a p-value. Let's take the first point. You do need a normality assumption if you are using classical methods, but a permutation test doesn't rely on one. For the second point, you do need a larger sample size to detect smaller differences in force, but the 30-sample rule of thumb is just that: a rule of thumb. The force differences here are large enough that we can get away with a smaller sample size. We would not be able to do the same if we wanted to know whether one joint was just 1% stronger than another; in that case, we'd need a much larger sample size. And this is true in general. When you want knowledge about something very fine-grained, you have to pay for it with a larger sample size. That's just the cost of that knowledge. But when things are fairly obvious, when most people would agree even without the statistics that they are different, you don't really need a large sample size. I will talk more about how to calculate the number of samples needed for further hypothesis tests later on. So if you're interested in doing your own experiment but don't know how many samples you need to collect, stick around until the end of the video.

So what can we test with this data? We can really test whatever we want.
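As a concrete setup, here is how the butt joint versus dovetail comparison would be run with the permutation_test sketch from earlier. The forces below are placeholders I made up, not the measurements from the video (those are in the CSV), so the p-value printed here will not match the one quoted next.

```python
import numpy as np

# Placeholder breaking forces in pounds, five samples per joint.
# Swap in the real values from the CSV to reproduce the analysis.
butt_joints = np.array([285.0, 310.0, 260.0, 340.0, 295.0])
dovetails   = np.array([430.0, 520.0, 465.0, 400.0, 490.0])

p_value = permutation_test(butt_joints, dovetails)
print(f"p-value: {p_value:.3f}")  # a small value means the labels don't look arbitrary
```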
I just showed an example of how to test whether a dovetail is stronger than a butt joint. We found a p-value of 0.031. To me, this is more than enough evidence to say the skeptic doesn't have much ground to stand on: dovetails are stronger than butt joints. If you want to test other things, go for it. I think permutation tests are better for people who are new to statistics because they involve much less formula memorization, and you gain intuition about what kind of data makes a hypothesis fail.

All right, let's hit point number three: how many samples should be collected if a follow-up test were done on the glue-starved hypothesis? In the video, Scott Walsh mentioned that his butt joints and dovetail joints broke at a much lower force than he expected; they were much weaker than he thought they would be. His thought was that he made his joints so tight that there wasn't much room for the glue to hold the joint together. So how many samples should we collect? We would need 50 joints, which is 25 joints per group. What are we testing, specifically? We are testing whether a box joint or a dovetail joint that is not glue starved is at least 200 pounds stronger than the glue-starved alternative.

I will hand-wave the details of how this calculation works. Essentially, I run the same permutation test as above, but I put it inside a loop that tests different sample sizes from 5 to 65. Then I estimate the chances of making a false positive error and a false negative error at each sample size. I fix the false positive rate at 5%, meaning a 95% chance of avoiding a false positive, and pick the smallest sample size that gives at least 80% power, which means an 80% chance of accurately detecting a true effect or difference.
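To make the hand-waving a bit more concrete, here is a rough, self-contained sketch of that simulation. The 200-pound effect comes from the hypothesis above, but the 150-pound spread is a placeholder I picked for illustration; in practice you would estimate it from pilot data such as the original five samples per joint.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_p_value(a, b, n_perms=500):
    """One-sided permutation p-value for mean(b) - mean(a)."""
    combined = np.concatenate([a, b])
    observed = b.mean() - a.mean()
    count = 0
    for _ in range(n_perms):
        rng.shuffle(combined)
        if combined[len(a):].mean() - combined[:len(a)].mean() >= observed:
            count += 1
    return count / n_perms

def simulated_power(n, effect=200.0, sd=150.0, alpha=0.05, n_sims=200):
    """Fraction of simulated experiments in which the test detects a
    true difference of `effect` pounds with `n` joints per group."""
    detections = 0
    for _ in range(n_sims):
        starved = rng.normal(0.0, sd, n)     # glue-starved group (baseline)
        regular = rng.normal(effect, sd, n)  # properly glued group
        if permutation_p_value(starved, regular) <= alpha:
            detections += 1
    return detections / n_sims

# Walk through candidate sample sizes and stop at the first one that
# clears 80% power. n_sims and n_perms are kept small so the sketch
# runs quickly; increase both for a real analysis.
for n in range(5, 70, 5):
    power = simulated_power(n)
    print(f"n = {n:2d} per group -> estimated power {power:.2f}")
    if power >= 0.80:
        break
```

I hope this video was helpful in answering some general, basic statistics questions and in showing how to use computers as a replacement for formulas and memorization in your statistical analysis. Thank you very much. Thanks for watching.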