In some cases, data is collected from several sources: a treatment may be applied to different cohorts, data may be collected over several years, or gathered by different research teams. But when we aggregate the data, we lose information. We no longer know which cohort a person came from, what year the data was collected, or which research team gathered it. What could possibly go wrong?

For example, suppose an intervention is applied to the students in a school district. The aggregated data is shown. Should the intervention be applied generally? Since we want to decide whether or not to use the intervention strategy, we'll consider two quantities: the fraction of treated students who improved, and the fraction of untreated students who improved. Of the 121 students with an intervention, 46 improved, so the fraction of treated students who improved is about 38%. Of the 1073 students without an intervention, 391 improved, so the fraction of untreated students who improved is about 36%. It appears improvement is more likely with an intervention, so we should recommend using the intervention.

Or should we? Suppose the data was collected from two different schools. The raw data is shown; this is the data that was aggregated to form the table on the previous slide. Let's look at the schools separately. For school A, of the 38 students with an intervention, 11 improved, so the fraction of treated students who improved is a bit under 29%. Of the 681 students without an intervention, 206 improved, so the fraction of untreated students who improved is a bit over 30%. So it appears improvement is more likely without an intervention, and we would recommend no intervention for the students at this school. This may be a little surprising, since we recommended an intervention earlier, but ignoring facts and evidence is only something you can do if you're a politician. We should re-evaluate our decision when more information is available.
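The two comparisons so far can be reproduced in a few lines of Python. This is just a sketch using the counts quoted above; the `rate` helper is a name invented for this illustration.

```python
def rate(improved, total):
    """Fraction of students who improved."""
    return improved / total

# Aggregated over the whole district: treatment looks better
print(f"district treated:   {rate(46, 121):.1%}")    # about 38%
print(f"district untreated: {rate(391, 1073):.1%}")  # about 36%

# School A alone: treatment looks worse
print(f"school A treated:   {rate(11, 38):.1%}")     # a bit under 29%
print(f"school A untreated: {rate(206, 681):.1%}")   # a bit over 30%
```

The same two-line comparison runs on each table; only the counts change, yet the two tables point to opposite recommendations.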
In this case, the additional information that's available is which school we're looking at. A normal person would stop here, but mathematicians are not normal people. Let's take a look at school B. Of the 83 students with an intervention, 35 improved, so the fraction of treated students who improved is about 42%. Of the 392 students without an intervention, 185 improved, so the fraction of untreated students who improved is about 47%. So it appears improvement is more likely without an intervention, and we would recommend no intervention.

Wait a minute. In both cohorts, the group without an intervention did better, and so in both cases we would recommend against an intervention. But when the data was combined, the group with the intervention did better, and we would have recommended an intervention. This is an example of what's known as Simpson's Paradox. It's named after (no, not that Simpson, but not a bad guess) Edward Simpson, who first described it in 1951, though Karl Pearson and others noted the effect as early as 1899.

Now, Simpson's Paradox isn't really a paradox. It's a consequence of information loss as we aggregate data. In particular, the aggregated data lost the information about which school the students came from, which suggests that even more detailed information could be useful. For example, suppose school A's students could be divided into those who play sports and those who don't. The disaggregated data is shown. Of those who play sports, 17 had interventions and 5 of them showed improvement, so the fraction of treated students who improved is about 29%. Of the 473 with no interventions, 123 showed improvement, so the fraction of untreated students who improved is about 26%. So those with interventions did better, and we'd recommend interventions. Of those who don't play sports, 21 had interventions and 6 of them showed improvement, so the fraction of treated students who improved is about 29%.
Of the 208 without interventions, 83 showed improvement, so the fraction of untreated students who improved is almost 40%. Those without interventions did better, and so we'd recommend no interventions. It might seem that we can get any decision we want by choosing the right statistics. But that's how it should be: we should re-evaluate our decisions every time we gather additional data. If we know nothing about the students, the intervention is warranted. But if we know a student is from school A, the intervention is not warranted, unless we also know the student plays sports, in which case the intervention is warranted. There is no paradox.

Simpson's Paradox is an important thing to keep in mind whenever we try to make decisions based on evidence. Suppose you want to decide whether to undertake a course of action. The more you know, the better your decisions. In this example, we'd want to know which school a student attends and whether they play sports; knowing only the overall rate of success is not enough, and can even be misleading. Conversely, if someone has the raw data but refuses to share it, they don't want you drawing your own conclusions. Remember, the more you know, the more power you have, so be suspicious when someone refuses to share data without a good reason. "The data contains private information" is a good reason. "The data is proprietary" is not.
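The whole chain of reversals can be checked directly from the counts in the tables. A minimal sketch, assuming the counts above; the `rate` and `recommend` helpers are names invented for this illustration:

```python
def rate(improved, total):
    """Fraction of students who improved."""
    return improved / total

def recommend(treated, untreated):
    """Intervene iff treated students improved more often than untreated."""
    return rate(*treated) > rate(*untreated)

# (improved, total) pairs from the tables above
all_treated, all_untreated = (46, 121), (391, 1073)
a_treated, a_untreated = (11, 38), (206, 681)
sports_treated, sports_untreated = (5, 17), (123, 473)
no_sports_treated, no_sports_untreated = (6, 21), (83, 208)

# The recommendation flips at each level of detail
assert recommend(all_treated, all_untreated)                  # district-wide: intervene
assert not recommend(a_treated, a_untreated)                  # school A: don't
assert recommend(sports_treated, sports_untreated)            # plays sports: intervene
assert not recommend(no_sports_treated, no_sports_untreated)  # doesn't: don't

# The sports subgroups sum back to school A's totals: aggregating
# discards exactly the information that flipped the decision
assert (5 + 6, 17 + 21) == a_treated
assert (123 + 83, 473 + 208) == a_untreated
```

The assertions all pass: each finer split of the same students can legitimately reverse the comparison, which is why the raw, disaggregated data matters.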