Let's get started with looking for shaky insight. We're now in the business of defining potential problems for making inferences and aiming to be able to spot them. To get a better overview, let's take a systematic approach. Where could we look for shaky insight and weak spots? Here are three parts of the research cycle where we might find them: the questions we ask, the way we run experiments, and the inferences we then make.

Let's start with the kinds of questions we ask. Here's a problem for our approach of studying things empirically: how do you deal with things that you can't see, that are unobservable, like an attitude, a personality characteristic, pain, love, and so on? This is a problem that the social sciences in particular face. Many of the things we want to study in political science, sociology, or psychology are not readily observable. The fix, as we said before, is operational definitions. Love is whatever the answer to the question "Do you love this person?" is. But it's also obvious that this approach comes with a problem. You define the thing that you want to measure by how you measure it. That is pretty circular reasoning. If your measure is wrong, you measure something else without even knowing it. So you still don't know what the thing actually is. You're restricted to walking in socks because you don't know how to make a shoe. You might be wrong, but you don't know. This means we've now found the first area where we might find weak spots.

Let's move on to the way we run experiments. Here's another problem. When you run your experiment, even if you collect data from many, many participants, it is unlikely that you will test everyone who belongs to the group of people you want to learn something about. For instance, let's say you want to understand how the flu spreads. In your experiment, you can test adult women and children of all sexes. However, your sample does not include adult males, because you just don't have access to that sample. Nevertheless, you want to be able to draw conclusions about how this flu spreads that also include adult males. Now, is it okay for you to draw conclusions from your specific sample, that is, children and adult women, to the whole population, that is, children and both adult men and women? We need to be careful with drawing such conclusions. It might be that for some reason, the flu spreads differently in adult males. If that is the case, the results of our experiments run only with children and adult women will not hold for adult males.

Let's consider another version of this problem. In medical research, before a new drug or medical procedure can be tested for effectiveness and health benefits with human participants, it needs to be tested with animals. However, just because a certain drug works in lab animals, let's say in mice, it is not safe to assume it will also work in humans. That's why, before releasing new drugs for mass use, they are also tested on humans. This is another example illustrating why we need to be careful about drawing inferences from an experiment with a specific subsample to the population in question.

Relatedly, think about the fact that in experiments, there is some kind of intervention we're interested in studying. What if we had chosen a slightly different intervention? Would our results still be the same? Let's say our experiment tested the impact of online therapy for treating depression.
In our experiment, participants were either in a control group where they received no therapy or in an experimental group where they used an online platform to text with a licensed therapist. Let's further assume we find that the participants in the experimental group had fewer symptoms of depression after doing this for a month compared to the control group. But can we now conclude that all forms of online therapy work? Would texting with a robot also work? Would video chats work? We don't know. Also, can we conclude that online therapy works for treating all kinds of mental health conditions? Would it work for treating schizophrenia? Would it work for treating autism? We don't know. In other words, we need to make sure not to overgeneralize or overstate what we can truly learn from the experiment that we have run. So now we've found a few more weak spots, regarding specific samples and specific manipulations.

Now let's take a look at which problems lurk in drawing conclusions from our experiments. We call the next issue underdetermination of theory by evidence. If we observe that one of our predictions doesn't work out, we don't exactly know why. It's hard to pinpoint which part of our theory was wrong, so a number of different ways forward present themselves. What's more, you probably remember the hammer and nail metaphor, right? When you have a hammer, everything looks like a nail. In the context of theories, this metaphor can also apply. You might find several theories that are consistent with certain things you observe in the world. For instance, I can explain why men hold the door open for women from a perspective of signalling prosociality and nothing else, or from a perspective of benevolent sexism. Without knowing more about the situation, it is unclear which theory is true. If your belief system does not address gender roles, you'll choose the belief that prosociality explains the situation. So you choose the belief, the theory, that fits into your belief system. In science, this also means that we may currently rely on theories that fit with the most common belief system but are not necessarily true. For instance, the way we currently think about the functioning of human cognition, memory, and reasoning is often compared to how a computer works. Computers are a good metaphor to help us formulate theories about the brain. That doesn't mean these theories are necessarily correct. The brain might function in a completely different way that we simply cannot imagine at the moment.

Let's consider another reason why we might not want to trust the results of our experiment. Have you ever taken a medical test? Well, then you probably know that there is always a chance that the test is wrong. Even the best tests we have sometimes show results that are not true. For instance, a pregnancy test, although generally very accurate, may show that a certain person is pregnant although she really is not. We call this type of error a false positive. Or it might be the other way around: the test may show that a person is not pregnant although she really is. This type of error is what we call a false negative. This example shows us that whenever we test something, there is still a chance we might be wrong, even if we try our best to be right. It is possible that the effect we find in our experiment is not true. It is possible that what our experiment shows differs from what is truly there. That sounds pretty dramatic, right? So not even experiments can help us uncover the truth?
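To make these two error types concrete, here is a minimal simulation sketch in Python. The sensitivity and specificity figures are hypothetical, chosen only for illustration, not taken from any real pregnancy test:

    import random

    random.seed(1)

    # Hypothetical accuracy figures, used only for illustration:
    SENSITIVITY = 0.99   # P(test says "pregnant" | actually pregnant)
    SPECIFICITY = 0.95   # P(test says "not pregnant" | actually not pregnant)

    n_pregnant = n_not_pregnant = 0
    false_positives = false_negatives = 0

    for _ in range(100_000):
        pregnant = random.random() < 0.5       # simulated ground truth
        if pregnant:
            n_pregnant += 1
            if random.random() > SENSITIVITY:  # test misses a real pregnancy
                false_negatives += 1
        else:
            n_not_pregnant += 1
            if random.random() > SPECIFICITY:  # test flags a non-pregnancy
                false_positives += 1

    # These two rates are exactly the error types described above.
    print(f"false positive rate: {false_positives / n_not_pregnant:.3f}")  # ~0.05
    print(f"false negative rate: {false_negatives / n_pregnant:.3f}")      # ~0.01

Even with a test this accurate, a steady trickle of wrong results remains, which is the problem we turn to next.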
Well, let's not give up that easily. There are two things we can do to deal with this issue. First, we can use inferential statistics to assess how probable it is for us to find the effect we observe in our experiment if in reality no such effect exists. In the example of the pregnancy test, inferential statistics would help us understand the probability that the test shows the person is pregnant, given that the person taking the test is actually not pregnant. We would only find very low probabilities of making such errors acceptable. By the way, we can also calculate the probability that our experiment misses an effect that really exists. In the example of the pregnancy test, this would mean calculating the probability of the test showing the result "not pregnant", given that the person taking the test really is pregnant.

The other thing we can do to combat the problem of potentially detecting an effect that does not exist, or the other way around, is to rerun our experiment. In the pregnancy test example, this would mean taking another test. If both show the same result, this would increase our confidence in the veracity of the conclusion we draw. We would be more confident that our test indeed shows the truth. Relying on a single experiment to draw conclusions may be risky. Considering the results of multiple studies, however, increases our confidence in the conclusions we draw from this body of research. In other words, you don't really want to have to rely on a single experiment, because there's a chance it's just a fluke, determined entirely by chance. In many situations, we really want to know more. We want to know if we are onto something. If we ran our experiment again, would our result be the same? That is what we would want, right? This characteristic is called replicability. We're not the only ones who think it's important that our experiment is replicable. The philosophers Hempel and Oppenheim postulated that it is important for our findings to have law-like characteristics, derived from their long-run frequency of corroboration.

Now, of course, something can always go wrong, so we accept that sometimes our effect will not show up in an experiment. That's just how life goes. A one-time failure to replicate does not mean that we can be sure our initial finding was a fluke. But how do we decide whether we think our effect is due to chance or due to a law-like effect in such cases? The philosopher Meehl put forward two criteria that can help us with this. First, we should try to find out if there is clear other evidence for our effect. This is what Meehl called the money-in-the-bank criterion. The second criterion is the damn-strange-coincidence criterion: we should try to find out if the evidence for the effect is so convincing that it would be a pretty strange coincidence if the effect were not real. In other words, Meehl would recommend that we are critical in assessing a broad evidence base for the effects we try to test in our experiment. Running one experiment is not enough.

This means that we have now identified some other potential weak spots about experiments: interpretations that don't take the theoretical context into account, that overgeneralize, or that put too much weight on individual studies. To make it easier for you to spot shaky insights, here's a checklist for you. It contains all ten issues we have spoken about in this part of the course. Let me show you how to use the checklist in an example.
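Before we dive into that example, one quick aside to make the replication point concrete. If we assume, purely for illustration, that a single experiment has a 5% chance of showing an effect that does not exist, then the chance that several independent experiments all show the same phantom effect shrinks multiplicatively. A minimal sketch in Python:

    # An assumed, conventional false positive probability for one experiment.
    ALPHA = 0.05

    # If the experiments are independent, k positive results occurring by
    # pure chance have probability ALPHA ** k.
    for k in range(1, 4):
        print(f"{k} positive result(s) by pure chance: {ALPHA ** k:.4%}")

    # Output: 5.0000%, 0.2500%, 0.0125% -- one result alone could easily be
    # a fluke; several independent replications are much harder to explain away.

Now, on to the example.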
There is a CNN news article reporting about a study on how humans can survive on Mars. Part of the trouble with spending time in space is that the lack of gravity leads to muscle deterioration, which has negative health consequences for the astronauts. The article claims in the headline that antioxidants in red wine can combat these adverse health consequences. That would be really cool, right? So, let's find out more about the experiment. This example is really cool because you can freely access the publication of this experiment online. How nice is that? You can pause the video and take a look at the paper yourself, following the link provided.

If you look at the publication, you will quickly see that this experiment did not, in fact, take place on Mars. Instead, muscle deterioration was induced by making subjects not use the muscles of their legs. The subjects also did not, in fact, drink red wine. They ingested sugar water mixed with an antioxidant-rich powder. If you look further, you will also quickly see that the participants in the study were not, in fact, humans. The study used rats. Rats were hung from a mechanism that simulated gravity as it would be on Mars. Therefore, the rats did not have to carry their whole weight on their legs and their muscles deteriorated. Rats fed with antioxidants retained more strength in their legs than those that were not.

Now, this study is really, really cool. Finding a way to simulate Mars on Earth, that is a really cool accomplishment. But the conclusions we saw drawn in the news article reporting about this study are overstated. Let's go back to that article. The article draws a conclusion about humans on Mars from rats in a Mars-like situation on Earth. That means two of the points on our checklist come to mind. It is not clear if it is permissible to draw conclusions for humans from a study with rats. And it is not clear if it is permissible to draw conclusions about life on Mars from Mars-like simulations on Earth. So far, we haven't even checked the other criteria on our checklist. Already, though, we would classify the reporting in this news article as shaky insight. We could now also look at the original experiment in more detail to see if our checklist helps us spot other reasons why we should be careful about relying on the conclusions drawn.

Now it's your turn. Pause the video, read the linked article on the potential health benefits of matcha tea, and cross-check it against our criteria for spotting shaky insight. Reading this article, at first it sounds like you should pour yourself a nice matcha tea in the morning if you suffer from anxiety. However, the article also tells you at least three things that should make you wonder whether you can indeed draw this conclusion. We have to worry whether it is permissible to generalize from the subjects in the study, mice, to humans. And we do worry whether it is permissible to generalize from the procedure used in the study to other circumstances. Mice that drank matcha spent less time in a safe zone in a maze than those that did not drink matcha. Does this mean that drinking matcha for breakfast will get humans out of the house more? We can't be sure. Finally, the article suggests that matcha reduces anxiety. Now, anxiety is per se unobservable, so we need to look at the operationalization of it. You can't ask mice if they're experiencing anxiety, so the way to measure anxiety here was by how much time the mice spent in a safe zone.
The logic is that mice who are scared would spend more time in the safe zone. However, we don't know if this actually measures anxiety. Maybe the mice that didn't leave the safe zone as much just felt lazier and therefore didn't explore their surroundings. We don't know just from this article. All in all, this suggests that we are facing some shaky insight here.

In this part of the course, we've seen that there are some things to keep in mind when interpreting the results of our experiments. We have spoken about the challenges of measuring things that are inherently hard to measure, about evidence that may be consistent with several theories, and about generalizing our findings. Finally, we've spoken about false positives and about replications. In the next part of the course, we will talk about some other aspects of shaky insight. We will talk about practices that can make the conclusions you draw from your experiments questionable or even invalid.