As for the actual contents of this talk: it is called the Dirty Empiric Science. It's unfortunately not a perfect method, but it's still the best method that we've found so far. If I say there is still beer in the fridge, then you can believe me; but if you actually look into the fridge, that's science, because with science you test hypotheses and see whether they are actually true. Okay, sounds good, so where's the problem? Psychology can cause some problems with that approach, because there are a lot of studies that show a certain effect, but when people try to replicate these studies they get a different result. So we can't simply trust the studies that are being released. In this talk, however, we will hopefully learn how to figure out whether you can trust a certain study or not. Please give a warm round of applause for our speaker.

Good morning, everyone. Welcome to this talk. I thought I would first talk about what I can't do, which is drawing. That's a joke; I would rather talk about my motivation, why I want to give this talk. I hope you can read what's written on the slides; I will now read what is written in the bubbles. "Wow, so many people here. I'm dizzy, and if I fall over I'm sure there will be no one who will help me." Then another person says: "Why? Why do you think that?" "Do you know the bystander effect? If there are a lot of people, then no one helps. I've read this in a study that proved it." "That can't be. I was at the congress and someone did help me." "Well, that can't be true." "The study of Borgs and Pentra found that people in groups help just as much as people who are not in groups."
Well, as the saying goes: don't trust any statistics that you haven't forged yourself. We want to talk about how empirical studies actually work, what constitutes a study, what makes a study a good study, and how you deal with differing results. At the end we want to talk about what kinds of problems there are in science in general.

First of all, about me. I'm Anna Klingow. I'm a psychologist. I work at the University of Kassel in the section for human-machine interaction, or systems engineering, so we do a lot with technology and its interface with humans.

To keep this entertaining, I'm showing you a study with which we can go through everything by example. It is about facial feedback. The hypothesis is that it's not only our emotions that influence our facial expressions, but also the other way around. Here we want to state that smiling increases positive emotions, and that not smiling decreases them. When the study was released, there was one approach that said this is all based on cognition, on self-perception: when I notice that I'm smiling, I conclude that I'm happy, and this makes me happier. And there's another approach that is physiological; it says that the muscles that make you smile are connected with the emotions. Now the question is: how do you figure out which of these two explanations is the right one? There's a study that investigated exactly this.

As I said previously, trust is something different from understanding. If I just say yes, you can trust empirical studies, then we don't gain anything; but if we actually understand the studies, then we have a better basis for judging whether we should trust them. To clarify, I only want to focus on quantitative research here, and
that means testing a hypothesis that we already have. There are also other approaches, like qualitative research, where people first develop hypotheses.

So what do we need to do empirical research? First of all, we need a hypothesis, for instance: smiling increases positive emotions. And how could an experiment work? First, we have to find some study participants, and we have to manipulate them. That sounds bad, but it isn't; it's just the regular term for making sure that people do what they need to do for the study, in this case smiling or not smiling. Then we have to measure the effect of the manipulation. And don't worry, there are no formulas on the slides. In the end, of course, you have to use statistics to compare the groups.

In an ideal world we would test all people, so we would need to take everyone on this planet for our test. Of course this is not realistic, and that's why we need a large sample from the population that should ideally be representative of everyone we're interested in: young people, old people, men and women. This is not always feasible, and that's why you have to assess for yourself whether it's still reasonable to perform the study if you don't have an ideal sample, and you have to make some assumptions about whether the study would still work with a non-ideal sample.

It's also important to consider that we're drawing a random sample. If I have 100,000 people and I only pick 10, then this random pick has a certain influence: depending on where within the distribution I happen to sample people for my study, I will get different results.

In this study, 92 students participated, and here we wanted to investigate whether smiling is really causal for the effect.
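
The influence of the random pick described above can be illustrated with a small simulation. This is only a minimal sketch and not part of the study: the population of 100,000 mood ratings, its average of 5 and spread of 2, and the helper `sample_mean` are all invented for illustration. It repeatedly draws samples of 10 and of 1,000 people and compares how much the sample averages scatter.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch gives the same numbers on every run

# Hypothetical population: 100,000 mood ratings centred on 5
# (all numbers here are invented for illustration).
population = [random.gauss(5.0, 2.0) for _ in range(100_000)]


def sample_mean(n):
    """Average rating of n people drawn at random from the population."""
    return statistics.mean(random.sample(population, n))


# Repeat the "study" 200 times with a small and with a large sample.
small = [sample_mean(10) for _ in range(200)]
large = [sample_mean(1000) for _ in range(200)]

# The standard deviation of the averages shows how much luck is involved.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
print(f"spread of sample averages, n=10:   {spread_small:.3f}")
print(f"spread of sample averages, n=1000: {spread_large:.3f}")
```

With only 10 people per sample, the averages scatter much more widely than with 1,000, which is why a small sample can land far from the population value by luck alone.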
And it's really important that everything else is the same, because if anything else differs between the groups, then those differences could be the cause of whatever we observe. It's also important that you don't directly instruct people to smile, because this could make them think about their actions. We would rather have them do it implicitly, so that study participants don't fully understand why they're doing certain things.

So how do you trick people? One way is to have them hold a pencil in their mouth. For those in the audience who want to participate: you can take a pencil between your teeth and then try to move the corners of your mouth. And if you hold the pencil with your lips, you will notice that your muscles are affected and you can't smile anymore. This is a way to get people to smile or not smile without telling them.

The good thing is that with this manipulation we can find out which of the two theories works, because the cognitive theory can't apply here: when you ask yourself "why am I smiling right now?", your answer would not be "because I'm happy" but "because I have this pen in my mouth". So if people are still happier when their smiling muscles are engaged like this, then it's the muscles that are related to their mood.

So now we have to measure. One thing that is always problematic in psychology is that most of the things we want to measure can't be measured directly. Unlike in chemistry, for example, you can't measure emotions directly; usually you measure them by asking people. To make it comparable between different people there's a scale, for example: how happy are you on a scale from one to ten? And because our hypothesis is that emotions are heightened, we
need to make them happy beforehand, so in this study they watched a funny cartoon. Of course the study tested whether the cartoon is actually funny.

One term that is often needed is operationalization: I make things measurable that are not directly measurable. Using this questionnaire is one way of operationalization. It's important that measuring is always inexact. With every sort of measurement, even where you think it should be exact, like measuring the length of your table, you still have a little bit of deviation. This is even more pressing when I can't measure what I actually want to measure and have to operationalize it instead; that brings in more random error.

Now let me summarize the procedure so that everyone understands. Participants were invited. They had to hold a pen either between their lips, between their teeth, or in their hand; the hand is basically the control group, where there shouldn't be any influence at all. Then they watched a cartoon, all of them the same cartoon, and then they had to assess: what is my mood right now? And we had to tell them a cover story; we told them it's research for disabled people: how does writing with your mouth influence your emotions?

Now, the statistical comparison, and this is important to me. You don't have to be able to calculate the statistics here, but you should be able to understand how such a statistic works. What we want to do is compare the groups; in particular, how funny did they find the cartoon and how happy did it make them? Because we had more than one person per group, we use the average of each group. Now, can I compare these directly? Can I say: well, those that held it between the teeth
had a higher average, so I have basically confirmed my hypothesis? That is too easy; I can't do that. Why? Because there are these random errors: first, those that come from measuring, and second, those that come from sampling, because I haven't asked everyone but only a sample. So I have two different random influences. Now I have two values and I want to find out: are they different in a real way, or is it just random that they're different?

This is the original table from the study; the mean funniness ratings are the averages that were compared. Now we have to choose a statistical method, and there are a lot of different statistical methods that are suited to different uses. The most well-known is the t-test, which is a way to compare two means. But that doesn't work here, because we have more than two groups. That is why we did an analysis of variance, which allows us to compare more than two groups, and the particular variant we used was planned contrasts: I specify what kind of pattern I expect, and then I can test whether it holds or not.

I'm hoping to be able to explain it to you a little bit now. The basic question is: do the values that I have differ from each other in a real way or just randomly? So how probable is it that I'd get this pattern by chance? I made this graph, and you can see the three groups; they seem to be different. Now the question is: is it random or not? What I can do is calculate the variance. The variance is based on the difference between each of the values and their group mean. You then use a formula; it doesn't matter what the formula looks like. You get a value, and this is what we call within-group variance. We do this for each group. Next we compare the mean values of the different groups, and then we can compute another
variance to calculate the difference between the different groups. These two variances I can put in relation to each other. I can see whether the variance within a group is really large, so that there's great diversity within one group, and I can state whether the differences between the groups are large or small: is there a big difference between the teeth and the lips groups? If I look at the relation between these variances, I can determine, by using a table or a statistical procedure, whether the observed result could plausibly be due to chance or is a meaningful result from a statistical standpoint. The most crucial thing is that you ask: how random does my result look? For the planned contrast: I expect an elevated level for the teeth and a lower level for the lips, and then I can compare my expectations with what is actually observed.

So you want to see how likely it is that something is just by chance. Of course 100% certainty is an unrealistic aim for what we're doing here, but usually we say: if a result like this would turn up only about 5% or 1% of the time when nothing but chance is at work, then it's really unlikely that this just happened by chance, and there might be some effect. We call this probability P; this is the p-value. If this value is small enough, like less than 5%, then we say the result is significant. In this specific study we have a p-value of 0.03, so 3%: it's really unlikely that this happened just by chance, and therefore we can be more confident in saying that there is probably a statistical connection.

So what can we do with the results? The test is significant, so this is most likely not random, and the difference that we observe is most likely caused by the manipulation that
we have done. And because the setup ruled out the cognitive explanation, the physiological theory is more likely: the body theory and not the mind theory. But this is still not a proof in the sense of a mathematical proof; it's just support for the hypothesis, because this result can still be disproven by other, more elaborate experiments.

So now, of course, we want to talk about how you can trust empirical studies. We have seen how an empirical study works in principle; what makes a good study? What do we have to keep in mind? We should have standardized tests, so that the measures are established and have been tested by other people. What I really appreciate is when I get an explanation of why people chose specific tests, so that I can understand why certain decisions have been made. Something else that is very helpful is a manipulation check, to make sure that people are actually smiling with the pencil in the mouth, that is, whether the methods that were applied actually achieved what was intended. It's also helpful to compare the methods with those of other studies and see what kinds of approaches they use.

Something that's also very important is that you have a good sample. As a rule of thumb you want to have about 30 people per group, which is of course not always possible, but it is something you should strive for. A recurring problem is that most studies just use students as a sample because they are more available. This can be a problem when you want to measure things like political opinions, because students have a different political background than the general population. This is
something that you really have to be aware of, but if you look into something like smiling, then this is most likely not a problem. It's also important to see whether the experimenters have tested if the participants understood what the study was about, because you wanted to trick them into believing something else.

And of course you want proper statistics; you need the right statistical procedures. For instance, if you do a lot of small tests, this is not good, because you get a greater number of false positives: you allow five percent for each test, but all these five percents add up, and so the probability of wrongly saying that something is significant when it is not increases. Unfortunately, it's quite difficult to judge whether the statistical procedure is the right one for a particular study. What you can still do is look at what is being reported and what kinds of metrics are being used.

Correlation and causality: I guess most of you know this. You have to look at what kinds of conclusions are being drawn. A correlation is just a relationship between two things, but causality is a directional relationship, where you say A influences B. If you, for instance, see that storks and babies co-occur, you might conclude that there is a relationship, but if you actually do the experiment you will see that this is nonsense, of course. It's really important to look at what kinds of conclusions are actually allowed to be drawn. And if the temporal order is exactly the other way around, then of course B can't change A.

One hint for figuring out whether a study is good or not: most studies are published in journals, and journals have peer review. Peer review means other researchers are reviewing the
research, the study that was submitted to this journal. Of course this is not uncontroversial, because sometimes there are researchers who give better or worse reviews to specific people. But in general, if a journal does not have peer review, that's a bad sign.

Similarly, the impact factor. The impact factor belongs to a journal, and it says how often studies from this journal have been cited. This can be seen as a criterion of quality, but it's really important not to overstate its importance. The idea is: the more well-known a journal is, the more choice it gets, because more people will submit their papers, and then it can choose better studies. Again, that doesn't mean all studies in journals with a high impact factor are great, but it's still a good hint. So if you found a study that you find very interesting, you can look up the impact factor; everything with an impact factor of one you can cite.

So I tried showing you what a good study is, but now I want to ask: what do I do when I have different results, especially if the results are completely opposite to each other? I'm going back to the bystander effect example from the comic. It says: the more people are watching an accident, the fewer people actually help. The basis is the case of Kitty Genovese, who was murdered; there were about 40 neighbors who heard her crying out and who basically knew what was happening, and none of them did anything about it. Then John M. Darley thought: I really want to figure this out. So he did a couple of experiments: he simulated an accident, varied how many people were watching, and measured how long it would take them to help. This effect was actually found quite often, and it's been discussed a lot because it's pretty scary for everyone, but there are a lot of different results, so some
studies found this and some studies didn't. So what do we do? First, let's look at it more closely. Sometimes, when studies look like they contradict each other, there are just differences in the design that explain the different results; the results only seem opposite. For example, sometimes the samples are different. If one study uses children and another uses adults, it's possible that neither study is wrong, but that kids and adults are just different. Or you can look at the operationalization: how was it set up? Were there several people who were not in on the experiment, or was only one person not in on the experiment and everyone else was? This can make a difference.

So what can we actually do? We can do a meta-analysis or a moderator analysis. These are analyses that try to find out things like: does it make a difference whether it's children or adults? About meta-analyses: these are summaries of more than one study in a journal article. Usually the author of the meta-analysis tries to read all of the studies on a specific topic within a time frame, puts them together in a journal paper, describes the differences, and tries to calculate effect sizes. An effect size says how much two samples actually differ; it is a measure of how large an effect is. Effect sizes are standardized, so they can be compared between different studies. If I look at the means, then one person might have used a different scale, so I might not be able to compare them, but I can always compare the effect sizes. One example is Cohen's d, and there's a rule of thumb: a small effect would be 0.2, a medium effect would be about 0.5, and starting at 0.8 we would talk about a large effect. So, if you look at the effect
size, then you better understand how strong the influence actually is, and you can better understand what kind of question was asked.

I found a meta-analysis of this bystander effect, and you see that there are quite a lot of authors, because these literature reviews take a lot of work. You're encouraged to look into this study, because it's actually very interesting: they also talk about contradictions and make it easier to understand what the results actually mean.

Then there are also reviews, which are papers that try to summarize the current knowledge, or the current theories and hypotheses, in a research area. Usually they are written by researchers who work in that field, and they also try to explain, based on the knowledge the authors have, why there are contradictions between different studies.

These are the pieces of advice I can give you to better judge scientific results and publications. In the end, I would like to talk about what kinds of problems science, or scientists, have. I don't want to show you just the positive side of science. There are more problems in science than the three I'm going to talk about, but these are very important: p-hacking, the replication crisis, and publication bias.

So what is p-hacking? It means that you manipulate your data to obtain the desired result. The aim is to push the p-value below five percent so that you obtain a significant result. This way you trick the test into saying that what you found is statistically significant. The bad thing is that this is really hard for other people to detect. Only very few authors publish their raw data, and even then,
if you have the raw data, it's not always possible to replicate all the work that people have done to obtain the results. There are different approaches to deal with this. One is replication studies: I look at one study, try to replicate it, and see if I get the same results, which of course is not always the case. This way you can reveal p-hacking. But even if you find something different, it might not be that the author intentionally did anything bad; there might have been a mistake, and you have to contact the original author and talk about why you get different results. These replication studies are just an approach; they do not solve the problem itself.

Then there's the replication crisis, which is kind of strange, because I just talked about replication as a method to combat p-hacking. But there was an actual crisis in the field of psychology: a lot of studies, some of them very important and well recognized, could not be reproduced by others. What are the approaches to deal with this? The only real strategy is to plan the study better, to reduce the number of potential bad influences through thorough planning in advance. Sometimes it also makes sense to change the significance threshold. I talked about a threshold of five percent or one percent, but sometimes it makes sense to set it even lower to make a false positive less likely. But this problem is still not solved either.

The last problem I want to talk about is publication bias. The problem is that significant results tend to be published more often than non-significant results. So if you don't find anything new or anything significant, it's unlikely that you will be able to publish the paper in a journal. And if you have this bias, with most of the publications being ones that only
show positive results, then you get a different impression of what is happening in the field than if you knew more about... Okay. One approach here is meta-analysis: you do research on existing studies and try to apply corrections in the calculations to account for different factors, including potentially unpublished studies that had no significant results. When I create a meta-analysis, I also contact different authors and researchers who might have studies on the topic that were not published. The more I do this and the more responses I get, the better I can estimate how many non-reported effects there are, how many non-significant results have been observed that did not make it into a publication. Still, this problem is not resolved.

So I hope that I have answered the last couple of questions. But still the one big question remains: can we trust empirical studies? The conclusion is: yes, somewhat, but it depends. Different studies have different levels of quality, and it's very important that you assess the study quality yourself. That does not mean that everyone should be a scientist, but if you want to use something in a conversation, it's worthwhile not to rely only on the summary from popular science articles, but to go back to the source and form your own idea from the original publications.

It's really important that you understand probabilities, and that's why I included this methods part in the beginning. Many people say: a study found that something is like this, so it must be true. This is a common misconception, because in the end this is all probabilities; it's all about how certain we can be about some things. Basically, no study
is 100 percent certain. Then, of course, there are meta-analyses, which can give you a better overview of what has been published, and the same applies to reviews. And of course we still have these open problems in science and scientific work, and this is something that you should not ignore. The conclusion is really, as said in the beginning: science as it is right now is the best we have, but it's not perfect, and it's important to stay critical.

A small comment at the end, also with a view to locked-up science: all the studies that I cited here are freely available on the internet, and you can actually read them if you want.

Okay, thank you very much. Now we have time for questions. Please go to the microphones and I will try to coordinate the questions. But first of all, let's ask the internet. Where is the internet? Nothing from the internet. Then let's start here.

Hi. Regarding the error probability: you said sometimes one percent, sometimes five percent, and the study had three percent. My question is, where does this value come from? Are you doing the study, and someone at the end looks at it and sees a p-value of 0.5 or 0.4, so we're not significant, and are you then actually doing p-hacking? So where does the p come from?

Okay, so I didn't explain this precisely enough. It's important that before doing the study you decide which significance level you want to work with: do I want to achieve five percent, do I want to achieve one percent? This is my aim. Then I use the statistical methods and I get a value, and this might be, say, three percent. If I said I want to be below five percent and I observe three percent, then everything is fine; I have a significant result. And if I had said one percent and I observe three percent, then, unfortunately, it's not significant.
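
To make the answer a little more concrete, here is a minimal sketch of where a p-value can come from. It uses invented funniness ratings (not the data from the study) and the hypothetical helper names `f_ratio` and `permutation_p`. Instead of the planned contrasts and table lookup used in the study, it swaps in a permutation test: it computes the ratio of between-group to within-group variance, then reshuffles the ratings across the three groups many times and counts how often chance alone produces a ratio at least as large. The threshold `ALPHA` is fixed before looking at the data.

```python
import random
import statistics


def f_ratio(groups):
    """Between-group variance divided by within-group variance."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    means = [statistics.mean(g) for g in groups]
    k, n = len(groups), len(all_values)
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, means)) / (k - 1)
    within = sum((x - m) ** 2
                 for g, m in zip(groups, means) for x in g) / (n - k)
    return between / within


def permutation_p(groups, rounds=2000, seed=1):
    """Share of random regroupings with a ratio at least as large as observed."""
    rng = random.Random(seed)
    observed = f_ratio(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # pretend the group labels mean nothing
        start, shuffled = 0, []
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_ratio(shuffled) >= observed:
            hits += 1
    return hits / rounds


# Invented ratings for illustration only; NOT the data from the study.
teeth = [6, 5, 7, 4, 6, 5, 7, 6]
hand = [5, 4, 6, 5, 4, 5, 6, 4]
lips = [4, 3, 5, 4, 5, 3, 4, 4]

ALPHA = 0.05  # significance level, chosen before the data is analysed
p = permutation_p([teeth, hand, lips])
print(f"variance ratio = {f_ratio([teeth, hand, lips]):.2f}")
print(f"p = {p:.4f}, significant at {ALPHA}: {p < ALPHA}")
```

If `ALPHA` had been set to 0.01 in advance, the same p-value would be judged against that stricter threshold. Moving the threshold after seeing the result is exactly the kind of move the question is worried about.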
That's exactly what I'd say: if you set higher aims, and that means lower significance thresholds, then you can be more confident.

This doesn't really answer my question: where does the p-value come from? From independent statistics, or...?

Exactly, this is what you calculate. You compare the one variance with the other, you get one value, and you look up in a table whether that value is significant or not. Unfortunately, it is too complicated to explain in detail how the calculations are done, and I recommend that you look into a statistics handbook to see how they are done.

Okay, so we'll continue with mic four.

Is it conceivable that the different parts of a study are distributed across several people, so that you get a meta-study within one study, with some people doing one part of the work and other people doing other parts?

That's possible, and it does happen sometimes. Most of the time it's not the researcher himself who actually runs the study with the participants; it's the assistants. You can also give the statistical analysis to different people. I've often seen that you ask someone to check your results. So I'm guessing your idea would be possible, but it might be a lot of work.

Thank you for this really nice talk. I think another problem is that people outside of science, who are not in the field, don't get the results directly from the studies; they read newspaper articles or other things written by journalists. And even the journalists don't have it from the original source; they sometimes have it from the marketing department of the university. And sometimes there is really weird reporting that warps the original results.

Yes, that was more or less my intention: to encourage people to look at the original studies
and to look at them themselves, even though many of them are kind of off-putting to many people.

Are there possibilities to lay out the details of the study planning and publish them before you do your study, so that there is basically a commitment by the researchers, and they can't later say: we had set p at 0.1, but we actually got 0.3?

Yes, it's possible, and it's gaining more adoption. A lot of journals say that before doing the study you have to send in an abstract, and you have to commit in advance to a specific plan and to a specific threshold. I think this is a very reasonable thing, but it's not done enough at the moment.

I want to point out that significance is not really something that shows quality, but instead it's just a manifestation...

Yes, that is correct. If the results are significant, it doesn't tell you whether the study is good or bad or of high quality. There are also studies that are good despite finding no significant results.

Thank you. Thank you for this nice talk. I have a question regarding publication bias. In many sciences there's the problem that you need to publish a lot in order to get a position, to get a job. So how does psychology deal with that when you already have a problem with publication bias?

Yeah, how do people deal with it? Well, you still need to publish to have a good scientific career, and I don't know of any good approach to deal with this. It's really unfortunate that scientific studies that find something significant are more likely to be published. If I knew a solution to this problem, I would be very happy. I'm sorry that I can't offer a solution.

Thank you for this nice overview. For me, the last part was the most interesting. My question is how these three problems work together. It seemed like there were three problems, but in actuality all of these
three problems are basically interrelated, and they're related to how science works today.
Yes, of course there is a connection between these three problems. P-hacking can be reduced by replication, and then of course there are problems with replications too. We could talk about this for a long time, and I can also recommend the other talks about replication and replication bias that are currently being presented. But I didn't understand what I should say about publication bias. So how is p-hacking related to publication bias?
Because you have to publish studies, you do p-hacking.
Absolutely, that already answers the question. If only studies that have a significant effect are being published, then of course scientists are much more incentivized to change the data and to do p-hacking in order to have publications. But resolving publication bias is really, really difficult, because as scientists we're also not that interested in reading journals that contain so many things that don't have any significant relationship.
It's now 12:23. The internet has a question: how do you, as an author, deal with errors that you can only see after you've published?
In the ideal case you should state that something was wrong, and of course do a follow-up study to see if this incorrect measurement, or whatever it was, actually has a negative effect on the results. So far I haven't experienced this myself, and I haven't seen it with other people.
Thank you from me as well for the lecture. I will briefly go into the challenges you mentioned at the end: data is missing, and replications sometimes cannot be made. Are there approaches where journals ask you to publish your data, or require that you conduct the study more than once before publishing?
I'm not aware of any requirements to do a
certain study twice, but there are some scientists who want to publish their data on their own initiative, and they publish the data on the internet. I'm not aware of any journals that really require that the data be published. My impression is really that it's the scientists who want to publish, and who want to support science as a whole by publishing the data.
Have you seen that people basically present a wrong topic, so that, for example, the study is said to be about disabled people but is actually about something else? Do you think that in psychology many participants assume that you are not lying to them?
No, I don't think so. I mean, these cover stories during the study: you want to deceive the study participants about what the study is actually about while the study is ongoing, but in the end you tell them what the study was actually about. Also, in some studies the researchers ask the people if they can guess what was actually tested. If people realize what's actually going on, then it might be reasonable to leave these people out, because understanding the test can influence the results.
There's a question: could you please say something about funding bias? So basically, who gets the funding and how it influences research.
Yeah, that's a big topic for research. Unfortunately I cannot say much about it, simply because I have only been at the university, and I didn't have to deal with specific funders, because my research so far has been completely funded by the government.
Okay, my question is: is there a rating system for studies, or could we set something up, so that everyone can see which scientists have already looked at the study and given it grades, so that there are at least a couple of people who have actually read the study and seen that it was done right?
Yes, there is something: you can look at the citation count. If a lot of people have cited the study,
then hopefully everyone has read the study, and you can somehow conclude that the study is probably good. But if you established a rating system, how would you agree on what kind of ratings to give to papers? It's very subjective. Of course we have peer review, and it goes in that general direction, but I don't think a rating system is something that you can set up objectively.
And this is the last question now.
A short question regarding meta-studies: how many studies should you use for a meta-study? For example, I think there were a couple of hundred studies in one.
It depends on the field. If the research field is relatively new, then of course there are not many studies, and you cannot use many studies for your meta-study. But the bystander effect is something that has been researched a lot, so of course I expect a lot of studies to be part of the meta-study. So I don't think there's a general guideline that you can use to say how many studies there should be. I would rather focus on how they selected the studies for the meta-study: did they call the office, what kind of search terms did they use? In these meta-analyses they usually state what kind of strategies they used to find studies, so that I can assess for myself whether they actually used a valid strategy, that is, a strategy to find everything that I deem relevant.
Okay, and now this