Okay, so hello everyone. Welcome to this workshop, Visible Learning: Best Practice or Boondoggle? We are going to talk about the challenges of assessing a meta-meta-analysis. My name is Andrea Kalmendal and I am a PhD student at Linnaeus University in Sweden. My main work is on the meta level in psychology and educational research, so I do this assessment of Visible Learning, but I also conduct meta-analyses and am building platforms for sharing data within educational research. With me today is Thomas Nordström, who works as a senior lecturer at the same university. This project is also together with Rickard Carlsson, who is joining us in the audience here.

So let's start. Today we are going to talk a little about what Visible Learning is, for those of you who have not been in contact with it before. We will talk about what a meta-meta-analysis is and what it focuses on, pitch some of our own articles, obviously, throw in the current state of educational meta-analysis, and then wrap up with the workshop part, where we plan to go through the code sheet and all the aspects we consider when assessing this huge work, followed by some conclusions. I think it is easiest to take questions directly, so just write in the chat, or say something, if you need more explanation at any point.

All right. In 2009 John Hattie released the meta-meta-analysis Visible Learning, which at the time summarized 800 meta-analyses into 138 possible influences on student achievement. But Visible Learning grew quickly through several updates, and by 2021 it contained 322 influences based on over 1,800 meta-analyses. Now, in 2023, a sequel is about to be released. The idea of the list is that the influences are all coded to a standard metric, Cohen's d, and ranked by the size of the effect. The list ranges from negative influences, like retention, through small effects, like student personality according to the list, up to strong influences on student achievement, such as response to intervention. The ranking is now used in over 23 countries around the world, and the original 2009 book has received over 22,000 citations on Google Scholar, so it is easy to say that the impact of this work cannot be overstated. It is also safe to say that Hattie's work contributed to a more evidence-based focus in educational research, especially through its summaries of all these quantitative analyses. It was nevertheless met with a lot of skepticism from the start and was considered very controversial. The criticism was fierce; some of it went as far as calling the work pseudoscience, due to the lack of scientific method and statistical coherence in the way the synthesis was conducted.

So, a meta-meta-analysis, which is what Visible Learning resembles, is also known as a second-order meta-analysis, an overview of reviews, an umbrella review, or simply a meta-analysis of meta-analyses. One of its major statistical goals is to determine how much of the variance in mean effect sizes across different meta-analyses of the same relation is due to sampling error, and to use this information to improve the estimate in each individual meta-analysis, so that the resulting effect sizes are more accurate (Schmidt and Oh). However, there are no methodological standards for how such a synthesis should be reported.
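To give a feel for that statistical goal, here is a rough sketch of the second-order logic as Schmidt and Oh describe it, in generic notation of our own (none of these symbols come from Visible Learning). Given $m$ meta-analyses of the same relation, with mean effects $\bar{d}_i$ and squared standard errors $SE^2_{\bar{d}_i}$, the observed variance of the means is split into real variation and second-order sampling error,

$$\hat{\sigma}^2_{\text{true}} = \widehat{\operatorname{Var}}(\bar{d}_i) - \overline{SE^2_{\bar{d}_i}},$$

and each individual mean is then improved by shrinking it toward the grand mean $\bar{\bar{d}}$ in proportion to how much of the observed variance is real:

$$\hat{\delta}_i = \bar{\bar{d}} + \frac{\hat{\sigma}^2_{\text{true}}}{\widehat{\operatorname{Var}}(\bar{d}_i)} \left(\bar{d}_i - \bar{\bar{d}}\right).$$

Note how far this is from simply averaging a column of effect sizes.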
We work a lot with materials from Cochrane, which we, like Campbell, try to translate into educational research. Cochrane states that an overview should not simply be a summary of systematic reviews but should integrate and synthesize the evidence, and that one should not really try to rank the interventions. To produce an accurate overview you need to specify your PICOs very narrowly, and where reviews contain overlapping data, the recommendation is to use only one of them: the most recently published, the one of highest quality, or the one with the most outcome data. An overview is used to map the available evidence, or to re-analyze specific data or specific subgroups; there is no really satisfying method for integrating results across different meta-analyses in the way this work does.

So, Visible Learning as presented looks like this. This is an influence called reducing class size, and this is the data Visible Learning presents online as the material for its analysis. They have gathered eight meta-analyses on the topic of reducing class size, and the only things they actually present are the journal title, the authors, which country the authors come from, the article name, the year, the variable that was supposedly extracted from the meta-analysis, the number of studies included when calculating this variable, the number of students, the number of effects, and the effect size. What one can see straight away is that data points are missing from the start: some of the rows actually state that zero students were included. That could of course be a typing error, but when the full confidence rating of the synthesis is based on, for example, the number of students, and you do not report the actual number of students, the overall confidence measure does not really make sense. This is how trust is built up in Visible Learning: the more the merrier; it is always the highest number that is best, not the content per se. It is also worth mentioning that the way they synthesize is simply to take the effect sizes of all included meta-analyses, add them up, and divide by their number. There is no weighting or anything like that, and no confidence intervals, so it is a very loose way to present data.
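To make concrete how thin that synthesis is, here is a minimal sketch in Python; the eight effect sizes are invented for illustration and are not the ones in the class-size table.

```python
# Hypothetical mean effect sizes from eight meta-analyses on one influence
# (numbers invented for illustration only).
mean_effects = [0.05, 0.10, 0.21, 0.13, 0.30, 0.09, 0.17, 0.25]

# The Visible Learning synthesis, as far as the published material shows:
# an unweighted arithmetic mean of the meta-analytic means, with no
# weighting by precision or sample size, no confidence interval, and
# no assessment of heterogeneity.
vl_effect = sum(mean_effects) / len(mean_effects)
print(f"Visible Learning style average: d = {vl_effect:.2f}")  # d = 0.16
```

A weighted alternative over the same invented numbers is sketched in the conclusions further down.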
Yeah, Thomas, would you take over here? Yes. With that in mind, our group has also scoured the area for systematic reviews with meta-analyses, trying to say something about the current state of educational meta-analysis. That is not part of this specific workshop, but it is worth mentioning the work we have done. Given the varying standards and the varying quality of the meta-analyses Andrea just demonstrated, we were interested in whether there are any systematic reviews with meta-analyses of high quality according to, for example, the Campbell organization's definition of methods, or the clearinghouses that work on similar topics, because Campbell, for example, requires quite a lot of rigor in the review process, as well as transparency in reporting, sharing of data, and so on. So we did a small study over the last couple of years where we searched only for meta-analyses that evaluated effects of methods, interventions, and instruction. This came up with 88 papers that met the inclusion criteria; we also hand-searched the major review journals, so we had quite a good sample to assess. Out of these, only 11% were assessed as having low risk of bias, which is rather remarkable. Yes, we found meta-analyses of high quality, but most systematic reviews with meta-analyses in this field have a high risk of bias, and that is of course not good. We hope for improvement in the coming years, and there is a link to the preprint where you can read a bit more. So yes, we are interested in finding good quality among systematic reviews in education. You can change the slide there, Andrea.

Back to this project and the workshop we are holding now. The background is that it began with Rickard, who is also participating here, and me going through Visible Learning, because it is quite a remarkable piece of work: the largest collection of meta-analyses in any discipline. We found a great number of errors and bad practices, and we won a grant from the Swedish Research Council to critically scrutinize the list and produce more rigorous evidence, for example by writing a best-practice paper, recalculating the effects from Hattie's list, and producing some new systematic reviews of high quality. But due to the severity of the flaws we found in Visible Learning, and they grew as time went by, we had to abandon the idea of recalculating the effects. We had the naive idea that we would fix Hattie's list, and we could not do that. So instead we went on to try to reproduce the list, based on the statistics Hattie himself provides in the book.

We started by reading up on what has been published, including the criticism of Visible Learning, and we ended up wanting to ask: what is the quality of the aggregation and coding of the effects from the included meta-analyses for the influences in Visible Learning, with regard to a set of quality indicators? The indicators come out of the general criticism that has built up against the list. For example: to what extent are the PICO, that is the participants, intervention, comparison, and outcomes, of the included meta-analyses relevant to the actual definition that Visible Learning states? There are previous examples, and we will see one shortly, where they do not really match. Also, to what extent have randomized controlled trials, quasi-experimental studies, and observational studies been mixed, which is a mistake when comparing effect sizes, and, by the same logic, effect sizes from different designs: within, between, and correlational? The main focus is whether we can actually reproduce the reported statistics in Visible Learning from the information given, and, additionally, to what extent the meta-analyses behind an influence span a wide range of publication years that has been mixed together.

The first step was of course to organize all the influences with their associated meta-analyses. We have taken the latest update, 1.11, which included around 1,900 meta-analyses across 322 influences. We searched through various databases and collected all the papers we could find. Some are still missing, because the list includes a lot of unpublished doctoral theses, conference papers and so on, even master's papers. We will go into more depth about what is going on here in a bit. We also want to combine the type of assessment we are doing now with a relevance rating that we are producing together with an expert panel of researchers working in educational research all over the world.
The panel has assessed each influence by relevance: whether it is implemented in their educational system, whether it would be wise to implement, and so on. We want more information about these influences than just an effect size, because comparing them as the list stands is very misleading.

So we can start with the first question we are asking. When we start coding, we base it on both Simpson's and Bergeron's critical approaches to the list. They found examples of analyses that did not share PICOs and that focused on different outcomes, designs, and codings. What we wanted to see is to what extent this holds across the whole list. Previous criticism has also been accused of cherry-picking, and of course you can find one bad study in a work that includes close to 2,000 meta-analyses, that is not the problem; the question is to what extent we can see these issues in general.

The first thing we do is read up on the definition given by Visible Learning. In this case we will take you through our coding of reducing class size. The definition is that it reduces the number of students in the class, often with the aim of increasing the number of individualized student-teacher interactions to improve student learning. The population should obviously be students, and what we found was that all papers included the broad term "students". It was, however, a broad range of students, everything from kindergarten to college, and that can per se be a hard crowd to lump together: it is not unusual to have bigger secondary-school classes than kindergarten classes when it comes to learning. But as long as they defined their population as students, that is fine, and we found that all of them did.

Regarding whether this was an intervention or an exposure: since the wording "reduces the number of students" implies an active choice, we both judged that this should be implemented as an intervention. However, some meta-analyses were not actually examining an intervention but merely looked at students being exposed to smaller class sizes, and how that correlated with progression or student achievement of some kind. In that sense there is a decisive "no" on relevant inclusion for some of them, simply because they are not interventions.

We also looked at the control and comparison groups in all the meta-analyses. All the interventions had comparison groups, which is good. The correlational or regression studies, however, had no comparisons at all. And when we read up on the actual comparison groups in the intervention studies, we could see that they varied: some papers defined a large class as 20, while other studies defined a small class as 20 and below. So across meta-analyses we end up with comparison groups and intervention groups that are actually the same size. It is hard to see any change there, obviously, because nothing is happening, and merging those effect sizes together makes no sense.

When we looked at the outcomes, some were only related to year progression as a broad term, while others were focused on a specific test, for example reading words. Those interventions had a specific test tied to the intervention, while the broader papers had progression, or general achievement of some kind, as the end goal. That is not a problem per se: you can absolutely measure student achievement by year progression. But that outcome is not comparable to more points on a word-reading test, and combining the two makes the result hard to interpret.
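As a simplified illustration of what our code sheet tracks for each included meta-analysis, something like the following sketch; the field names and example values are ours, not the actual instrument, which contains more items.

```python
from dataclasses import dataclass

@dataclass
class CodeSheetEntry:
    """One row per meta-analysis included under an influence (simplified sketch)."""
    influence: str            # e.g. "Reducing class size"
    meta_analysis: str        # author and year of the included meta-analysis
    population_matches: bool  # P: participants are students, per the definition
    is_intervention: bool     # I: an active intervention rather than mere exposure
    has_comparison: bool      # C: a comparison group exists at all
    comparison_note: str      # how "small" and "large" classes were defined
    outcome_matches: bool     # O: outcome is student achievement as defined

# A hypothetical row reflecting the kinds of problems described above:
example = CodeSheetEntry(
    influence="Reducing class size",
    meta_analysis="Example et al. (hypothetical)",
    population_matches=True,   # all eight used the broad term "students"
    is_intervention=False,     # exposure/correlational rather than an intervention
    has_comparison=True,
    comparison_note="small class = 20 and below here; large class = 20 elsewhere",
    outcome_matches=False,     # year progression mixed with word-reading tests
)
```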
And we should say that Hattie aggregates everything into a single effect size, regardless of how many outcomes there are in the individual papers, and so on; everything we have mentioned is always aggregated into one unit, one single number. And as we saw earlier with the confidence rating, the more types of outcomes, the higher the confidence. So a very broad list of outcomes is matched together into one single effect size, which makes it almost impossible to interpret. What does that number mean for me if I want to work on improving my students' reading and all the outcomes come from math tests? It is hard to interpret the effect size. So far, this is what we see here.

Going further, we also code what types of study designs are included and what types of effect sizes. This influence actually ticked all the boxes, and that is not a good sign, I would say. In general terms, in order to settle causality between variables, it is a good idea not to mix observational and experimental study designs; correlation does not imply causation, that is a fact. Mixing them without having any of this in mind is not good. It is very hard not to be too critical when working with this material, but when too much stacks up, you start to wonder about general research ethics.

On the effect sizes, we could see that between-participant, within-participant, and correlational effect sizes were mixed. The problem here is that since Cohen's d is expressed in standard deviations, the different calculations can lead to a wide range of effects. A study might, for example, have a large between-participant standard deviation but a small within-participant standard deviation, and that will alter the effect greatly. They also answer two different questions: whether you compare two groups with each other, or a single student with a pre- and post-test. And some areas usually work more with one type and less with the other; in reading research, for example, Thomas, you have talked a lot about within-participant effect sizes. Yes, and the point is not that one effect size type is better or worse; it is mashing them together that is the bad idea. Different designs answer different types of research questions, and they cannot be matched into a single number. That is basically the problem here.
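To spell out why the denominator matters, compare the standard formulas; both are called d, but they standardize by different quantities (these are standard textbook formulations, not notation from Visible Learning):

$$d_{\text{between}} = \frac{\bar{X}_1 - \bar{X}_2}{SD_{\text{pooled}}}, \qquad d_{\text{within}} = \frac{\bar{X}_{\text{post}} - \bar{X}_{\text{pre}}}{SD_{\text{diff}}}, \qquad SD_{\text{diff}} = SD\sqrt{2(1 - r)}.$$

With the same raw mean difference, a pre-post correlation of, say, $r = .875$ gives $SD_{\text{diff}} = 0.5\,SD$, so the within-participant d comes out twice as large as the between-participant d even though nothing about the intervention differs.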
Further on in the code sheet, we also code the publication years, when these articles were published, and what types of articles they are. Critics have noted that Cochrane, for example, states that you should use the latest meta-analysis, since that should be the most up-to-date one, and some critics also argue that using content that is not peer-reviewed can be an issue. What we see is that if you have multiple meta-analyses conducted on a certain topic across different decades, it is not unusual to see the same primary paper coming up in every meta-analysis. We saw this in reducing class size as well: some original papers were included in several of the different meta-analyses, so their data is counted several times, which gives those studies a weighted bias in how the result should be interpreted. Nothing controls for this in the Visible Learning synthesis, and in the end, counting the same studies twice is not a good idea.

As for the type of article, seven of the eight were published as journal articles and one was a report from the European Commission. I do not have too much to say about that, but peer review is a good filter, and including unpublished theses or master's theses that have not had any kind of peer review is debatable, at the least. Yes, for sure. And we can say that the report from the European Commission was not even a meta-analysis; it was just a report about different meta-analyses, so there was no effect that could be extracted from it, which is a bit strange, since Hattie claims that there was an effect from that report.

That brings us to the last issue in this type of assessment: we try to reproduce all the figures and numbers extracted from the meta-analyses. What we do first is randomize the order in which articles are coded, so every influence has one randomized sample that we start coding with. In reducing class size we ended up with the European Expert Network report by Leuven and Oosterbeek. It was published in 2018, the item extracted was reducing class size, and the list claims 16 papers, zero students, 16 effects, and an effect size of 0.10. It is very important to remember that we are not assessing the meta-analyses themselves; we are only checking whether we can reproduce the numbers Hattie extracted from these analyses into his synthesis. This has nothing to do with how the underlying meta-analysis or its synthesis was conducted; we do not assess that per se. But since this source was not even a meta-analysis, merely a report, there was no synthesis in it, and we could not find any effect size matching the one that was extracted.

When we do the effect size extraction we have three different choices: yes, we find it and it matches the PICO; yes, we find it, but it does not match the PICO, because sometimes we can locate the effect size they extracted but it does not belong to the PICO that the influence definition points toward, for example the participants are not students, the outcome is different, or it is just a general effect size across all the variables presented, which has nothing to do with student achievement; or no, we cannot find it at all.
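Those three outcomes might be captured like this; a minimal sketch, with labels of our own choosing.

```python
from enum import Enum

class ExtractionOutcome(Enum):
    """Can the effect size extracted into Visible Learning be reproduced?"""
    MATCH = "found in the source, and it matches the influence's PICO"
    FOUND_NO_PICO_MATCH = "found in the source, but it does not match the PICO"
    NOT_FOUND = "cannot be located in the source at all"

# The class-size example above: a report with no synthesis to extract from.
leuven_oosterbeek_2018 = ExtractionOutcome.NOT_FOUND
```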
In this case, though, we could not find any effect size, nor could we find the number of studies it was based on, or the students; since that figure was zero, we do believe students were involved in the underlying work from the start, but there was nothing to extract here, and the number of effects was also very hard to extract. And the point of looking at the total synthesis is that this row still adds more studies to the average and adds population to the population size, which in this case increases the confidence of the synthesis made by Visible Learning. But if you cannot get the numbers right, that confidence is very hard to take seriously.

When we code the articles for an influence, we code until failure, because coding everything would be far too time-consuming. We start with the randomly sampled first article; whenever we can reproduce it, we continue with the next, until we hit a study that fails, and the rest of the coding is then based on that failed study. We also record metrics: how many studies we coded, the total number of studies, and how many papers are missing, because a lot of papers are missing, due to errors such as the wrong type of article or the wrong name, or because they are unpublished literature we cannot get hold of, or published in book form, and so on. The material presented is simply not completely available. In general, open-access data is what we strive for, and this data set is very far from it: no one really knows where the figures come from or how the statistics were extracted, and there are no guidelines at all from the Visible Learning team. So it is hard to reproduce; it takes a lot of time to check for subgroup analyses that could match the PICO, and there is a lot of judgment in this type of reproduction. We are actually quite generous in our assessment; we want as much of it as possible to work, but sometimes it is simply overwhelming to figure out where to find a number or how to arrive at the same means.

So, in conclusion, our preliminary conclusion right now is that the Visible Learning aggregation does not work at all. What we have come to understand is that if you want to do this, you should conduct a new meta-analysis instead of synthesizing effect sizes from several meta-analyses: just take the same studies and conduct a new meta-analysis, a new aggregation, and follow reporting standards so that the relevant information is easy to find. And if you do want to do a second-order meta-analysis, a meta-meta-analysis, you have to define the PICOs for your synthesis very narrowly; it is very easy to end up with such a broad sample that, once you mash it together, it does not speak for any of the populations included.

Do not mistake a statistical effect size for practical importance and relevance; that should be a given, but a high number says nothing without context. I have not talked much about that, but it was the point of the first book: Hattie ranked the relevance of each influence by the size of its effect size, and that makes no sense at all, because you cannot pick among influences based on how large their effects are. That is the background to that point; we did not talk about it so much during the introduction. That is true; the idea of mistaking a big effect size for a better influence is a conceptually wrong assumption, especially for an influence like the one we have gone through. Reducing class size from what, 50 to 30, 30 to 20, 5 to 1? It is very hard to say anything without that information, and whether you should reduce your classes is a conceptual question; a Cohen's d of 0.5 does not tell you anything independent of your own class. So please do not rank effect sizes by how high or low they are. And do not synthesize meta-analyses simply by calculating the plain average of the included analyses; there are far better ways to synthesize meta-analyses than this, and if you are eager to do it anyway, at least some weighting, or some comparable PICOs, should be included.
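As a minimal sketch of what "some weighting" could look like, reusing the invented numbers from the earlier sketch plus made-up standard errors; a serious attempt would use a random-effects model and deal with overlapping primary studies first.

```python
import math

# Hypothetical meta-analytic means and standard errors (invented for illustration).
effects = [0.05, 0.10, 0.21, 0.13, 0.30, 0.09, 0.17, 0.25]
ses     = [0.08, 0.03, 0.12, 0.05, 0.15, 0.04, 0.06, 0.10]

# Inverse-variance weights: precise estimates count for more than noisy ones.
weights = [1 / se ** 2 for se in ses]
d_bar = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
se_bar = math.sqrt(1 / sum(weights))

print(f"weighted mean d = {d_bar:.2f}, "
      f"95% CI [{d_bar - 1.96 * se_bar:.2f}, {d_bar + 1.96 * se_bar:.2f}]")
```

For these invented numbers the weighted mean lands at 0.12, noticeably below the plain average of 0.16, which is exactly the kind of difference the unweighted approach hides.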
So, this is more or less what we wanted to bring up for you. If you have any questions about this work, or if you are working on a meta-meta-analysis yourself, or are interested in the topic, now is the time to talk. Nothing written in the chat, but perhaps you have some questions.

Do we have a mainly American audience here? Yeah. Visible Learning has had quite an impact in Europe, for example; what is the current state of Visible Learning in the United States, given that Hattie is not American? Yes, that is my impression as well, that he is not as popular in the U.S.

We also got a question in the Q&A: does he know about this research, and is it impacting the 2023 release? We have been trying to get in contact with him, but we have had no success; he is usually very good at dodging critics in general. From what I saw of the outline of the 2023 release, there is a chapter actually called "Criticism", but since it is being released now, and judging from the latest update on the Corwin web page, I do not see that these issues are going to be handled at all; nothing points in that direction.

More questions? We talked about all the challenges of assessing a meta-meta-analysis; is a meta-meta-analysis ever possible to do effectively? I would say yes, but in that sense it is more about describing the content, and in that case perhaps doing subgroup analyses of studies within the meta-analyses. The recommendation we have seen from Cochrane and Campbell is that if you want to do it, do not just make a new synthesis; make it more exploratory, presenting the included reviews in more of a report form rather than computing a new combined synthesis. That is the tip we have seen so far.

All right, I guess we have covered all the questions. Does anyone else have a question, a reflection, or some thought about this? Yes, we have a question, or more of a comment: "I think this also has implications for peer review of meta-analyses." We might add to that: since Visible Learning is a book, it has not undergone peer review, if we were not clear about that, and I think Hattie dodges a lot by publishing everything in book form, on web pages, and so on; that has not attracted enough criticism. But I can also agree that having this type of assessment in mind while assessing a meta-analysis is good practice: actually looking at the PICOs and comparing the outcomes you are synthesizing, especially when you are reviewing a meta-analysis. There are some good points here; you could write a best-practice review report after this study. Lessons learned.

There was another question I missed: "Based on your findings, are meta-meta-analyses ever possible to do effectively, following best practices?" Perhaps we answered that earlier; sorry about that. As conducted here, no, they are not.

Right, then, if we do not have anything else to say, shall we end this session? Yes, thank you so much for listening. One last question: do we have any plans, Thomas, to look at other popular studies or books that may not have followed best practices, to continue this research? That is a good question. I have not seen a synthesis of the same size in other works in comparison to Hattie.
I think this is one of the biggest living syntheses there is right now, and we are somewhat tied to educational research, so we are going to continue on that track, and on that track there is nothing like this one. No, and not across disciplines either, I think, given the sheer scale of the endeavor Hattie has undertaken. We should give him some credit, of course, because, as stated earlier, he really put the idea of evidence-based, science-based teaching on the agenda in several countries, at least in Europe, at a time when it was needed. But it is sad to say that he brought so many flaws into it that the damage is perhaps greater than the benefit. We have not seen anything like this in similar disciplines, like psychology; perhaps it exists, but I think we would have come across it by now.

Our way of continuing this work is more in the living-review format, where we combine original studies instead of doing a second-order meta-analysis. The idea of creating, for example, a community-augmented meta-analysis platform that we can keep updating with original studies from the start is, I think, a better way to go than cementing these kinds of second-order summaries, of which this is the result. I do not think second-order synthesis is how meta-analyses will be conducted in the future anyway; everything is going to be uploaded into more integrated applications or websites. There are enough problems to handle when conducting a single meta-analysis, and if you combine different meta-analyses you aggregate a lot of flaws into that synthesis, flaws that may be hard to even discover.

All right, it is evening in Sweden, so I at least am going to take the weekend off. See you, guys.