It's my pleasure to introduce our speaker today, Dr. Ben Motz. Ben is an assistant professor at Indiana University's Department of Psychological and Brain Sciences. Ben holds a bachelor's degree from Indiana University, a master's degree from UC San Diego, and a PhD from Indiana University, all in cognitive science. Dr. Motz will be speaking to us today about replicability, generalizability, and the ManyClasses approach to open science. Ben, over to you.

Thank you so much, Craig. It's such an honor to be here. It would have been my honor to be here anyway, to be able to hang out with Craig. It's great to keep in touch with you and to see all the work you're doing. But it's also an honor to be a participant in what you're doing with POSE. I think it's important. And yeah, I'm excited for the fact that it exists and also the fact that you're participating in it. It's my hope that today, because we've got a small crowd, this can be relatively interactive. So if you have any questions, I'd encourage you to unmute, as long as you feel comfortable with your unmuting being recorded. But if you'd rather wait until the end for Q&A, then yeah, I think that we'll stop recording then. But feel free to chat with me. I'd be excited to follow up on any of these things. These are topics that I care deeply about.

So as Craig mentioned, my name is Ben Motz. I'm an assistant professor here in the Department of Psychological and Brain Sciences at Indiana University. And I'm excited to tell you about ManyClasses. I'll tell you more about ManyClasses and the variety of people involved. But I want to start instead by framing things, and maybe level setting, by just mentioning this first word in the title: replication. So I think that it's consensus that replication is in many ways the cornerstone of science. This is something that has been talked about very recently, certainly in psychological science; that's where I'm from. But it's also a feature of science writ large, going back to, I don't know, Boyle and even before that. And the general idea is that if an effect is real, if there's something out there in the world that, you know, constitutes a pattern that's systematic and that's reliable, then it should be observable under the same conditions as when it was originally observed, with adequate statistical power. So when we write scientific manuscripts, we include methods sections, not to just kind of say, look at all the cool things I did, but instead to be able to give people enough detail so that if they did the exact same thing, they would observe exactly the same results.

And if you haven't already heard about it, there's been a lot of discussion about something that's called the replication crisis. Just out of curiosity, have you heard of the replication crisis before? So there's thumbs up. That's exactly what I wanted to hear. It's been estimated by many people in many different ways across many different fields, and to provide a coarse metric: if we were to do the thing that Simons and Boyle say we ought to do, that is, to try and reproduce the same results that were observed using the same method with adequate statistical power, we've observed in psychology a roughly 40% success rate. There are differences depending on what kind of psychology research you're trying to replicate. Lest you believe this is just an issue with psychology, it's been observed to be even worse in other disciplines. In cancer biology, the successful replication rate is more like 10%.
And Baker, who is an editor in the Nature Publishing Group, did an interesting thing in 2016. Instead of trying to measure the success of replications, Baker instead just went out to thousands of researchers and asked them a variety of questions about replication, and found that the large majority of researchers have attempted to replicate things and have failed at it. So even though there might not be published efforts at these, it's something that's commonly accepted across science that the things that we're publishing might not be so replicable.

There's been a lot of discussion about why this is the case, and I imagine this is something that you guys might be discussing in POSE a little bit as well. I want to offer one candidate explanation, because it's the one that we tend to focus on a lot with ManyClasses; call it an escape hatch. Maybe it's the case that every scientific study that's ever been conducted and has observed an effect has observed a real effect. And the reason that we might not replicate is because any individual effect just depends on a vast constellation of situational factors. And one should not expect that those situational factors would just easily kind of stay fixed in a replication attempt. Maybe we can, I don't know, get past the replication crisis by just saying, no, there's just this other variable that the person who was replicating didn't take into account, and if they had, then it would have been fine.

There have been lots of people who've said this, and a lot of people who've argued against it. I want to highlight one person in particular. His name is Tal, Tal Yarkoni. Tal Yarkoni had a very public Twitter presence for a long time that's since faded. He was an academic who quit his academic job to go work at Twitter before Elon Musk took over, and now he's searching for other employment. Tal is interesting for a lot of reasons. He's interesting because he was a self-proclaimed meanie on Twitter. He didn't like making people happy; he liked engaging in arguments, maybe that's the right way to say it. And he also really likes ice cream. The dude has lots of pictures of himself eating ice cream.

And Tal took a stance that's kind of like what I had on the previous slide, and that's that there's no replication crisis. Instead, what we have is a generalizability crisis. He wrote this up in an important outlet in psychological science, Behavioral and Brain Sciences. It was actually a target article that attracted a lot of attention; I've written commentary in response, and a lot of others have as well. And the core idea behind what he said in his article on the generalizability crisis is that researchers, when they make a claim about science or what's going on in the world, are failing to specify the scope of their inferences. They're stating platitudes when they observe an effect, like, this thing affects that thing, but they're not saying the contexts and the specific, I don't know, necessary and sufficient prerequisites for that effect to be observed. And maybe that's because people are scientists, and in science we're in the business of identifying effects. And Tal went on to say that maybe if what people needed to do, instead of making theoretical progress, was to actually improve things in practice, they'd be more careful about the scope of their inferences. This is a really interesting claim to me because I care a lot about improving things in practice. I am ardently theoretical.
I care a lot about theory. I think that theory exists; it's important to build it. But I also care about change in practice. So I'll just highlight that when we talk about education, and talk about improving education, in a lot of ways what we're doing is making causal claims about things that will result in better outcomes for students. So in research on improving learning in practice, we might ask questions like: what classroom practices result in better memory for the to-be-learned material? What interventions increase engagement? What materials improve performance? Yada, yada, yada. And all of these carry the same sort of pattern as many of the theoretical research studies that Tal Yarkoni was focusing on: X causes Y. And I care a lot about the method that would be used to demonstrate that causal relationship. Like, I think it's a general truth that experimentation is the best and most compelling method for demonstrating a causal effect out there in the world. It's not the only method, but it's pretty good. And experimentation is hard, especially in education. This is something that education researchers have been arguing for a long time. And that's why people like me in psychological science, when studying human learning or even instructional practice, have oftentimes tended to take students out of classrooms and put them in laboratories, and then hold everything fixed so that we can examine what improves learning in a very sterile setting. I don't think that that's the right way to do things. If we're actually interested in improving learning in practice, it would be useful to be able to measure learning in situ, in the places where it actually occurs. And things get complicated when doing field research, specifically using experimental education methods.

I'm going to give you an example, and it's going to relate to all this stuff on replication and ManyClasses. And still, if you've got any questions or anything, feel free to stop me. OK, you're good with that? Have you heard of retrieval practice before? Retrieval practice. It's a thing, and it's a thing that's been known for a super long time. It's come under other names; people have called it the testing effect. Francis Bacon mentioned this general pattern when he was thinking about remembering text, so like if you had to give a speech and you wanted to memorize it. He wrote, if you read a piece of text through 20 times, you will not learn it by heart so easily as if you read it 10 times while attempting to recite it from time to time and consulting the text when your memory fails. That idea that you shouldn't just be rereading, but instead that you should be testing yourself on it, or attempting to retrieve it from memory, is really what constitutes the idea behind retrieval practice: that we'll have better memory for something if we test ourselves on it. And this was also observed by, you know, the other old white guy on the screen, William James. He wrote, you know, 300 years later: in learning by heart, for example, when we almost know the piece, it pays better to wait and recollect by an effort from within than to look at the book again. There have been many studies on this. And Dunlosky and colleagues wrote a compelling review of cognitive methods for improving student performance in education settings and found that practice testing is an especially reliable way of doing it, on the basis mostly of laboratory research.
They wrote that practice testing benefits student performance across a wide range of criteria, tasks, and retention intervals. Really, across the range of cognitive psychology interventions that you could possibly do in a classroom context, practice testing seems like one of the most efficacious things you could do as a teacher. Not just tell people things, but instead have them try and remember the thing that you told them they should try to know.

Okay, so if this is the conclusion, what teachers ought to do is incorporate practice testing into their classes. And there was a study that did exactly this. So in 2018, Regan Gurung and Kathleen Burns came out with a research article where they put evidence-based claims to the test: a multi-site classroom study of retrieval practice. Basically, they took eight nearly identical introductory psychology classes that had about 500 students total, and they split these eight in half. They had four classes where they were encouraging students to practice retrieval on their homework quizzes, so they were encouraged to do the homework quiz and practice and resubmit it multiple times; I think that in some classes this was actually incentivized. And in four classes they didn't do that. They made the quizzes available, but it was the case that students could just look at them or whatever. And to Gurung and Burns' shock, they found that students who took the quizzes multiple times actually had lower exam grades and course grades than the students who took the quizzes once. So it was true at the group level: the four classes where they had students practice retrieval actually had lower performance. And the students who did this more often had lower performance than the students who didn't do it.

So I want to make two claims. I want to first say Yarkoni's wrong that in applied research we're more specific about the scope of our inferences. When we're doing education research in practice, it's oftentimes the case that the same challenges of replication come up, and we're not super clear about where we'd expect to see an effect in practice. But I don't want to just pick on Tal. I also want to say that he's probably right, that researchers generally just aren't clear about the scope of their inferences, and there are contingencies that we're just not measuring or paying attention to. So if we're trying to improve things, if we're trying to say that there's this thing out there in the world that we could do to students that would cause improvements in student learning, then basically what we're saying is that there are generalizable causal effects that we could leverage to improve outcomes.

But there are two issues. One is that rarely in education science do people care to check. The work that I just showed you by Gurung and Burns is rare in education science; there are very few studies that have actually tried to replicate an observed effect in education settings. And in those rare instances where people do this, the original results aren't particularly robust. So it more or less mirrors the general patterns in psychology: about half of published replications find evidence that's consistent with the original finding. And I'm very close with a funding agency here in the U.S. called the Institute of Education Sciences. They've been in the habit of funding replications, and they found that one third of IES-funded replications find evidence that's consistent with those original findings. And that's not so great.
In fact, it makes the director of IES mad. He wrote a blog post in 2018, and he just kind of laid it all out there. And I've got a quote from his blog post where he was claiming that research on student learning writ large is broken. This is crazy coming from a guy whose agency funds research on student learning. He wrote: A central goal of the mission of IES is to identify what works for whom under what conditions. Unfortunately, most of the studies that have found impact contribute little to helping us meet that goal. Many, if not the great majority, of these projects were carried out in a single location and are tested using a relatively small number of settings and teachers and learners. Given the limited scope of these projects, it's usually impossible to judge whether the tested interventions work with different types of students or in different education venues.

This is something I've observed in my research personally. I don't know if you've done research in classroom settings, but maybe this is something that you've experienced in your own work: when you do an experiment and the theory or the past studies that you've tried to replicate pan out, when everything's successful, it's super easy to publish. Everybody believes it. But if you observe something different, there's like a ton of skepticism. It's like, oh man, could it be that the original thing wasn't true, or did I just mess up? And this has been something that I've experienced for like a decade and a half of my professional career. It's easy to publish stuff that's in agreement with what's known. It's hard to publish stuff that's not in agreement with what's known. And I only just in the past couple weeks learned that there's a term for this, so I'm excited to share this new term with you guys. It's called the experimenter's regress. Have you heard of this before? It's interesting; there's like a whole philosophical debate on whether it's real, but it feels like a thing to me.

So imagine that you start by deciding to do a study. Maybe you're repeating some sort of paradigm that's been used somewhere else but making a minor modification. So you're repeating something that's already been done, somehow. And if you observe no effect, then there are two different things that you could do. You could conclude that the original effect is wrong, but you have weak evidence for that, because you've observed nothing. So instead, it's more common for people to become skeptical about the method. So all they can do is repeat the study. And if they observe no effect, we're kind of back to the beginning. And this is a regress that doesn't have any solution. And this, I'm saying, is perhaps what's going on in psychological science, and maybe what's going on in education science as well.

So just to kind of bring us back to what Mark Schneider said and what I've been saying: when people do experimental research on classroom learning, it's practically always just conducted in one class. And when you observe nothing, it's unclear what to do with that finding. It's unclear whether your results are emblematic of what would be observed in other classes. And even if you observe something, it's not clear why, whether there's a causal effect that you've observed, or instead maybe there's just something weird and idiosyncratic about the sample that you had. And if this is the problem, it should be pretty clear what the solution is.
So instead of just doing research on one class, you should do it on many classes. And that's why we call our project ManyClasses. So let me tell you about ManyClasses. ManyClasses investigates an educational practice, a possible causal effect, across dozens of class contexts in a way that maintains the rigor of a randomized experiment. As I said before, I care a lot about experimentation being the right method for demonstrating what improves what else. And it also does this in a way that uses materials that are authentic to class norms. We don't use synthetic stuff. We don't take students out of their classes and put them into weird things. We don't inject our own materials into the class. We actually want to be observing what would have been expected in these classes anyway. And in doing so, it's really important to us to have extensive and open documentation of every class or implementation. This is where we'll get back to discussions about open science. If you're going to do something that attempts to demonstrate the variance of an effect, it's super important to be transparent about what the implementation is. Otherwise, you've just got a big sample.

The way that we do this with ManyClasses is we take a class that is doing whatever, I don't know, like they've got weekly quizzes or they've got occasional quizzes or whatever they've got. They've got some normal learning object, however the instructor normally does it. And what we do is we manipulate it in some theoretically relevant and interesting way. We're not introducing new materials; we're just taking what the instructor already does and manipulating it. And then we have some class assessment on what the students have learned from that original or manipulated object. And it would be the case that some students get the old thing and other students get the new thing. We have a crossover design, so that students who got the new thing then get the old thing on a subsequent class object, and the students who got the old thing now get the new thing. And then we have a new assessment of what they learned from that new topic, so that everybody's got parity in what their experimental treatments are. It's not the case that some students got one way of learning and other students got another; everybody got the same treatments overall. And it's not the case that some students got new information that others didn't at any particular time.

As I mentioned when I got started, ManyClasses is super collaborative. So in the article that we published on the first ManyClasses study a few years ago, there were many dozens of collaborators. So we had a long list of contributors who I should really give credit to, specifically Emily Fyfe and Josh de Leeuw and Paulo Carvalho and Rob Goldstone and Janelle Sherman. So I'm giving this presentation, but imagine, like, their spirits all around us, also being here too. It wouldn't have been possible without this amazing group.

Okay, so let me tell you a little bit more about the sample from the first ManyClasses study. And you'll stop me if you have questions, right? I'm gonna go quickly through it, okay. So as I mentioned, we had about three dozen classes. There were 38 classes total, and these were all from colleges across the US. We sampled specifically from 15 campuses across five different institutions.
So some of these institutions, like IU and Penn State and Minnesota, have multiple campuses that actually are interesting and diverse, and we sampled from 15 campuses. The way that the process worked, and I'll tell you a little bit more about it, is that we started by getting approval at these campuses to do the work. In the US, we have this law called FERPA; it's the Family Educational Rights and Privacy Act, and I'll tell you more about it. So we had to get approval basically from all these different campuses that we could do this research. Then we posted an open call for applications to the faculty of those institutions where it had been approved. Once a faculty member submitted their application, we basically let them into the study. We didn't say no to anybody; that would have been kind of silly. So yeah, we said yes to all the faculty who said that they wanted to play, and we then asked them to add one of the researchers to their Canvas sites. Across all these different institutions, they all used Canvas; that was what we used to manage the study. We administered consent in a Canvas assignment; I can tell you more about that later if you want. They implemented the experiment, and eventually what the instructor would do is map scores on actual assessments in the class to what the treatments were for those different learning objects. And then later we had schools export data for the students who did consent. We had about 80% of all students who were in the sample provide consent, so there was a willingness to play.

And we had a really cool sample. There was computer programming and humanities and natural science and physical science and social science. And the strangest class that we had participate was actually a speech class, where there wasn't any written anything; students had to give oral presentations. So all of our measurements of that class are scores on rubrics that were provided by the teacher.

But I wanna go into that first step there, the approval for an exception, because I think that's the most relevant to discussions of openness and transparency. I mean, it's my guess that as you're listening to me, if you haven't already done classroom research, you're thinking that's probably the hardest part of this: if you're gonna be open about what goes on in classrooms, you're gonna be sharing private crap. And getting approval to do that might not be super easy. And we really wanted to share things openly. So as I mentioned at the start, we didn't just want a big sample, we wanted a diverse sample. We wanted to be able to represent the range of possible educational implementations, and to represent the diversity of our sample, and to avoid this kind of unanswerable question about generalizability, we collected tons of data about the class and about the assignment and about the school and about the student: about the student's past performance in the class, about the student's academic level. And we also made all these data public. As I mentioned before, there are laws preventing this, but there is an exception for research. So what we decided to do was just to be super transparent with the administrators at these schools, to slowly give ourselves time to get sort of like administrative buy-in, and just have it be very, I don't know, voluntary. So we explicitly sought permission from every school, from every teacher, and from every student. So again, there was an assignment in Canvas that said, if you want to, you can participate in this study.
If you don't want to, then there's this other thing that you do, and we won't look at any of your data. If you participate in the study, you understand that the university is gonna give these researchers details about you, but they'll keep it very private. Sorry, it's not that we keep it private, it's that we keep it anonymous. So we rigorously de-identified all the data. We made sure that there weren't any concerns about minimum cell sizes or possibilities of re-identification. And yeah, again, we got to 80% participation among the students who were in these classes. So I think that there's a willingness to share their data for research if given the opportunity to do so.

I've talked to a few people in Canada, and they've said that privacy laws are hard there. I don't know if this is your experience too, but I would really advocate for a perspective that doesn't view these privacy laws as being obstacles, but instead sees them as more like opportunities to engage the potential participants who are providing the data. Because if you can get the permission, then there's no legal problem with it. You've been explicitly granted a release of something that was the participant's. And if you can do that, then maybe the legal obstacles kind of go away. So engaging and having conversations about this is good. There are other people who disagree with that view, like, I think, even Craig's former advisor, who I have debates about this with. Some people think that it's just a moral imperative that we should be analyzing all students' data; like, we have this need to improve education research, and we shouldn't be asking for permission, we should just do it. I don't know, I think permission works well. So that's my stake in the ground.

At this point, I should probably tell you more about the study that we did, rather than just talking around it. Is that cool with you? So let me tell you about the experiment. Okay, the first ManyClasses study; and there's been another ManyClasses study, which I haven't told anybody about yet, and we can get to that later. But the first ManyClasses study was about the timing of feedback, and you can learn more at our website. It investigates how the timing of feedback on student work affects learning from that work. There's an educational doctrine that really goes back to, like, B.F. Skinner that says that feedback should be administered as quickly as possible. That is to say, if a student makes a mistake, bam, you should let them know about it right away, so that people don't encode wrong information and so that you can fix it before it sort of persists or, yeah, gets messed up. But there's been more recent research that suggests that delayed feedback may be more beneficial. It just hadn't been conducted at scale, and the types of studies where that had been observed were really cursory. And this topic had a lot of benefits for the first ManyClasses project. One is that feedback is relevant to all classes. I've never found an educational system that doesn't include feedback; that would be crazy, like to say, we'll never tell you what you're learning or how well you're learning it. And it also had this cool benefit, technically speaking, that the experimental contrast, whether students get their feedback right away or get it delayed by a few days, can be implemented natively in Canvas. In fact, it's not just a contrast that's theoretically interesting. It's also something that happens already in practice.
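Just to make that contrast concrete, here's a rough sketch of how it might be toggled programmatically. To be clear, this is not the ManyClasses tooling; it's a minimal example that assumes Canvas's classic Quizzes REST API and its show_correct_answers settings, and the host, token, and IDs are placeholders you'd swap for your own instance (and verify against its API documentation).

```python
import requests

# Placeholders: your Canvas host, an instructor-scoped API token, and the
# course/quiz IDs would all come from your own instance.
BASE = "https://canvas.example.edu/api/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

def set_feedback_timing(course_id, quiz_id, delayed_until=None):
    """Toggle a classic Canvas quiz between immediate and delayed answer release.

    delayed_until: ISO 8601 timestamp to withhold correct answers until,
    or None for immediate feedback.
    """
    params = {
        # Students may see correct answers...
        "quiz[show_correct_answers]": "true",
        # ...but, if a date is given, only after that date.
        "quiz[show_correct_answers_at]": delayed_until or "",
    }
    resp = requests.put(
        f"{BASE}/courses/{course_id}/quizzes/{quiz_id}",
        headers=HEADERS,
        params=params,
    )
    resp.raise_for_status()
    return resp.json()

# Immediate-feedback condition:
#   set_feedback_timing(1234, 5678)
# Delayed-feedback condition, answers released a few days after the due date:
#   set_feedback_timing(1234, 5678, delayed_until="2024-09-06T00:00:00Z")
```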
So, I don't know, some people delay the feedback that students get on their quizzes even without thinking about learning benefits. Any guesses why a teacher would do that? Or do any of you do that already? I suspect it might be because they don't have time. Well, that's a great answer, actually. So I should have said this earlier: we're mostly just looking at, like, auto-graded multiple-choice quizzes. That obviously didn't happen in all classes; there was this weird speech class that didn't have multiple-choice anything. But yeah, it was the case that there were things that could be administered right away. But even in that situation, people sometimes delay it. The reason that some people do that is to prevent students from sharing the answers with each other. So let's say that you've got a quiz that's open on Canvas for the weekend. If a student pressed submit and then immediately saw the right answers, they could possibly go and share those with their friends. So a lot of teachers delay feedback already just because they don't want the answers out there in the wild.

So to be really clear about the contrast: in Canvas, as soon as a student presses submit on a quiz, depending on what the settings are, they could either have the feedback delayed, in which case they would see what you see on the left, where it says your quiz has been muted and it'll be unmuted at some later date; or they would see the results of that attempt, and they may also be able to see feedback, if that had been included in the quiz or whatever the assessment was.

So as I mentioned before, we did this as a crossover design. It had a lot of detail; I'll walk you through this. It was the case that we started with 38 classes. And every class was randomly assigned to either an incentive or a no-incentive condition. Bear with me on this. Here's a possibility that we wanted to avoid: maybe students would never look at the feedback. If that was the case, then our manipulation would not have been super great. So what some classes did was they incentivized students to look at the feedback. We usually left it open to the teacher to do whatever they wanted to do; this was another element where we were more interested in variability than in control. I should say, what they normally did was they had a follow-up homework assignment where students were instructed to reflect on the feedback that they got. So there were some points associated with that follow-up assignment; it was usually embedded in some downstream normal assignment. And that made it so that students were more likely to look at the feedback, for sure. So either the class used incentives or didn't use incentives to have students just look at the feedback.

After we figured out which condition the class was in, all students did informed consent in a way that was actually hidden from the instructor. It was fancy, the way that we put it together. We created a large bank of, like, 10,000 random codes and said that some of the codes were associated with providing consent and other codes were associated with not providing consent, and the instructor didn't know which was which. We delivered the codes in a way that the instructor couldn't see. So the student would type in a code that's associated with whether they did or didn't want to provide consent. The instructor could see the code, but only we knew whether a student was opting in.
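To give a flavor of that code-bank idea (and this is just a sketch of the concept, not how ManyClasses actually implemented it; the function and file names here are made up for illustration): you generate a pile of opaque codes, randomly map half to consent and half to non-consent, and keep that mapping with the research team, so a code sitting in a gradebook reveals nothing on its own.

```python
import csv
import random
import secrets

def make_consent_codebank(n=10_000, path="codebank.csv"):
    """Generate n opaque codes, half mapped to consent and half to non-consent.

    Only the research team keeps this mapping; the instructor only ever sees
    the code a student typed, so the code alone reveals nothing about consent.
    """
    codes = set()
    while len(codes) < n:          # set membership guards against collisions
        codes.add(secrets.token_hex(4))
    codes = list(codes)
    random.shuffle(codes)
    rows = [(code, "consent") for code in codes[: n // 2]]
    rows += [(code, "no_consent") for code in codes[n // 2 :]]
    random.shuffle(rows)           # so row order doesn't leak the mapping
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(("code", "meaning"))
        writer.writerows(rows)

make_consent_codebank()
```

Consenting students would then be handed a code from the consent pile (non-consenters from the other pile) through a channel the instructor can't see, and only the holder of the codebank can decode the gradebook afterwards.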
Okay, and then after we got consent, everybody did this: they had one homework where they got delayed feedback or immediate feedback, and then there was an assessment of what they learned. And then there was the crossover: if they had delayed feedback, they then got immediate feedback, and then there was another assessment. Then we pulled all the data. We did a variety of moderator analyses, heterogeneity analyses, and also an omnibus analysis of global effects.

Okay, but we didn't just do it right away. It's worth kind of also saying we registered it. So, have you heard of preregistration and stuff? Yes. It's been wild: ever since I started doing preregistration, I can't stop. It's like a drug. If I did a study now that I didn't register, it would feel weird. So once you get started, it's worth, yeah, it's worth getting started, maybe is the right way to say it. It's a cool gateway to openness. It was easy with ManyClasses because our design was well established long before recruitment. As I mentioned before, we were actively seeking universities to agree to this before we were gonna be able to do anything. So we knew what we were asking of participants and of universities and of teachers. So when we did ManyClasses, we took the additional step of first preparing a complete analytical plan based on simulated data, so that we knew how we would analyze the data when it came in, and we knew specifically what data we would need to ask for from the teachers and from the institutions.

We also took that analytical plan and we submitted a registered report. This is a manuscript that you type up that is data-blind. It's an interesting and strange experience, but again, once you do it, it feels normal. You write a manuscript without knowing what the results are, and you kind of just leave the results blank. And then in the discussion section, you say what your discussion would be depending on what the results were. And once that registered report was accepted, we pre-registered everything. So yeah, we pre-registered the stage-one accepted registered report, the analytical plan, and also the method and all the materials that we were gonna give to teachers.

And this had a few benefits. One is that it helped dramatically during recruitment. When we went to a university and said, this is what we're gonna do, they didn't just have to trust our word for it: we've got this public registration of it, we've got this provisional acceptance at the journal. So it was easier to do recruitment of institutions and teachers because we had been so transparent about what the research was. And it also helped for security approvals. Again, to have these researchers say, here's exactly what I'm gonna do, here are exactly the data points that I'm asking for, and I've thought through what their relevance is, was really useful. And also, having done that made it so that the analysis downstream was pretty easy as well. We had already thought through what the analytical pipeline would be.

So maybe I should tell you what the results were. First, I don't know if you've seen a forest plot before, but this is what the plot will look like. Imagine every row of this plot is a class, and we plot the class here. If this is pretend data, if the class falls on the right side, it's the case that immediate feedback is better. If the class falls on the left side, it means that delayed feedback is better.
And these are estimates at the class level. And the dot is the median estimate; I'm sorry, it's actually the modal estimate, I should have said that. And the line that you see is the range; it's called the 95% highest density interval. It's basically our credible interval of where the effect actually lies 95% of the time. If that 95% highest density interval overlaps zero, like here in this class, we would say there's really no difference between the conditions, or it could be the case that it trends in the direction of delayed feedback. Oh, and if it's a blue class, it did have incentives; if it's red, it didn't have incentives. You get the basic layout.

Okay, drumroll please: here's the results of 38 classes, all doing this experiment. It's super much right over zero. Like, it's really right on zero. And it's not the case that there's a ton of uncertainty here. In fact, if you see that bottom line in black where it says overall effect: if I just plot the posterior distribution, I'm sorry, we do all of these analyses in Bayesian terms, and I'm kind of not gonna talk about Bayesian methods that much, but if we just plot the actual distribution that we've estimated there, it's pretty tight around zero. So we can make the conclusion, based on these data, that there's just no broadly generalizable difference in learning performance when students receive immediate feedback compared to delayed feedback.

So before I get too deep into the theoretical relevance of this, I'll also say that we did the thing that you might imagine any reasonable scientist or curious person would do. And that's to ask, well, are there differences between the classes, where, like, in some classes it was bigger than in other classes, or in some classes immediate is better and in other classes delayed is better? And we looked across a lot of moderators and found no strong evidence of systematic differences in the effects of feedback timing between students or classes. As I mentioned before, we were pretty aggressive with data collection about the implementations of these experiments. So we had lots of variables, and across all of these variables we found no strong evidence; what I mean by that is no significant moderators.

We decided to look much more closely at some that we thought maybe really should show an effect. So we looked more closely at ones that were specifically about the implementation of the experiment and how students got feedback. And here I'll show you some scatter plots and estimated trends that come from that. On the left graph, you see on the x-axis the number of treatment quizzes, and on the y-axis you see the benefit of immediate feedback. So if it's positive, the benefit is toward immediate feedback; if it's negative, toward delayed feedback. And what you can kind of see is that in situations where there were incentives, sorry, I said it backwards with the red and blue, didn't I? So red lines are incentives and blue lines are no incentives. What you can see is that as we get more treatment quizzes, as there are more opportunities for students to get feedback, there's an increasing benefit for delayed feedback, but there are super few classes way out here in the sample space. So it just wasn't significant.
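A quick aside for anyone curious where per-class estimates and 95% HDIs like these come from: the real ManyClasses analysis scripts are public and much more involved, but here's a minimal partial-pooling sketch, with PyMC chosen as an assumption and the data simulated purely for illustration.

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulated toy data, not ManyClasses data: one row per student, where y is a
# standardized score difference favoring immediate (+) vs. delayed (-) feedback.
rng = np.random.default_rng(0)
n_classes, n_students = 38, 2000
class_idx = rng.integers(0, n_classes, size=n_students)
y = rng.normal(0.0, 1.0, size=n_students)  # simulated with no true global effect

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)           # overall (global) effect
    tau = pm.HalfNormal("tau", 1.0)          # between-class heterogeneity
    theta = pm.Normal("theta", mu, tau, shape=n_classes)  # per-class effects
    sigma = pm.HalfNormal("sigma", 1.0)      # within-class noise
    pm.Normal("obs", theta[class_idx], sigma, observed=y)
    idata = pm.sample(1000, tune=1000, random_seed=0)

# One row per class on the forest plot: each class's estimate and 95% HDI,
# plus the overall effect that gets plotted at the bottom in black.
print(az.hdi(idata, var_names=["theta", "mu"], hdi_prob=0.95))
```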
There's a similar pattern as the delay is increased: as we increase the number of days between the due date and the delayed feedback to, like, four or five, we see benefits that might point again in the direction of delayed feedback, but there are just very few classes that are way out here, like three or maybe six classes out of 38. So there's just not enough data to be able to make strong inferences. And then also we observed that if there were a lot of retrieval-based exam items, and what I mean by that is that students had to actually produce a response as opposed to just marking a response, so instead of multiple choice, if they had to fill in the blank or write an essay or something, then there was also a potential benefit for delayed feedback, but 75% of the sample just didn't do that very often at all. So directional trends suggest that increasing the dosage of feedback and the length of the delay result in benefits for delayed feedback, particularly in situations where students are incentivized to look at the feedback and there are more retrieval-based tests, but there's just sparse coverage of these areas in the current study.

So I wanna make some conclusions. First I'll say that there doesn't seem to be a single invariant benefit to immediate or delayed feedback. So it's kind of like Tal Yarkoni was right. It might be the case that when we describe an effect, we're describing something that doesn't exist, and that instead you might observe some effect here and you might observe some effect there, but there's no global effect that we should be expecting to observe in any individual experiment. Pinning down the settings where benefits might be observed, especially in education settings, will require many more data points than what we were able to collect in the first ManyClasses study. And we do observe some hints of improvement for delayed feedback; it's just an undersampled context. So we need to do it more, and we need to do it bigger.

Nevertheless, ManyClasses 1 demonstrated that large-scale distributed experimentation in online learning settings is feasible, and that we can share students' data, their actual exam scores, publicly, as long as we de-identify it, institutions are okay with that, and the students are okay with that. But we've also observed something else, and that's that it was super hard. So I wanna be totally transparent about that too; I hope this is good, and being transparent about the personal trials and tribulations of an experiment is worth doing too. So when I think about the lessons we learned from the perspective of having administered a Many-X study, we think that there's a lot of good reason to design research with variation in mind, because, again, global effects just might not exist. And it would require us to seek broad involvement from stakeholders to do that. We have to talk with all the different places where a research study might be implemented about the terms of the research and what the policy should be and how they would be involved.

When we first got started on ManyClasses, we reached out to the Center for Open Science. We talked with Brian Nosek; we actually had lunch with them, and David Mellor, who's the policy director at the Center for Open Science, said this one thing that we didn't believe. He said, plan on taking twice as long as you expected to take. And we were like, no, you know, it couldn't be that hard. But he was totally right.
I think it took like four times as long as we initially estimated. The whole study took 36 months to put on; it was like a three-year study. It's not super sustainable if it takes 36 months to conduct one research study. So a takeaway is also that if we're gonna do experimental research in education in a way that's open, and in a way that's distributed, and in a way that accepts the variability of different education settings, in a way that, you know, documents the materials that are out there, we need better tools. We just need better tools. Other research areas have great tooling. There are lots of microscopes out there if you're a biologist, and that's useful; you can expect that other biologists all have microscopes. But for people who are doing social and behavioral sciences, the tools aren't as advanced as that. For this study we did develop some interesting stuff for encrypting consent and for condition management, but for current and future studies we've got a tool that we're relying on much more heavily, and I'll briefly tell you about it. I call it Terracotta, and you can learn more about it at the website. And coincidentally, not having anything to do with POSE, I'm also talking with some faculty at UBC about potentially making Terracotta integrated with UBC's Canvas site. So I don't wanna make too many promises there, but at least it's under consideration.

And the idea behind Terracotta is that we can automate all the stuff that was hard with ManyClasses if we just had a tool that was sitting in Canvas. Teachers could implement the study on their own, specify what the conditions are with our help, like, make that contrast, and then click a button. And literally, Terracotta is awesome. There's this button that is export data, and it exports the data in a standardized format that's rigorously de-identified. There's no identifier for the Canvas course site or for the instructor other than those that are internal to Terracotta and otherwise meaningless. So Terracotta gets the job done in a cool way, and it makes it so that we can be more open. And oh, I should also say that in the Terracotta data export, it's not just the students' responses and their scores; it's also the materials that gave them those scores, so it's also the instructional intervention. So it's really easy, once you do an experiment in Terracotta, to just kind of post a data set up on the web that should be able to be public, as long as you've got approvals from your institution, and that really does a good job of documenting what happened in a way that others could then look at and maybe reproduce.

We recently entered Terracotta into the XPRIZE Digital Learning Challenge. This was a US nationwide competition to see who could make the best tool for running education experimentation. And I'm happy to say that just, like, two months ago we were named the runner-up, which is fine; I'll take runner-up on anything nationwide. And yeah, we got a good pot of money to continue to develop Terracotta. So it was a wild competition; they really put us to the test. The idea behind the competition was to systematically replicate an experiment in a real education setting at least five times in no more than 30 days, and to do it with at least three distinct demographics. So we went to K-12, we had different geographic areas, we had wildly different populations involved.
And we managed to run a distributed ManyClasses-style experiment in something like six months, where the actual implementation was bounded within 30 days. It was awesome. So I'm suggesting that with tooling, things get more feasible, things can become more possible. And it's worth saying Terracotta is totally open source. So if it turns out that UBC doesn't want, like, me to host Terracotta for UBC, UBC could also just spin up its own instance. It's all on GitHub. So this is also a feature of openness that I think is important: people should just share their stuff.

Okay, zooming out, the big picture as it relates to science is that I wanna make the claim that small-sample research on what works just is not suited to answer the question. You can't find out what works writ large by conducting a single small-sample study, because it might be the case that you didn't observe where something works. So ManyClasses is one way to facilitate experimental research that addresses this question of what works, what the causal influences on student learning performance are, at scale. And I also wanna say ManyClasses is just one model for this. I've seen really strong examples that come out of the Education Endowment Foundation in the UK; they do distributed experimentation too. I don't think it's as open, but they do it. And there's also an organization called Character Lab. At the University of Michigan, there's an effort called SEISMIC that's also trying to do this type of distributed research at grand scales. So it's heartwarming to see that this is something that other people are taking up.

And that's as it relates to science. I also wanna mention the open science picture. So I think that one of the cool things about ManyClasses is that by distributing things across multiple instances, and doing a good job of documenting these implementations, and making those implementations publicly accessible, we're avoiding unanswerable questions about generalizability. It's not the case that we'll just kind of scratch our heads and ponder why we got this effect over here or that effect over there, or in this case, no effect at all overall, because we know what was going on in these classes. Also, a really cool benefit of openness is that we're not just publishing a research study and asking people to just believe us at the end of the day. That is to say, it's not like we're just giving people a fish, like, look what we found. Instead, we're providing a fishing pole that can potentially help other people who might be interested in doing this kind of research.

So for example, I mentioned that we do Bayesian data analysis. It turns out that's not super common, and it's especially not common amid the complexities of running distributed, hierarchical, within-subject designs. So we've worked through those complexities, and now you don't have to. If you were curious how to do Bayesian data analysis, I could just show you where our scripts are, and this is how it works. So having examples out there serves an educational purpose that I think isn't that common in conventional ways of doing research. And also, we've made our IRB protocol public, so that if you're curious how these weird people have managed IRB stuff, that's also something that you could figure out. The other cool thing that I think is interesting about how we've done ManyClasses is that we've normalized public preregistration in a space where that's really uncommon, especially in education research.
It's sometimes believed that, like, we can't know in advance what's happening, that it's a dynamic flux of the students who happen to be in the classroom. I kind of reject that. I think that there are systematicities to human learning, and that if we're planning on systematically measuring and inferring those things, we could also be really clear about what hypotheses we're testing and confirming. So yeah, I think ManyClasses helped with that. So with that, I wanna say many thanks to you guys for being here. And I'm excited if you have questions. If you're curious about the experiment that we ran for the XPRIZE, I could also tell you about that; again, it's super fresh results, so I don't have much. Thank you.

Great, thank you very much, Ben. I really appreciated hearing about all your work, and thanks for sharing your time with us today. We're gonna stop our recording here, and then we'll turn over to Q&A.