And Scott's going to give us a presentation today: help on doing an impact evaluation, and what evidence do I need? Before I hand over to Scott, while I'm still admitting people, I'd like to acknowledge the traditional owners of the lands that we're meeting on. Today we're meeting on a variety of lands, not just the local South Australian Kaurna lands, but that's where I am, so I'm going to acknowledge the owners of those lands. I'd like to recognize their continuing connection to land, waters and cultures, and pay my respects to their elders past, present and emerging. And I also recognize that many of you are on different lands, and I'd like to acknowledge them as well. Before I hand over to Scott, I'd just like to alert you to an event that's happening on the 5th of July, so coming up soon. South Australia is collaborating with the Melbourne AES on a number of events, and this is one that's coming up: a panel discussion on recent evaluation publications and research, looking at what's happened in the last little while, 2015 to 2019. What are the implications of the findings of a systematic review for evaluation research and practice? What has research told us about the importance of values and valuing in relation to evaluation theories, methods and practices, and how has the research influenced or informed your own practice? So a bit of a chance for a discussion as well. If you're interested, and I encourage you to be interested, please visit the AES website and register. The link is on the website; it's a long link, so I didn't paste it in here. And now to talk about Scott a little before he begins his talk today. Scott, you'll notice, has a slightly different accent. He's originally from Canada but has been in Australia for many years. He's based in Canberra now, although he has lived in other parts of Australia as well. He has over 30 years of experience in evaluation and has worked in quite a number of prominent places: Oxford Policy Management, the Australian Department of Foreign Affairs and Trade, the Asian Development Bank, the Victorian Department of Human Services. So he's been around and done all the work. He's also a Fellow of the Society and has published widely on evaluation. I'd like you to welcome Scott, and I'll hand over to him now. We're just going to swap the screen, so bear with us while we do that. We'll have a chance for discussion at the end of Scott's talk as well. Over to you, Scott. Thank you, Mark. And hello everyone. I'll just get up my presentation screen. All right, can you see that, Mark? Not yet. Oh, it's up on my screen. Have you hit the share button? Ah, where the heck do I find that? On the Zoom screen, the share button is in the menu in the middle at the bottom: share screen. Share screen. All right. Excellent. Now I can see it. We're cooking with gas now? We're going. Excellent. Okay, over to you. Thank you, everyone. My presentation is really targeted at those of you who hire consultants to do impact evaluations, or undertake impact evaluations using your own staff; people who review evaluation reports and wonder if they're credible and defensible; and those of you who are interested in the strengths and weaknesses of different impact evaluation methods. In the next 45 minutes, I'm going to show you a way to cut through a lot of the confusion that currently exists.
We have arguments about RCTs versus surveys, qualitative versus quantitative. Is process tracing the next new thing? Is contribution analysis still great? All this confusion and debate. I'm going to show you a way to bypass and rise above those debates, and we do this by including three specific pieces of evidence in our evaluations. If you adopt this approach in your own impact evaluations, your work will be better than 90% of what's being published today. So I'm going to start off with some practical examples to illustrate some points. I'm going to talk about the philosophy behind impact evaluation methods; there's actually a lot of overlap between philosophy and impact evaluation approaches. And I'm going to talk about the evidence requirements for impact evaluation, and that's where I'm going to give you three criteria, and then we're going to practice using those three criteria. Then I'll give you a summary, there's a question for you to work on yourselves, and we'll have time for questions as well. Okay, let's make a start. This is real-world data from the US. There's actually a strong correlation between the number of people who die by falling out of wheelchairs and the cost of potato chips in the United States. This is true. And I thought about this for a while, and I think here's what's happening: in supermarkets they put high-profit items up high, and essential things that are low profit, like bread or milk, tend to be placed low on the shelves. So if you're in a wheelchair and you're reaching for a high-profit-margin item, you actually have to reach up high. And what I think happens is you reach up high, you fall out of your wheelchair, you hit your head and you die. Now, if this sounds silly, well, it is, because I just made up that explanation. The correlation between deaths and the cost of potato chips is real, but my explanation, I just made that up. The point I want to make is twofold. First, any time you have two variables that share a common trend, they will correlate even if they have nothing to do with each other; just by pure chance, they will always correlate. And secondly, you know how you can look at clouds in the sky and see animal shapes, your former wife, whatever? The human mind is a master at detecting patterns and imposing meaning. We will create meaning even if none exists. And that's what I just did with my explanation of the relationship between deaths and the cost of potato chips: I made it up. This is something that is all too easy for evaluators to do as well; when we see some pattern in data, we can impose a meaning on it. There are ways we can avoid making these sorts of false claims. How about this one? My daughter actually said this to me one day when she was about 12. She had a math exam that afternoon, and I said to her in the morning, "Lisa, have you studied for your exam today?" And she says, "No, not really, but I'll study hard next week and then I'll get a good grade in my exam." And I said, "Well, hang on, you're going to study hard next week when your exam is this afternoon?" Now, on the surface of it, that's quite absurd, but let's tease out what makes it absurd. It's absurd because we have a common understanding that something that happens today can be the cause of something tomorrow, but something that happens today does not cause an event from yesterday. Causality has a forward-moving relationship in time, and that's quite fundamental; I'll come back to that.
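To see the two-trends point for yourself, here is a minimal Python sketch. The numbers are entirely invented stand-ins for the potato-chip and wheelchair series Scott describes: two independent upward trends correlate strongly at the level of raw values, and the correlation largely disappears once you compare year-to-year changes instead.

```python
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(2000, 2020)

# Two series that have nothing to do with each other,
# but both happen to trend upward over time (invented data).
chip_price = 2.0 + 0.05 * (years - 2000) + rng.normal(0, 0.03, len(years))
wheelchair_deaths = 500 + 15.0 * (years - 2000) + rng.normal(0, 10, len(years))

# Raw correlation is very high purely because both series trend upward.
print(np.corrcoef(chip_price, wheelchair_deaths)[0, 1])   # close to 1

# Correlate the year-to-year changes instead: the spurious association
# largely vanishes, because the changes really are independent.
print(np.corrcoef(np.diff(chip_price), np.diff(wheelchair_deaths))[0, 1])  # near zero, up to sampling noise
```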
Here's another example: the young person on the plane complaining that every time the fasten-your-seatbelt sign comes on, the plane starts getting really rough and bouncing around. Of course, the young person is making the error of thinking it's the red light coming on, the fasten-your-seatbelt sign, that's causing the turbulence. But no, there's a third thing, the external turbulence, which leads the pilot to turn on the sign, and then the people inside the plane experience a bumpy ride. The young person is misunderstanding what is causing what; there's actually a third thing that causes the pilot to turn on the light, and then they get the turbulence. Now, let me put it this way: impact evaluations are about causal relationships. Does this program cause this specific effect? And philosophy includes the study of causal relationships. Within philosophy there are a number of schools of thought about what causal relationships are, what they mean and how you study them. What's not always widely known is that each of these schools of philosophy has a corresponding evaluation model. At one extreme, we have people who believe that the external world is governed by universal laws, like the law of gravity, and evaluators from this school of thought, Borich and the MIT Poverty Lab, tend to prefer true experiments and quantitative methods. On the other hand, there are people like Guba and Lincoln, the qualitative folks, who believe that the observable world is a social construct and that the concept of causality is a myth; they think there is no such thing. So they prefer qualitative methods such as participant observation studies, and they portray people's experiences and perceptions. When it comes to undertaking impact evaluations, I find one evaluation model particularly helpful, and it's not very well known in Australia: critical multiplism. Critical multiplism holds that there is an external world that is independent of ourselves, but we can never know that world perfectly; the best we can do is study it from multiple perspectives. So this then takes you down the path of multiple methods. In this approach, causal relationships are probabilistic, not deterministic. For example, if you smoke a packet of cigarettes a day for 30 years, your chances of getting lung cancer are about 20%. But not everyone who smokes for 30 years gets cancer, and some people who never smoke also get lung cancer. So the causal relationship between smoking a packet of cigarettes a day for 30 years and lung cancer is probabilistic: if you engage in that behavior, you get a much higher rate of lung cancer, but it's not 100%. In that sense, it's not a deterministic relationship, the way some people think causality works. And in critical multiplism, the question of determining whether or not a program has had a particular effect is identical to establishing whether or not the program is the cause of a specified effect. Cause-and-effect questions ask if the program led to a specific change in the target group. For example, do training programs improve employment rates for long-term unemployed youth? Do low-interest government loans lead to an increase in the number of self-sufficient small businesses?
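The probabilistic-causation point can be made concrete with a toy simulation. The rates below are invented for illustration, loosely echoing the smoking example: the cause raises the chance of the outcome without guaranteeing it in either direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

smoker = rng.random(n) < 0.3          # 30% of this toy population smoke
# Probabilistic causation: smoking raises the *chance* of disease
# (20% vs 2% in this toy model), but neither group is all-or-nothing.
p_disease = np.where(smoker, 0.20, 0.02)
disease = rng.random(n) < p_disease

print(disease[smoker].mean())    # ~0.20: most smokers still never get it
print(disease[~smoker].mean())   # ~0.02: some non-smokers do get it
```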
So, applying this in practice: in order to conclude that program X causes outcome Y (I use the symbols X and Y because if you read a textbook about this, those are the symbols generally used), three criteria need to be met. And this is the heart of my argument. Before you can say this program has caused a specific effect, you first of all need to demonstrate an association between participating in the program and the outcome. So you might get a group of people who participated in the program and another group of similar people who did not participate, and assess the relationship between participation in the program and achievement of the outcome. That's the first evidentiary requirement. The second one is establishing time order; that is to say, participation in the program happens first, achievement of the outcome happens second. And the third is the need to rule out all alternative explanations, that is, all the Zs, everything aside from the program that might have caused the outcome. So this is the heart of my argument. The three examples I gave you earlier map onto these. Potato chips and deaths: that's criterion one, demonstrating a relationship. The example of my daughter studying for the exam after the exam was over is criterion number two, establishing time order. And the young person on the airplane thinking the fasten-your-seatbelt sign causes a rough ride is an example of number three, where it was really the external turbulence. Now, I appreciate I'm asking you to think about impact evaluation and what causality means in a different way than what is normally used, and that can be a bit challenging; I accept that. Don't worry if this feels a little bit confusing just at the moment. We're going to work through some examples, and I think you might find this is actually a really helpful way of thinking. So, on to some practical examples. Whenever you see a correlation between program participation and the achievement of some effect, some outcome, there are only four possible explanations. Four, and only four. The first one is that the relationship is pure chance or methodological error; it means nothing. It's like my wheelchair and cost-of-potato-chips example: it just means nothing. There might be a correlation between the number of blue-wing butterflies in Ireland and the price of scotch in Sydney; it means nothing. Any time you have two long-term trends in the same direction, they'll correlate. They may or may not actually have anything to do with each other; it can just be pure chance. That's the first potential explanation. The second one, and this is the one we're usually interested in, is that the program causes the effect, the outcome. We're going to try to find out if that's what's driving things. A third potential explanation is that people higher on the outcome actually seek out the program, so in that sense the causal relationship is reversed. This appears to be what happens with some micro-credit programs, where people with lower incomes, but not the lowest of the low, go and seek out the program. The same occurs with some adult reading programs, where adults with underdeveloped reading skills, but not the worst of the worst, go and seek out the program, and it creates a false impression of what's actually going on in terms of effectiveness.
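One way to internalize the "four and only four explanations" point is to simulate three of them. In the sketch below (all numbers invented), a genuine program effect, reverse causation through self-selection, and a common cause Z each produce a clearly positive correlation between participation and outcome, so the correlation on its own cannot tell you which world you are in.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# World 1: the program genuinely causes the outcome (X -> Y).
x1 = (rng.random(n) < 0.5).astype(float)
y1 = 0.6 * x1 + rng.normal(0, 1, n)

# World 2: reverse causation. People already higher on the outcome
# seek out the program (Y -> X), like the micro-credit example.
y2 = rng.normal(0, 1, n)
x2 = (rng.random(n) < np.clip(0.5 + 0.3 * y2, 0, 1)).astype(float)

# World 3: a common cause Z (say, motivation) drives both participation
# and the outcome; there is no direct link between X and Y at all.
z = rng.normal(0, 1, n)
x3 = (rng.random(n) < np.clip(0.5 + 0.3 * z, 0, 1)).astype(float)
y3 = 0.6 * z + rng.normal(0, 1, n)

for x, y in [(x1, y1), (x2, y2), (x3, y3)]:
    print(round(corr(x, y), 2))   # all three show a clear positive correlation
```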
And the fourth potential explanation, and this one is the hardest of them all, is that there's something else, a Z, that causes both program participation and the effect, the outcome, and the correlation between the two is, as they say, spurious. These sorts of alternative explanations are really, really common in social programs, and they're also very hard to sort out. A common example is where Z is motivation: highly motivated people seek out the program, and because they're also highly motivated, they get good outcomes. It's not the program that makes the difference; it's their level of motivation. That's a really common example. So don't worry if this still seems a little bit confusing; we've got plenty of time, and we're going to go through some examples now. I'm arguing that to be credible and trustworthy, an impact evaluation must bring these three pieces of evidence to bear in making a conclusion: demonstrate association, establish time order, and rule out these alternative explanations. There are technical terms for each of those three things, but I'm trying to avoid them. So let's practice. This is true: there's a positive correlation between iPhone sales and the number of deaths from falling down stairs. My wife has warned me about this a few times, because I'll be busy looking at my phone while I'm walking down the street; in fact, once I even tripped over a curb. But you know what, this is another one of those trend things. We have two common trends: one trend of iPhone sales increasing over time, and another trend of an increasing number of deaths due to people falling down stairs, which is driven by population increases. One's not causing the other. They're just two things, independent of each other, that happen to be occurring at the same time. How about this one? This one's kind of interesting, and it's also true: the more firemen that are sent to a fire, the more damage is done. On the surface of it, you might think, oh gee, don't send too many firemen when my house is on fire. But in fact, this is a Z example. What's happening is, if you have a really big fire, that's the Z, you then send lots of firemen, which is the X, and the big fire also does more damage to your house, which is the Y. This other thing is driving both of them. How about this one? This is also true: there's a strong correlation between autism and organic food sales. You might be tempted to reach a quick conclusion and say, well, gee, don't feed organic food to your children. But as you might have guessed already, it's again just two trends. They're independent of each other; there's no mechanism that links these two things together. It just happens to be chance. But notice how easy it is to overinterpret what this means. This is a common one; you see it quite a lot in the literature: there's a strong relationship between participating in job training and then getting a job six or twelve months later. I was at the European Evaluation Society conference, gee, ten years ago, and someone presented a big fancy statistical analysis on this. And I said, well, yes, but how have you controlled for motivation? Surely the most highly motivated people will be the ones who seek out the job training and complete the training, and the most highly motivated people are also the ones most likely to get a job. So how have you controlled for that? And they were a bit annoyed with me, because they didn't have a good answer to that.
And it was a little bit embarrassing for all concerned. Now, one of my favorite examples: you know the Australian doctor who established that most types of stomach ulcers are caused by a bacterium? He drank the bacteria, gave himself a stomach ulcer, took antibiotics, and killed the ulcer and the bacteria; then he drank the bacteria and got himself an ulcer again, took antibiotics and got rid of it again. He did several cycles of that, and he established quite clearly that these ulcers, at least ulcers of that type, are caused by a bacterium. And yet, and here's the interesting part for me anyway, there's been research done, and I have to get this right: 80% of patients and 70% of doctors, following surgery for ulcers, will report that the surgery was helpful. But it does not work. It absolutely does not work; surgery for that type of ulcer is a waste of time, because the ulcer is caused by a bacterium. And yet 80% of patients and 70% of doctors will say the surgery was helpful when it did not work; we know it doesn't work. So I'm arguing that one needs to be a little bit careful in taking satisfaction and perception data as being equivalent to impact. These are actually different concepts, in my view. Here's another real-world example, and this one is my own work, actually, from a long time ago. I was working in juvenile justice, and I developed an intervention program for hardcore juvenile offenders in northern WA. In the baseline year there were 440 juvenile offenses. We introduced the program: there were street workers and recreation activities, I was running a part-time school for these kids, I was organizing voluntary work for them. And two years later, we'd moved from 440 offenses a year down to 350. I thought I was pretty good, and I admit I probably felt a little bit smug about this. So, 440 offenses down to 350, a 20% reduction. Is that a good basis for me to claim success, that my program worked and I deserve credit? It's actually not as simple as it might seem at first. This is what was actually going on: that gross observed change of a 20% reduction is actually made up of three things. There's the net program impact, the actual causal effect of the program that I developed, if there is one. There are also the effects of other events and processes: differential motivation, differential selection, other programs that are operating. And there's also methodological error; perhaps the 440 and the 350 weren't quite right, and there was some error in those numbers. So the name of the game, if you're going to use before-and-after comparisons, is to try to keep methodological error to a minimum, control for the effects of other events and processes, and then what you're left with, the final figure, is your net program impact. That then raises a question: how do you control for the effects of other events and processes? Well, as it says at the bottom of this slide, one way is randomization; the whole point of randomization is to control for these other events. Another way is to use control groups or comparison groups. Another thing you can do is try to identify and measure these other events and processes, and if you can do that, you can control for them statistically. And if you can't measure them, at least you can identify them and try to argue why they are relevant or not relevant in a logical sense.
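Scott's decomposition (gross observed change equals net program impact, plus the effects of other events and processes, plus methodological error) connects directly to the "identify and measure the Z, then control for it statistically" option. Here is a minimal sketch with invented data, using plain least squares rather than any particular stats package: motivation is the measured Z, and adding it to the regression separates the program's effect from the motivation effect.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

motivation = rng.normal(0, 1, n)   # the Z we managed to measure
# Motivated people are more likely to enrol in the program.
program = (rng.random(n) < 1 / (1 + np.exp(-motivation))).astype(float)
# True net program impact is 2.0; motivation independently adds 3.0.
outcome = 2.0 * program + 3.0 * motivation + rng.normal(0, 1, n)

# Naive comparison of participants vs non-participants mixes the
# program effect together with the motivation effect.
print(outcome[program == 1].mean() - outcome[program == 0].mean())  # well above 2.0

# Regressing outcome on program AND motivation separates the two.
X = np.column_stack([np.ones(n), program, motivation])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(beta[1])   # ~2.0, the net program impact
print(beta[2])   # ~3.0, the motivation (Z) effect
```

The naive before-versus-after or participant-versus-non-participant comparison overstates the effect precisely because motivated people enrol more; once the measured Z enters the equation, the built-in program effect is recovered.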
Or, as we often see in international aid, where I work, people just ignore them. Oh well, there's been an improvement over the last ten years, therefore we deserve the credit; and you just pretend that these other events, processes and influences never existed in the first place. I read a major journal study yesterday that had exactly that problem. It said the higher the volume of international aid provided to a country, the better the governance processes were, and the faster the economic growth. And then, in the very last paragraph of the 30-page journal article, it said: of course, there are all these other events and processes that might be operating, but we didn't measure any of them, so who knows. I thought that was a bit of a fudge after having read 30 pages to get there. So I'm arguing that for policy making we want to know what works, for whom, in what circumstances, how, and at what cost, and that what I'm calling interactions are pretty much the norm. By interactions I mean the program works really well for males but not females, or works really well in rural areas but not urban areas, or for this group of people but not that group. You should expect this; it is almost universal, it is so incredibly common. It even happens biologically, which is interesting. Take high blood pressure in the United States: if you're Caucasian, of European descent, you get one drug; if you're African American, you get a different drug, because they work differentially for different groups of people. And it's the same with programs; it's very common. So I'm arguing that in order to conclude that program X causes outcome Y, three criteria must be satisfied: association, time order, and ruling out alternative explanations. For me, this is the absolute heart. So when you're planning your evaluation, I'm suggesting you think about: how am I going to get evidence on association? How am I going to establish time order? What am I going to do to rule out alternative explanations? Think about these three things. Similarly, if you're contracting out your evaluation, you want to see these three things clearly specified in the evaluation plan from your contractor. And if you want to consider the merits of a particular evaluation approach in a given context, say your contractor says, oh well, we're going to do process tracing, you want to ask: what are the strengths and weaknesses of process tracing in our context, in terms of getting those three pieces of evidence? What about contribution analysis, or most significant change; how would they deal with those three things? My argument is that those three pieces of evidence, I'll call them evidentiary criteria, are the main game. And to do this work, I'm offering you some tools. My first tool is the XYZ diagram. I would strongly suggest that every single time you put your mind to the question, does this program cause this effect, you draw an XYZ diagram and you start adding the arrows. And the book I've mentioned, Pearl's, the whole book is just about these diagrams: how you use them, what works, what doesn't work, and what you do when you run into data problems. I would very strongly recommend this always be your starting point.
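If you want to keep these XYZ diagrams somewhere more durable than a whiteboard, they can be written down as directed graphs, the same directed acyclic graphs (DAGs) that Pearl's book builds on. A minimal sketch, assuming the networkx package; the named Zs here are hypothetical examples, not a fixed list.

```python
import networkx as nx

# X = program, Y = outcome, Zs = candidate alternative explanations.
dag = nx.DiGraph()
dag.add_edge("program (X)", "outcome (Y)")          # the claim under test
dag.add_edge("motivation (Z1)", "program (X)")      # Z1 drives participation...
dag.add_edge("motivation (Z1)", "outcome (Y)")      # ...and the outcome
dag.add_edge("other services (Z2)", "outcome (Y)")  # another rival explanation

assert nx.is_directed_acyclic_graph(dag)

# Every node with an arrow into the outcome is an explanation you need
# evidence to rule in or rule out.
print(list(dag.predecessors("outcome (Y)")))
```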
A second tool I'm suggesting is for when you're working with observed changes from time one to time two; there are three bits of information in that picture. You think about how you're going to identify and control for the effects of other events and processes, and at the same time minimize methodological error, so that what you're left with is the net program impact. My argument is this: I don't have a particular love or hatred for any one method. They all have their places, and methodological decisions always have to be context dependent. So arguments that experiments are awful, or qualitative interviews are fantastic, or surveys with statistical adjustments are the answer: I don't like those sorts of arguments, because methodological decisions always have to be made within a context, and what works is going to depend on that context, the data, the maturity of the program, all that sort of thing. What you want to do is select the strongest possible research design that you can, and then add additional research designs, measures and elements until you've adequately covered off the three evidentiary criteria. That, to me, is the name of the game. And this reference here is a real-world example where someone has done exactly that. It's a basic comparison group design, and they added one feature on top of another, on top of another, until they'd done quite a reasonable job of establishing association and temporal order and ruling out the Zs, these alternative explanations. That particular reference is really worth a look. It goes without saying, but I'll just reinforce it: conducting an impact evaluation assumes that there's fidelity of implementation and a consistent intervention. That is to say, there's something stable whose impact we're actually evaluating, and this thing was implemented with integrity. I probably read three or four impact evaluations done by other people every week, and I think there are some common things that get confused. It's important to distinguish between impact evaluation and theory building. These are quite different things. You can do an inductive theory-building evaluation; that's great, you've built a theory, but that's not the same as impact evaluation, because for an impact evaluation you need the three pieces of evidence I've spoken about. Similarly, seeking evidence to support or confirm a theory, again, is not the same thing as impact evaluation. In particular, seeking to confirm a theory does not deal with the issue of rival explanations, the Zs; you're pretending they're not there. Another common one is portraying the views and experiences of program staff and participants as if that were the impact. There's a very large volume of evidence in the field of cognitive bias, and it makes it pretty clear that human beings are absolutely terrible at perceiving causal relationships. I don't mean to say that the experiences of program participants and staff aren't important. They are, particularly for voluntary programs. But their perceptions are not the same thing as program impact. If you collect that information, great: share it as their perceptions, but don't try to morph it into, oh, and that's the same thing as the impact. Another one is the opinion of experts. Martin Ravallion from the World Bank did some work on this with job programs.
There were a few different World Bank job programs, and he got experts, independently of each other, to rate the success, the impact, of these programs without actually having any data, just seeing the programs themselves and their design and that sort of thing. Unfortunately, experts are no better than laymen at perceiving program impacts. That is to say, his experts were basically guessing; they didn't know. And to be fair to them, they can't know: unless you've done an impact evaluation, you're not going to know. There is no basis for knowing. So I'm arguing: let's be careful in our reports and not confuse theory building, confirming a theory, portraying people's views or gathering expert opinion with impact evaluation. The opinions of experts are the opinions of experts; that's not the same concept as program impact. Now, when you're planning your evaluation, there are some standard, quite common points of comparison you can use. You can have treatment versus control or comparison groups. You can compare intervention A versus intervention B versus intervention C; I think that's hugely underutilized, quite honestly. You can have dose-response patterns: some people get no program whatsoever, some get a little bit of it, some get a bit more, and some get a lot, intensively, and you look at their outcomes and how they change. You can look at outcome trends or trajectories over time, how the outcome variable has changed before, during and after program participation; that can often be quite insightful. And you can look at outcomes across different locations, groups and time periods. These are just some common ones that are worth thinking about. I think it is now fair to say that over the last 25 years quite a bit of research has been done, and based on it we can say with some confidence that we now know pretty well what the stronger non-experimental research designs for impact evaluation are. There are regression discontinuity designs, where you have a decision criterion, like people below this score get the program and people above it don't, or people who score high on something get an award or a benefit and people below don't. Wherever we have one of these programs where eligibility is based on a cut-off score, the design is very, very strong, and you can get very confident results from those sorts of studies. There are interrupted time series designs with a comparison group: you've got an ongoing time series, say 15 or 20 time periods, with a program introduced in the middle of it, and you've also got a comparison group that doesn't get the program. Those are also very strong. What the research literature calls multiple baseline designs, and economists call pipeline designs, is when you have a series of phases to a program: in phase one the program goes to area A, in phase two to area B, in phase three to area C, and so forth. Those designs are also really important, because if you track the patterns in the outcomes, you can reach some quite strong conclusions. And there are cohort designs, where you have a group, say final-year high school students, and you compare the outcomes from one year to the next, to the next, with a program introduced somewhere along that pathway.
Then there are alternating-treatment and removed-treatment designs, which you don't see much outside of medicine. With a single person, if they were in need of a particular medication, they could get it and their symptoms improve; you take away the medication and the symptoms get worse. Those support very strong conclusions, because the design helps you establish time order and association, and it does rule out alternative explanations; but it's not so relevant for social programs. And finally, comparative case studies, or dose-response models based upon a well-developed understanding of the program, with pattern matching. These can be quite strong too, but I say be careful, because they require a well-developed understanding of the program, so that you know what alternative explanations you should be looking for and measuring. That way you can compare patterns over time on the different outcome variables; Campbell in his book calls these non-equivalent outcome designs. In terms of general guiding principles: you need high-quality outcome measures. Match groups if you can, program and comparison, on pre-treatment outcomes, to control for self-selection, that is, differences in motivation and incentives across the groups. You want to measure outcomes several times, before, during and after program participation, at least four times in total if you possibly can: before the program, in the middle of the program, short term after the program, and longer term after the program, to get a sense of trends, because that can really help you understand what's going on. And allow for interactions; that is to say, some groups of participants do well and others not so well in the same program. I'm coming up to the end of the points I want to make. I'm saying that method-led approaches to impact evaluation are not the way to go. We all know people who say, right, no matter what the question or the context is, I'm doing a case study, or I'm doing some econometric analysis, or I only want to know about experiments. Recently I've come across a few process tracing advocates who think that's the only approach. I do not believe in method-based approaches, for the reasons I've outlined: methodological decisions need to be context dependent, in relation to the program. The main game is not the method; the main game is how you get these three relevant pieces of evidence. Similarly, I'm not an advocate for program-theory-led approaches. There are a few reasons for that; see Cook's 2000 paper, what do you call it, the false choice between theory-based evaluation and experimentation. Program theories can help you decide what to measure, but unfortunately that's not the same thing as impact evaluation. What John Mayne was doing in contribution analysis was basically wanting to use performance indicators to make impact evaluation conclusions. So what he did was say: okay, you have a theory of change, you measure the different levels of your theory of change, and then you reach an impact conclusion. Unfortunately, it took the field of evaluation nearly ten years to realize that that wasn't enough on its own. After about ten years, Mayne himself started writing about how you also need to rule out alternative explanations for any changes, but he didn't actually offer any views on how you do that. And he never came to grips with the idea of temporal order, that is to say, the program happens first and the outcome has to happen second.
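For readers who want to see what one of the strong designs listed a moment ago looks like in code, here is a minimal sketch of an interrupted time series with a comparison group, using entirely invented numbers. The before-versus-after jump at the program site, net of the jump at the comparison site, is the design's estimate of the program effect; the shared trend differences out.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(20)          # 20 time periods
program_start = 10         # program introduced halfway through

# Both sites share the same underlying trend and noise level...
trend = 50 + 1.0 * t
program_site = trend + rng.normal(0, 2, 20)
comparison_site = trend + rng.normal(0, 2, 20)

# ...but only the program site gets a shift once the program starts.
program_site[program_start:] += 8.0

def jump(series):
    # Mean after the program starts, minus mean before.
    return series[program_start:].mean() - series[:program_start].mean()

# The comparison site's jump captures the trend and other shared events,
# so subtracting it isolates the built-in program effect.
print(jump(program_site) - jump(comparison_site))   # roughly 8
```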
Lemire, in the European Journal of Evaluation, wrote an article about contribution analysis a couple of years ago saying it was a useful adjunct, but on its own it wasn't quite enough to reach credible impact evaluation conclusions. I know some people might not like what I've just said, but my point is not to be driven by particular methods; it's to be driven by our evidentiary criteria: association, temporal order, and ruling out alternative explanations. That, to me, is the main game. I don't mind if you do case studies, econometrics or whatever else; I just want you to make sure that you've got those three evidentiary parts covered off. I'll stop there. I've got a question for us to discuss, but before I do, I'd like to invite Mark to let me know if there are any questions in the chat, if people want to ask any questions before we move on to the next little bit. John Pillow has posed a really good question, as has Ian, so we've got two questions for you. Okay, fire away. I'll read it, and if John wants to come in online and augment, by all means, and the same will apply for Ian. John's question, or comment, is: if the universal effectiveness of an intervention is proven, and it is to be implemented for the first time in a new environment, is program fidelity sufficient to assume that it will also work? E.g. flu vaccinations: no need to re-demonstrate effectiveness, just ensure program fidelity. John suspects there still might be risks. What do you think, Scott? Yeah, I'm with John. I was reading a really interesting article two or three days ago from Howard White of the Campbell Collaboration on this very issue. He would say, well, we've established that the program works in this context; let's treat that as a hypothesis that it might work in this other context we're familiar with, but we can't take it as a given. So we'd better treat it as something we need to test and examine, rather than just saying, because it worked in place A, it's going to automatically work in place B. Maybe the contextual environment is different in the second place, or maybe the implementation will be different in some important ways in the second location. So yes, I think it's a working hypothesis rather than a given. Does that mean we need to collect the same information all over again? Well, in the second place I would be interested in knowing the extent of the need: how do we know how much need there is? Are you trying to service 1,000 people, 10,000, 100,000? In terms of the service delivery mechanism, there could be cultural considerations. If you're trying to deliver vaccines to rural parts of the PNG highlands, refrigeration is going to be a problem; how are you going to manage that? So, not necessarily all the same information, but you need to be conscious of what the risks are. Is there something else I can help with? Okay. Thanks, Scott. So, Ian has a question: perhaps those providing funding for new programs should apply the principles outlined today before they allocate funds. He thinks it may improve the post-program evaluation process, particularly in relation to assigning meaningful timeframes and appropriate funding to assist the production of meaningful outcomes.
What incentives or forms of motivation would you apply to the politicians and funding sources to do more than just throw money at the project, rather than taking a more responsible approach to end-to-end evaluation? Have you any examples where the evaluation process has been clearly defined at the time of the program funding? Oh wow, there are some great issues in that; I love it. Just for context: here in Australia we've recently had the government announce an Australian Centre for Evaluation. The purpose of this Australian Centre for Evaluation is to build capacity in Commonwealth government departments to conduct and use evaluations, and also to do some randomized controlled trials to produce high-quality evidence about what works. I think they've got $10 million over the next four years. Around the world, we've seen a number of different initiatives to drive evaluation. For example, in Australia from 1986 to 1996 we had the Financial Management Improvement Program, which made it compulsory to do evaluations on a three-to-five-year cycle. That worked pretty well in the initial years, and the Department of Finance used the results of those evaluations to adjust budgets: programs that were performing well could keep their money; programs performing not so well either lost money or had to be redesigned somehow. It took about three years for government agencies to understand what was going on, and after that, every evaluation said "my program's fantastic", because they were trying to protect their budget. It has been a dilemma, how to get the political system to value evidentiary information in the decision-making process; every Western country, and some others, are struggling with this. The World Bank has, gee, 45 years of experience in evaluation capacity building around the world now. And there are probably three or four areas you can focus on. One is leadership demand for, and ability to make use of, performance feedback. The second is the supply of performance feedback and evaluation work. And this is all underpinned by what I'll call the institutional infrastructure: your policies, your systems, your resources and so forth. Just about every country around the world that has got into evaluation capacity building has done the same thing: they train staff, they form a community of practice, and they produce more evaluation reports. That's just about what every country has done, and it has failed consistently around the world for the last 30 years, and yet we keep doing the same thing. The countries that have made good progress, Chile 15 years ago, Malaysia 10 years ago, South Africa right now as we speak, all have one thing in common, and that is senior-level commitment to the evaluation process. Government and MPs are demanding evaluation, and they are using it for their own purposes. That is a consistent finding around the world. Let's go further. Chile is really quite interesting, because what they did was this: high-profile impact evaluations with major policy or potential budgetary consequences were controlled by the President's office. They did them. They often used academics or consulting firms, but they wrote the terms of reference, they oversaw the work, they funded it; they did it. And then they would let operating departments do the implementation-type studies. That sort of distinction I think is quite good, because it helps to align the incentives.
It continues to surprise me that we can have major departments, and I've worked in several of them as a consultant so I'm not going to name them, that don't have an evaluation plan. I know their staff capacities are limited, and their systems aren't necessarily set up for this. But there are things that can be done. For example, all new policy proposals that go to cabinet could have an evaluation requirement in them: yes, this is a new policy proposal, we need this much money for the program, here's how much we're going to set aside for the evaluation, and we're going to do our first evaluation here and a second one there. That could be part of it. Why aren't secretaries held to account for major programs? I can think of a couple that are worth more than $100 million, have been operating over 20 years, and have never been evaluated. How is it that secretaries aren't held to account for that? So yes, it's a combination of incentives and capacity gaps that I'm talking about in general, and we have a lot of international lessons. But there's that marrying of the political agendas and the more technocratic, bureaucratic agendas: how do you get them to speak to each other? That's fascinating. I know I've rambled a little bit, but it's an interesting topic. Thanks, Scott. I liked what you were saying there, particularly about how we get senior leadership to take some responsibility; it's always been a problem. We've got two other questions. One is: could you please repeat the author you just mentioned, who said to treat a program that works in one context as a hypothesis in another? That's Howard White, W-H-I-T-E. He used to be the head of 3ie, the International Initiative for Impact Evaluation; they do all sorts of impact evaluations around the world, and then they say, right, here's what works in WASH, here's what works in education. He's now head of the Campbell Collaboration, and he publishes blogs and websites. He writes some really good stuff, often from an econometric perspective, but his commitment to evidence, and to using it to inform decision making, is beyond question in my view. Okay, thank you. That's useful. Emma Freeman has asked: what can you do if it's very difficult to rule out alternative explanations, in a very complex environment where there might be many factors involved? And I was going to ask something similar: what about the motivation side of things? How do you know, or rule out, whether it's motivation causing a result? Because without a program, the outcome perhaps wouldn't have been achieved, even if people are highly motivated. So, over to you, Scott. Yeah, I'll answer in a couple of different ways. I'm not trying to be an advocate for experiments, because I just don't want to be in that position and it's not my belief, but having said that, I think we don't make nearly enough use in Australia of pilot projects and pilot programs, where we test things early. We don't do that nearly as much as we could. Another example: there's a research design called the factorial experiment, where you can have multiple different combinations of the main intervention plus various combinations of supplementary bits. If you could implement one of those and test it, that would give you a nice sense of what combination of factors works and what doesn't. And if you have the money and can do that in two or three different types of locations, like a major urban area, a rural area and a remote area, that will give you some sense of what works.
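A minimal sketch of the factorial idea Scott has just described, with invented components and effect sizes: every combination of two supplementary components is fielded, and each component's main effect is estimated by averaging over the levels of the other.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)

# A 2x2 factorial: every combination of two hypothetical programme
# components, e.g. training (on/off) and mentoring (on/off).
n_per_cell = 500
true_effect = {"training": 3.0, "mentoring": 1.5}   # invented for illustration

cell_means = {}
for training, mentoring in itertools.product([0, 1], repeat=2):
    outcome = (10.0
               + true_effect["training"] * training
               + true_effect["mentoring"] * mentoring
               + rng.normal(0, 2, n_per_cell))
    cell_means[(training, mentoring)] = outcome.mean()

# Main effect of each component: average outcome with it on,
# minus average outcome with it off, averaged over the other component.
training_effect = (np.mean([cell_means[(1, m)] for m in (0, 1)])
                   - np.mean([cell_means[(0, m)] for m in (0, 1)]))
mentoring_effect = (np.mean([cell_means[(t, 1)] for t in (0, 1)])
                    - np.mean([cell_means[(t, 0)] for t in (0, 1)]))
print(round(training_effect, 2))    # ~3.0
print(round(mentoring_effect, 2))   # ~1.5
```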
But it doesn't let you off the hook of still having to confirm things in the local context; it gives you a start, though. My diagram on gross observed change was about exactly that issue: okay, maybe there's a program effect, but then there's the influence of other things. And I gave you a list of what to do about the other things. One way is randomization; that's not always possible, feasible or desirable. Another way is to identify them and measure them, and then you can control for them and eliminate their effect. Take that job training program for young people: if, before they went into the program, you had a measure of their motivation and their academic background and skills, you could then figure that into the equation, alongside the effect of the program, the effect of motivation, and the effect of education and skills on job outcomes. So: randomize, or measure what you can; if you can identify and measure these things, you can control for them. Another way is to use logical argument. Campbell argues about this in his various books. He would say, well, if the effect was not the program but motivation, we would expect to see the following pattern of consequences on different variables. And he invites people to articulate what that pattern of effects would look like if it wasn't the program but a motivation effect, and then go and look to see whether that's what happens. It's a little bit like what process tracing people try to do when they look for a series of logical relationships: this affects this, which affects that. But it's a heck of a difficult issue, there's no doubt; these alternative, rival explanations are by far and away the most difficult thing to get a handle on. So there's the logical argument approach, where you say, well, for these logical reasons I don't think motivation is a factor here. Oh, why would you say that? Well, we have a mixture of highly motivated and less motivated students in our program, or whatever the argument might be. And then the last one, which really annoys me, is when researchers and evaluators just ignore these alternative explanations and pretend they don't exist. I think that's doing a disservice to everyone. At least acknowledge in your report that they are potential explanations, even if you couldn't adequately deal with them. Thanks. Okay, thanks Scott, that's a fair comment and a good response, so appreciated. We haven't had any more questions; we did get a thumbs up from Emma on that, so that looks good. You had a question that you wanted to pose to us; over to you, Scott. Maybe I'll just, since I think we've got a moment, say something very briefly about experiments and ethics, because people often wonder about that. People will say randomization is unethical and improper. I don't share that view, and I want to explain why. If we knew a program worked, I would agree that it would be unethical and improper to deny people access to that program. But most of the time we don't know whether a program works or not, in which case it seems to me quite reasonable to have an experimental group that gets the program and another group of people who don't. And imagine this: what if the program is neutral, or actually hurts people? That happens; surgery for ulcers is an example. In that case, not getting the program isn't unethical. It's actually an advantage.
I can give you another example. Imagine that you have 10,000 people who really need a program, but you only have enough resources to service 1,000 of them. In that case, randomly allocating who takes up the program is actually probably the most appropriate thing you can do. When people say experiments are unethical, that argument assumes that the program works, which may or may not be true. And if we believe the program works, well, then we don't need to do an experiment anyway, because we already know. So I find that the people who use that argument haven't quite thought it through completely; but it's complex. Here's the discussion question I was going to invite people to wrestle with. This is true: in 2006, the World Bank found that better monitoring and evaluation is associated with improved outcomes in their development projects. So my question is: does this mean that M&E is an effective way to improve development results? I want to invite people to wrestle with this issue, and my approach, always, is to ask you to think in terms of X, Y and Z. In this case, the intervention is M&E; M&E is your X. And improved development results is your Y. I've used words, but what I've said is there's a relationship between M&E and improved development results. Does that mean one has caused the other? How do you want us to contemplate that question, Scott? Do you want us to break out into some groups, or just have people turn their microphones on and offer up a comment? Maybe we should give people two or three minutes just to mull it over, and then we could have a group chat. That would be good. And while people are mulling it over, I notice John Pillow has posted a comment suggesting that what you've described is what prosecutors might be doing, particularly when they only have a circumstantial case: they need to prove association, establish time order and rule out alternative explanations, and have a logical and proven sequence of events that leads to the conclusion that the butler did it. That's an interesting observation, because Michael Scriven runs this argument that evaluation is like detective work. I don't agree with him. He'll use the example that somebody's dead and another person is holding a smoking gun, and we can use that sequence of events to reasonably conclude that the person holding the smoking gun fired the bullet that killed the dead person. My problem with that is that's not how evaluations work. He starts his argument with a known effect, a dead person, and a potential cause, a person holding a smoking gun. Our work as evaluators is exactly the opposite: we start with a known program and an unknown effect. Scriven's argument runs the other way, from a known outcome to a possible cause. So I actually don't like his argument on that one; but anyway, lots of people disagree with me on that. I can see the logic; no, fair enough. Mark, shall we give people, I don't know, two or three minutes to mull over this issue, and then we'll come back and kick it around? Please do. So, Vicks, you've got a question, and then Emma. It was a comment, thank you. Hi, I'm Catherine Pontifix; I'm here on Kaurna land this evening. I guess the Z that immediately came to my mind was that an organization that is sophisticated in its M&E arrangements may broadly be a more capable, more sophisticated organization, and therefore implementation may also be effective as a consequence of that organizational capacity and capability. That's the Z.
I just wanted to say, I work in an epidemiology branch, and they use directed acyclic graphs. I had to look it up, but we all call them DAGs, and I find them a really useful way, just as you've done with your X, Y and Z, to contemplate all the possible Zs and whether they really matter; rather than throwing everything into a statistical mix to eliminate confounders, throwing in the kitchen sink, you're being intentional about what you think the possible confounders might be. So if anyone wants to do some reading, search for "epidemiology DAG"; they're quite a good tool. Yes, the book reference I gave you, from Pearl, that's the source; that's where DAGs come from. He's the guy who originated this type of thinking. And I agree, I really like what you said about being purposeful in identifying what these Zs, these alternative explanations, might be, and then trying to nail down the evidence: if this alternative explanation is the one that's really driving things, I would expect to see this consequence, but if it's something different, I'd expect a different pattern of results. So I really like how you put that. Thank you. Okay, Emma. I had the same thought as Catherine, which was that there was likely a shared cause of the two. And the only other thing I had to add is that the time order of these things might suggest that better monitoring and evaluation doesn't cause improved outcomes. And that's the mental process I'm always applying. Every time I have one of these situations, I literally get a piece of paper and I draw X, Y and Z, and I start asking, well, which one's X and which one's Y? In this case, I'm saying M&E is associated with improved outcomes, so M&E is your X and improved outcomes is your Y. What does that mean for time order? Does one necessarily occur in time before the other? Or do you get better outcomes, then better budgets, and then more to spend on your M&E? Does it work that way? Or does good M&E happen first? We don't know, because I didn't tell you; and I didn't tell you on purpose, just to make you think it through. Or maybe there's something else that drives them both. So thank you for your comment; that's a really good, structured approach. Other views or comments people want to make? I haven't seen anyone; please feel free to come off mute and make a comment or ask a question. I certainly wonder whether evaluation results in the collection of better data, particularly if it's planned and funded, and therefore you can actually see your outcome. Yeah, it used to drive me nuts when I worked in DFAT, the number of development projects that we never collected baseline data on. We'd be five years into a project, and they'd hire a firm to do an implementation and outcome study, and they'd go, yeah, but we really don't know what the change is. My analogy is: I've been on a diet for five years, but I've never weighed myself until five years in; it's kind of hard to know if I'm improving or not. Yeah, agree. Oh, we have a hand. Yes, Melinda, please ask your question. I'm feeling like this might not be correctly philosophically pitched, but we've talked before in research about the Hawthorne effect, where by virtue of participating, by virtue of asking particular questions or engaging with the subjects or recipients of a particular program, the fact of the questions themselves, or the fact of creating reflection, can sometimes change the trajectory of outcomes.
So when I look at this question, what comes to mind for me is that a lot of people really emphasize the notion of the MEL cycle. I mean, if you only talk about impact right at the very end, then yes, there are challenges with talking about correlation. But if the M&E activities are occurring through the course of a program and people are getting feedback, I would hope that there are improvements in what's going on as people reflect on what they're doing. But I don't know if that's kind of missing the point philosophically. No, no, you're in the right neighborhood. I too hope that better M&E leads to better outcomes because people are getting feedback and using it. But how do I know the M&E has caused it, and not something else? I'll tell you the answer. In the World Bank, they have these big databases; they rate their projects every year until they've finished, and they track outcomes and effectiveness. So the researchers had a database of 5,000 projects. And what they found was, yes, there was a strong relationship between the quality of M&E and improved development results all around the world. But the thing was, there was a Z, and the Z was the skills and experience of the team leader leading the project. Highly skilled team leaders had better M&E, and they had better development outcomes. There was a small effect from M&E on its own, but most of it was driven by the Z, the team leader's skills. So does that mean, then, that you reframe the variables that are related to cause? Do you know what I mean? With Z, you're suggesting that Z is influencing both those things. But is there another way of framing it, if you were to move forward, to say that people with strong management and evaluation skills are going to see better outcomes in their programs in the first place, or something along those lines? And that the M&E is a variable, an actual component, in that success? Yeah, if you wanted to run the argument that the key to success is highly skilled team leaders, and highly skilled team leaders have A, B, C and D skills, including being good at M&E, and that's where you get good development, I'd accept that argument. Does that mean you could consider, if you were thinking in a clinical way, that M&E is potentially like a placebo effect? Would we need to run a trial where M&E is conducted but there is no program, to see whether we still get a positive outcome? Oh yeah, I see your argument. There was a small independent effect of M&E leading to better outcomes that was independent of the team leader. But it's interesting, because it was the team leader who would make sure M&E was in place, and also the team leader who would make sure the results of the M&E feedback were used. So in that sense the team leader was central to the whole thing, but there was a small separate effect that was independent of the team leader; it was only small. Okay. Any other questions? It's Ian here. I can't show you the picture because I'm in a state of undress, but that is it. Thanks, Ian. Too much information. My question... first of all, Scott, absolutely brilliant; I've never spent a better hour on a webinar in my life, really fantastic stuff. My question to you, though, is along these lines: everybody listening to or participating in this webinar, we all do things differently.
And if we talk about good M&E team leaders, I would have thought that we'd have a high number of motivated people teaching and training everybody to be the same, but as we all know, they're not. So the biases of these team leaders come into play as well. So in your Z factor, I'm taking into consideration who was available at the right price to do the particular study. That sounds cynical, but I have an accounting background, so I tend to be cynical. I take your point, yeah. In which case, how do you determine what a good M&E leader is, and is that your bias coming into play? Yeah, no, that's a fair comment. I haven't given you all the background, just to save time. But the World Bank, just like DFAT actually, and the Asian Development Bank, they all do annual ratings of all their investment projects, and they rate a whole set of things: the relevance, the efficiency, the effectiveness, the sustainability, the monitoring and evaluation. So this research team, with 5,000 projects to look at, had all that data to work with, and then they themselves went and developed a measure of the skills and experience of team leaders and brought that into the equation. So in that sense, there were pre-existing ratings on the projects, but the data on the team leaders' skills and experience was brought in after the fact. So yes, there are some potential biases in there. Can I just add to this? Sorry. Would the countries these programs are being performed in also impact on how the team leaders perform, because the team leaders would be working with particular countries? Well, that was the most fascinating part. The impact of the team leader was far and above the impact of which country was involved. Whether it was China versus Pakistan versus Laos was actually much less important than the team leader's skills. And I wouldn't have guessed that. That, to me, was the huge insight: that the skills of the team leader were the overwhelmingly important factor driving good outcomes. I didn't expect that. Thank you. So if the skills were the key impact for the team leader, were there one or two particular skills that were identified as being essential? Yeah, I think there were, but to be honest, I can't quite remember what they were. You're right, though, they did talk about that; sorry, I just can't remember the answer. That's fine, no problems. Any other questions or comments? And thank you, Ian, for your last question. Well, can I say, I'm going to offer to share a copy of my PowerPoint with people. I also have a bibliography of references on impact evaluation and a little thinking piece about things to keep in mind when you're planning an impact evaluation. I'm going to offer to send these as PDFs to Mark, and maybe you can liaise with the AES; I'm happy for them to be shared with everyone who registered. That's a very kind offer, and it is appreciated; I'll make sure that happens. So thank you, Scott. If there are no more questions or comments, I'll just double-check the chat. All good; you're getting some thank-yous in there. I'd like to thank you for the wonderful presentation, Scott, and the thought-provoking questions. I know it's difficult for you to see rounds of applause, but if people give a thumbs up or put a thanks in the comments, that'd be great.
Or a clapping hand. The time you've given today, and the time you spent preparing for this, is greatly appreciated. I think we've all learned something; you've made us think, and that's what we came here to do. Many, many thanks. And I'll make sure that we disseminate the information you've kindly shared with us. And to those of you who posted questions or came online to speak: thanks for your participation, because that helps make the event really worthwhile and engaging.