Okay, Paul, do you think we are ready to go? Maybe one short introduction on the meeting first. So we started the meeting last year; as you know, that was the first edition of the Causal Data Science Meeting, and this is the second one. And yeah, we are really excited to see how this is growing. It once was an idea to start a 20-30 person workshop, and now we had, at the peak, 1,300 registrations. So we're really on a journey, also thanks to you, Guido. With the Causal Data Science Meeting we want to advance the connection between academia and industry: to have more conversation, to help each other, and to advance the research as well as the practical frontiers. From our side we try to do that by working on research projects with colleagues, by providing workshops with academics and industry, and also, soon, by offering a little free tool for practitioners across the data science canvas. So we are going step by step and trying to make our little contribution with the meeting.

But for now, we are extremely excited to welcome you, our special guest. You have been a professor of economics at Stanford University since 2012. You earned your PhD from Brown University in 1991, and you have served as the editor of Econometrica since 2019. In 2021, you were co-awarded the, now I need to get this right, Nobel Memorial Prize in Economic Sciences for your methodological contributions to the analysis of causal relationships, jointly with Joshua Angrist and David Card. And by coincidence, you were born just a one-hour drive from Maastricht, one of the co-hosts of the meeting: you were born in Geldrop, in the Netherlands. With that being said, we are very thankful that you are taking your precious time to join us. The stage is yours.

Thanks, Paul. Thanks for inviting me. I hope you can all hear me okay. I'm very glad to be here. It has obviously been a very busy month, this last month, but it's very nice to be back to doing presentations and back to doing work. The theme of this paper actually fits in very well with what you were describing the whole conference is about. I want to talk about some aspects of synthetic control methods. That's a very interesting area, in that the applications have been far outpacing the theoretical work. In economics it started with Alberto Abadie's work in 2003, where he tried to estimate the causal effect of terrorism on the Basque Country, and he wrote further papers in 2010 and 2015. But there hasn't really been a huge amount of theoretical work on this in statistics and econometrics, while in industry there is a huge number of applications of these methods. There are lots of cases where this is a very natural set of methods to use, and as a result people are really using these things very widely, even though that doesn't necessarily show up in a lot of the academic work. And I should say this is joint work with two students of mine at Stanford, Lea Bottmer and Merrill Warnick, and a colleague of mine at Stanford, Jann Spiess. Lea Bottmer was actually a student in Maastricht before she started the graduate program here at Stanford. So here we're going to look at a particular aspect of synthetic control methods, relating them to randomized experiments, and see what synthetic control methods have to offer there.
Here's an outline. I want to first make some general comments and show where this is going. Then I'll talk a little bit about synthetic control methods in general, the way they were introduced into the econometrics literature by Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Then I'll talk about the design aspect of the paper: what the properties of synthetic control methods are if you actually have random assignment of units to the treatment. We'll see that the properties there are not necessarily as attractive as you might think. But then we show that by a minor modification of the synthetic control estimator, you can get back some of the attractive properties that the difference-in-means estimator has under random assignment. And we show that in the end this gives us an estimator that, in arguably realistic settings, is going to do much better than the simple difference-in-means estimator.

As I said at the very beginning, synthetic control methods have become very widely used in recent years, very quickly after their introduction really. They're often used in a setting where we have a relatively small number of units, where we observe both pre- and post-treatment outcomes for these units, and where often there is only a single, or a very small number of, treated units. One of the earliest applications was where Abadie was interested in estimating the effect of terrorism on the Basque Country. He had one region in Spain, the Basque Country, that had been exposed to terrorism, and a bunch of regions in Spain that were not exposed to terrorism, at least not to the same degree, and he was trying to estimate what the causal effect of that was on the Basque economy, using data on GDP both before and after the terrorism, plus information on other characteristics of these regions in Spain. Another canonical example is a paper by Abadie, Diamond, and Hainmueller where they look at the effect of anti-smoking legislation in California; they have data on smoking rates both before and after the regulation in California, as well as for all the other states in the United States. And the third canonical example, also in a paper by Abadie, Diamond, and Hainmueller, is estimating the effect of the German reunification on West German GDP. So the question is: had East and West Germany not come back together, what would West German GDP have been, relative to what it is now, given the reunification? They have data on West German GDP both before and after the reunification, as well as data on per capita GDP for a bunch of other OECD countries. In all these examples there is a relatively small number of units, there is only one treated unit, and there is a modest number of pre- and post-treatment periods. And all these examples are observational studies; there was no randomization. The question I want to talk about today is whether synthetic control methods have a role to play when we actually do randomized experiments.
I'm going to look at the properties of the standard synthetic control estimator, and I want to see whether, when we do randomized experiments, we can maintain the guarantees that the conventional estimator enjoys under randomization, and ultimately do better; in other words, whether we should still look at synthetic control estimators even though they were developed for observational studies. Let me give a preview of some of the results. It turns out that randomization doesn't really validate synthetic control methods. In general, the standard synthetic control estimator is biased even if you have complete randomization, meaning random selection of the treated unit as well as random selection of the treated time period. It's obviously very rare to actually have both, but even in that, arguably the most favorable case, the synthetic control estimator is in general biased, and it can be substantially biased. But we can modify it. We can ensure that it's unbiased under randomization by imposing a particular restriction on the synthetic control weights. That gives us an estimator that, like the difference in means, is unbiased under randomization, and we can do inference for that estimator: we can characterize the variance, and we have an unbiased estimator for that variance. That variance estimator is a little funny in the sense that it's not guaranteed to be non-negative, but in practice that turns out not to be much of a problem; there is some sense in which it's not possible to get an unbiased estimator without allowing for the possibility of a negative variance estimate. On the positive side, it turns out that in realistic settings the root mean squared error of this modified synthetic control estimator is much better than that of the difference-in-means estimator. I'm going to illustrate that and give what is essentially the bottom line of the talk. We do an artificial experiment in a setting where the units are the 50 US states, we have data for 40 years, up to about 2020, and the outcome is the average log wage by state and year.
What we do in the simulation exercise is repeatedly pick one unit at random from these 50 states and pretend it is being treated in the final period. We use either the synthetic control estimator, or the difference in means, or our proposed modified synthetic control estimator; we take one of these estimators and estimate the effect of the treatment. We know there was no actual treatment, so the true treatment effect is zero; we compare our estimate to zero, square the error, and average that over all 50 states. If we do that, the difference-in-means estimator is by design exactly unbiased, and it has a root mean squared error of 0.105. The synthetic control estimator has a slight bias in this case; the nature of the bias depends on the data, and in fact, depending on the data, the bias can be much bigger, so in general it is not unbiased under this randomization. But its root mean squared error is actually quite a bit smaller than that of the difference-in-means estimator, about half. So we clearly do much better using the synthetic control estimator, but we don't have guarantees there. Then we can modify that synthetic control estimator, and I'll show later exactly what we do there; if we use that modified synthetic control estimator, we get an estimator whose bias is exactly zero. [Can you still hear me? Yeah, we can hear you, but your video is gone. My video is gone? Okay, sorry, I think that happened when I was writing on the screen here.] So this proposed estimator has a bias that is exactly zero, but it still maintains the improvement over the difference in means, with a root mean squared error of 0.048, again less than half that of the difference-in-means estimator. What I'm going to do in the rest of the talk is discuss synthetic control estimators in more generality, where the bias is coming from and what the properties are under randomization, and then introduce this modification that ensures unbiasedness under the randomization distribution but in many cases does much better, in a root mean squared error sense, than the difference in means.
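To make the placebo exercise concrete, here is a minimal sketch of that loop, using a synthetic stand-in panel rather than the actual state-by-year wage data; `diff_in_means` here is the simple cross-sectional comparison used in the talk, and the names are illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 40))  # stand-in panel: 50 states by 40 years

def diff_in_means(Y, i):
    # treated outcome in the final period minus the mean of the
    # other units in that same period
    return Y[i, -1] - np.delete(Y[:, -1], i).mean()

def placebo_bias_rmse(Y, estimator):
    # pretend each state in turn was treated in the final period;
    # there was no treatment, so the estimate itself is the error
    errors = np.array([estimator(Y, i) for i in range(Y.shape[0])])
    return errors.mean(), np.sqrt((errors ** 2).mean())

bias, rmse = placebo_bias_rmse(Y, diff_in_means)
# averaged over all 50 placebo assignments, these errors sum to zero
# by construction, which is why the difference in means is exactly
# unbiased in this exercise
print(bias, rmse)
```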
So let me now step back a bit and look at the general setup. I'm going to look at a case where we observe, for N units and T time periods, some outcome Y_it, indexed by the units and the time periods, and in some time periods some units are being treated. I'm mainly going to focus on the case where that's just a single unit in a single period; in practice it's typically the last period, and units move into the treatment but don't move back out, but in general it can be more complicated. So I'm going to look at the case where there's a single pair that is being treated: there's a single element of this N by T matrix W that is equal to one, and all the other elements are equal to zero. That allows us to simplify some of the things; in the paper we have some results for more general cases, but a lot of the intuition and the insight comes from looking at this case of a single treated unit-time-period pair. The rows of Y and W correspond to the units, in my running example the 50 states of the US, and the columns correspond to time periods, years in that example. These applications are often characterized by both N and T being relatively modest; these are not big data settings. The observations themselves may be averages, for example in the smoking case, and in the application here it's average log wages, which could actually be based on a large number of individuals, but in the end, when we apply these methods, both N and T are relatively modest, and that's going to matter. Moreover, N and T are often approximately the same size, and that also affects our ability to use particular methods: if we try to predict conditional expectations of outcomes in the last period given the past, we're doing this in a setting where the number of observations and the number of predictors are of a similar magnitude, and so we're going to need to worry about regularization.

Thinking of this in terms of potential outcomes, I'm going to assume that there's no interference between the units: which particular unit gets treated doesn't affect outcomes for other units. And there are also no dynamic effects: only in the periods during which a unit is treated is there a causal effect of the treatment. Both assumptions are obviously very strong and not very realistic. There are things you can do in both cases, and the insights I'm going to talk about today are not changed by having those complications, but they're harder to see. If you think about the Basque Country example, clearly the terrorism in the Basque Country also affected things in other parts of Spain, and when the German reunification happened, that had effects on other countries. We're going to ignore those effects here and make SUTVA-type assumptions that other units are not affected by these treatments. Moreover, we're going to assume that the effects are not dynamic, that they don't last multiple periods. Because in most cases the treatment happens in the last period, allowing for dynamic effects would just change the interpretation of the results; it's not really going to change what you actually do. But for some of the insights I want to talk about, it's important that we allow for the possibility that the treatment happens in earlier periods, and to discuss that it's easier to first look at the case where there are no dynamic effects. This is a very active literature, and there's a lot of work going on looking at more complicated settings, including allowing for dynamic effects; in practice that's very important in many cases, though not in all. There are a lot of cases where the effects really are just contemporaneous and have no dynamic component, and that's what I'm going to focus on here.

So given these two major caveats, we have these two potential outcomes, and we're going to be interested in estimating the average effect for the treated. We could look at other effects, but for expository purposes it's much easier to focus on the average effect for the treated. Given that we see the outcome for the treated unit-time-period pair having been treated, what we need to do is just impute the single missing value, the single question mark in the Y(0) matrix, given all these other data. We may also have other characteristics, other covariates; I'm going to abstract from their presence as well because it doesn't change the conceptual insights I want to talk about here. So the statistical problem is imputing this question mark given all the other observed values of Y(0). I'm not going to use Y(1) directly to do that imputation; I'm just going to use the observed Y for the treated unit to estimate the average effect. It's all about imputing this missing value given the other observations on Y(0).

To think about that problem, one standard approach is to use unconfoundedness-type methods, where we take the rows of this matrix as the units of observation and take as the outcome the period in which the unit was treated; if that's the last period, we take that column of values as the outcome and estimate a predictive model using the previous years as the predictors. That goes by many different terms: unconfoundedness, ignorability, the conditional independence assumption; you can use the backdoor criterion. We're trying to predict the values in the treated period using the units that are not treated in that period and the outcomes from previous periods, and we then impute the missing value for the treated unit using that estimated model. If we use a very simple model, just a linear regression, that amounts to running a regression, on the control units, of the outcome in the treated period on all the lagged outcomes, and then using that to predict the missing value for the treated unit. Of course, in the modern literature people use much more sophisticated doubly robust methods, but what I want to stress is that in a regression setting you would be running a regression with N_c observations, the number of control units, here N minus one, and T regressors: the T minus one pre-treatment periods plus an intercept. So that's all good.
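As a concrete picture of this regression version of the unconfoundedness approach, here is a minimal sketch, assuming the single treated unit is treated in the last period. With N and T of similar magnitude the plain least-squares fit is ill-posed, which is exactly the regularization worry just mentioned; the minimum-norm solution from `np.linalg.lstsq` stands in for the regularized or doubly robust estimators used in practice.

```python
import numpy as np

def unconfoundedness_impute(Y, treated_unit):
    """Impute the treated unit's missing Y(0) in the last period by
    regressing last-period outcomes on earlier outcomes, fit on the
    control units only."""
    controls = np.delete(Y, treated_unit, axis=0)  # (N-1, T)
    X = np.column_stack([np.ones(len(controls)),   # intercept plus
                         controls[:, :-1]])        # T-1 lagged outcomes
    y = controls[:, -1]                            # last-period outcome
    # N-1 observations on T regressors: rank-deficient when N ~ T,
    # so lstsq returns the minimum-norm coefficients
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    x_treated = np.concatenate([[1.0], Y[treated_unit, :-1]])
    return x_treated @ beta                        # imputed Y(0)

# treatment-effect estimate: observed outcome minus imputed Y(0)
# tau_hat = Y[i, -1] - unconfoundedness_impute(Y, i)
```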
But there's another way you could do this, which is essentially flipping the roles of the columns and the rows of that matrix, and that corresponds to the synthetic control estimator. There we estimate the treatment effect by taking the outcome for the treated unit-time-period pair and subtracting a weighted convex combination of the outcomes for the other units, where we choose the weights so that in all the other periods the outcome for the treated unit corresponds to the weighted average of the outcomes for the control units. You can imagine that if you had enough control units, you would just pick one control unit, or a couple of control units, that look exactly like the treated unit in the pre-treatment periods. But that's difficult if T is relatively large and the number of control units we're trying to match with is relatively small; in these settings where N and T are of the same magnitude, it's going to be very hard to find a single unit that looks exactly like the treated unit. Think about the German reunification case: if we look at the time path of GDP, it's going to be hard to find a country that looks exactly like Germany over the last 40 years of per capita GDP. So what Abadie and his co-authors said is: instead of looking for a single unit, let's look for an artificial, synthetic, kind of Frankenstein-type version of Germany that is a combination of these other countries, a little bit of the Netherlands, a little bit of Austria, Japan, a little bit of the UK, so that that combination matches Germany well. Obviously that gives you a lot more degrees of freedom, so in principle you're going to do much better in terms of finding a good combination. The question is exactly how you implement choosing those weights, and this is why I first started with the unconfoundedness case: you can write the characterization of the weights that Abadie, Diamond, and Hainmueller used as the solution to a regression problem, where now, instead of using the control units to estimate a regression in which the outcome is the last-period outcome and the regressors are the previous-period outcomes, we run a regression in which the time periods are the units of observation: the outcome is the treated unit's outcome in those periods, and the control units' outcomes in the same periods are the regressors. So, back to the matrix Y: now the units of observation are the time periods, the outcome is just the treated unit's row, and we run this regression with T minus one observations and N minus one regressors. In the original Abadie, Diamond, and Hainmueller paper they then impose a couple of restrictions on these coefficients; they don't just run this least squares regression. They impose the restriction that all the weights are non-negative, so you use a convex combination rather than an arbitrary linear combination. That's a very clever restriction because it plays a couple of different roles. It regularizes the estimates: instead of getting very wild weights, you restrict them all to be between zero and one, which ensures they don't get too large. They also impose the restriction that the sum of the weights is equal to one, so it is strictly a convex combination with weights summing to one. That regularizes things, and it also leads to estimates where most of the weights are typically equal to zero, with just a couple of weights strictly positive. So if you're trying to construct this synthetic version of Germany and you start with 40 potential control countries, you don't give positive weight to all of them; you end up with positive weights for just a couple of countries. And you can actually look at those and see whether the countries you end up with make sense. In that particular case, for example, they end up with positive weight on Austria, Switzerland, the Netherlands, countries that make a lot of sense from a substantive perspective if you're trying to impute what's happening in Germany.
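Here is a minimal sketch of those weights, cast exactly as the regression over time periods just described: non-negative weights on the control units, summing to one, chosen to match the treated unit's pre-treatment path. It uses a generic solver rather than the authors' implementation, so take it as illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def sc_weights(Y, treated_unit):
    """Convex weights on the control units that best match the treated
    unit's outcomes in the pre-treatment periods (all but the last)."""
    y_pre = Y[treated_unit, :-1]                      # treated, T-1 periods
    Y_c = np.delete(Y, treated_unit, axis=0)[:, :-1]  # controls, T-1 periods
    n_c = Y_c.shape[0]

    def pre_fit(w):
        # squared error of the synthetic match, pre-treatment only
        return np.sum((y_pre - w @ Y_c) ** 2)

    res = minimize(pre_fit, np.full(n_c, 1.0 / n_c),
                   bounds=[(0.0, 1.0)] * n_c,          # non-negative weights
                   constraints={"type": "eq",
                                "fun": lambda w: w.sum() - 1.0},
                   method="SLSQP")
    return res.x

# estimate: treated outcome minus the synthetic control, last period
# w = sc_weights(Y, i)
# tau_hat = Y[i, -1] - w @ np.delete(Y, i, axis=0)[:, -1]
```

In practice the solution is sparse: most weights end up at the zero boundary, which is the sparsity just described.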
But what I want to stress here is that this synthetic control regression approach exploits a very different pattern in the data than the ignorability or unconfoundedness approach. The synthetic control approach says we think there is a stable relationship between Germany and these other countries, a relationship that is stable over time. The unconfoundedness regression says there is a stable relationship between the outcomes in the last period and the earlier outcomes, a relationship that is stable across all countries, the same in Germany as in Switzerland and Austria and so on. So they exploit very different patterns in the data. We are very used to thinking of having a large number of units and assuming that these are somehow exchangeable; synthetic control methods don't treat the units as exchangeable. They say, well, California and New Mexico are more similar over time, in every time period, than, say, California and Delaware, or California and Florida. So it looks at a different structure in the data. Now, this is actually not the way Abadie, Diamond, and Hainmueller originally wrote their estimator, because they immediately incorporated the presence of other covariates. But it's useful to write it this way because it suggests you could generalize that regression by allowing for an intercept, estimating the model with an intercept and making it a little bit more flexible, while still imposing the restrictions that the weights are non-negative and sum to one.
Now I'm going to look at these estimators, as well as the difference-in-means estimator, not under model-based assumptions but under assumptions on the assignment mechanism. To do that, exploiting the fact that there's just a single treated unit and a single treated period, I'm going to write the matrix W as the outer product of two vectors, U and V, where U is the N-vector of unit assignments, picking the unit to be treated, and V is the T-vector of time assignments, picking the particular time period subject to the treatment. The estimand I'm going to focus on most is just the average effect for the treated: I pick out the treated unit and the treated time period and look at Y(1) minus Y(0). Even though this is written as a sum, in the end it's just the treatment effect for the single treated unit-time-period pair. You could look at other targets: the overall average effect, or the average effect for all N units in the treated time period, tau with superscript V, or the average effect for the treated unit over all T periods, tau with superscript H. All of these last three are going to be harder to estimate than the first one, so we focus initially on that particular target. But the choice of estimand here matters, and you could well imagine that substantively you're more interested in tau with superscript V, the average over all units; that's just going to be much harder to estimate if there's heterogeneity in the treatment effects, and in many cases it's not really clear why it would be an interesting quantity. In the German reunification example, we're interested in the effect on Germany; it's not really clear what it would mean for the Netherlands or other countries to be treated there.

So I want to look at two assumptions on the assignment mechanism. One is that I randomly pick which unit is going to be treated, and the second is that I randomly pick which time period is going to be treated. Neither of these assumptions is necessarily in line with the way synthetic control methods were originally introduced, because that was very much for observational studies, but I want to look at what the properties of these methods are if in fact we make these assumptions. The second assumption in particular is very unrealistic, because it's almost always the last period, or the last few periods, in which units are treated, rather than periods in the middle. At the same time, I think it's a very important assumption to consider, because a lot of the attractive properties of synthetic control estimators come from the treated period being, in some way, like the other periods: assuming that there is a stable relationship between units, a relationship that is the same in the treated period as in the other periods. One way to formalize that is to say that the treated period was just randomly drawn from all possible periods, which makes, at least in expectation, the relationship in the treated period the same as in the other periods.
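In symbols, and this is just my transcription of the objects being described, so the notation in the paper may differ: the assignment matrix and the candidate estimands are

```latex
W = U V^\top, \qquad U \in \{0,1\}^N, \quad V \in \{0,1\}^T, \qquad
\sum_{i=1}^{N} U_i = \sum_{t=1}^{T} V_t = 1,
```

and, writing the unit-level effects as $\tau_{it} = Y_{it}(1) - Y_{it}(0)$,

```latex
\tau = \sum_{i,t} W_{it}\,\tau_{it}, \qquad
\tau^{V} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} V_t\,\tau_{it}, \qquad
\tau^{H} = \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{N} U_i\,\tau_{it}.
```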
Now, here are a couple of very simple estimators, not synthetic control estimators yet. For the first estimator, the simple difference in means, we take the outcome for the treated unit-time-period pair and subtract the average of all the other pairs. For the second, we subtract only the average of the outcomes for the control units in the treated period. For the third estimator, we subtract the average for the treated unit in the other periods. Or we could use the two-way fixed effects, difference-in-differences estimator. We can look at the properties of these estimators under the two assumptions. In general, the simple difference-in-means estimator is unbiased for the population estimand only if you have both unit and time randomization. The difference-in-differences estimator, though, does much better: it's unbiased for tau itself even if we have only unit randomization, or only time randomization, or both.
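A minimal numpy sketch of these four simple estimators for a treated pair (i, t), again assuming a single treated unit-time-period pair; the names are mine, not the paper's:

```python
import numpy as np

def simple_estimators(Y, i, t):
    """Four simple (non-synthetic-control) estimators for treated pair (i, t)."""
    y_it = Y[i, t]
    own_other = np.delete(Y[i, :], t)       # treated unit, other periods
    controls_t = np.delete(Y[:, t], i)      # control units, treated period
    rest = np.delete(np.delete(Y, i, axis=0), t, axis=1)  # controls, other periods

    all_other_pairs = np.concatenate([own_other, controls_t, rest.ravel()])
    return {
        # treated outcome minus the average over all other (unit, period) pairs
        "diff_in_means": y_it - all_other_pairs.mean(),
        # minus the average over the control units in the treated period
        "cross_section": y_it - controls_t.mean(),
        # minus the treated unit's own average over the other periods
        "before_after": y_it - own_other.mean(),
        # two-way fixed effects / difference in differences
        "did": y_it - own_other.mean() - controls_t.mean() + rest.mean(),
    }
```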
Okay, so now let me introduce a more general class of synthetic control estimators. What I want to do here is think about what the synthetic control estimator is going to look like depending on which unit is selected for the treatment and which time period is selected for the treatment. So I'm going to characterize the estimator in terms of weights M, indexed in general by i, j, and t, where i is the unit that gets treated, t is the time period that is treated, and j indexes the unit that gets the weight. I'm going to estimate the average effect as a linear combination of the outcomes Y_jt, where t is the treated period and the index j runs over all units, when unit i is the treated unit. So M_{1,5,t}, say, is the weight that unit 5 gets when unit 1 is the treated unit and t is the treated period. We're going to choose these weights to minimize some objective function that is based on all periods other than the treated period; that's in line with the synthetic control estimator, but it allows us to consider a more general class of synthetic control estimators. There are two restrictions we always impose: the weight for the unit that's being treated is always equal to one, and the weights for all the other units are non-negative as synthetic control weights, forming a convex combination. Within that, the original Abadie, Diamond, and Hainmueller synthetic control estimator imposes two more restrictions: first, that the intercept is zero, and second, that the sum of the signed weights for the control units is equal to minus one, which, given the weight of plus one for the treated unit, means that the total sum of the weights is equal to zero. All I've done so far is give a much more complicated way of characterizing the synthetic control estimator, but doing it this way allows us to look at the bias and other properties of both this estimator and other estimators.

The modification I talked about before was that we allow the intercept to be different from zero. But I want to introduce one more modification: in addition to having the weights sum to zero over all units for a given treated unit, I also have the weights sum to zero, for each unit in the donor pool, over all potential treated units. What this is doing is the following. In the standard synthetic control estimator, it may be that some units get used as a control much more than other units. Let's think of the units as the 50 states of the US, and think of matching essentially based on proximity. Maybe Kansas gets used as a match for states all over the country, so it gets used much more than, say, Alaska; Alaska is way out there and is never really a very good match for any other state. That creates a problem, a bias, because Alaska doesn't get used as a control as often as it gets used as a treated unit in the randomization. So what this restriction says is: if you have a randomization distribution where there's a one-in-fifty chance that Alaska gets used as the treated unit, we also need it to be used as a control one in fifty times. We can't have a unit that is possibly a treated unit but never gets used as a control. That's what this restriction imposes, and by including it we get rid of the bias of the synthetic control estimator. The estimator we actually favor most is the one where we allow for the intercept and impose the restriction that gets rid of the bias.

And here, in some sense, is the key result. The standard synthetic control estimator turns out to be biased even if we have unit randomization. It's also biased in general if we randomly pick the time period in which the unit is treated, and in fact it's still biased if both the unit and the time period are chosen at random, even if there are many time periods. The modified unbiased estimator, on the other hand, is unbiased if we have unit randomization; it's also unbiased if we have just time randomization, provided there are many time periods; and it's unbiased if we have both unit and time randomization, with or without a large number of time periods. The only case where it's not exactly unbiased is if we have only time randomization and a small number of time periods. From the other cases you can see what each piece buys you: allowing for the intercept, moving to the modified synthetic control estimator, buys you unbiasedness under time randomization with a large number of time periods, and imposing the additional restriction buys you unbiasedness under just unit randomization. The modified unbiased synthetic control estimator has both of these properties, so it's unbiased in all the cases where one of the other estimators is unbiased. You can characterize the bias directly and see that it depends on these weights, and the restriction we're imposing, that this set of weights sums to zero, gets rid of the bias.

Now, given the randomization distribution, we can characterize the exact variance of the estimator. The expression for the variance isn't all that interesting, but it turns out you can estimate it as well: you can get an unbiased estimator for that variance, unbiased for finite N and finite T. It's a very messy object because it involves a bunch of bias corrections, but if you plug in the weights corresponding to the difference-in-means estimator, weights of one over N minus one, you get back the standard variance estimator; so it does have some intuition as the natural estimator. The problem is that there is no guarantee that it's non-negative, so in principle, in very small-N, small-T cases, you could end up with a variance estimator that is negative, which is not so great.
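Here is a sketch of how one might impose that extra restriction, under my reading of it: solve for the weights of all N candidate treated units jointly, each with an intercept and convex weights, while requiring that each unit's total weight received as a control, across the N possible assignments, equals the weight of one it carries when it is itself the treated unit. This is illustrative only; it is not the authors' implementation, and the exact form of the restriction in the paper may differ.

```python
import numpy as np
from scipy.optimize import minimize

def modified_sc_weights(Y_pre):
    """Jointly fitted synthetic-control weights for every candidate
    treated unit: W[i, j] is the weight control j gets when unit i is
    treated (W[i, i] = 0), a[i] is the intercept for candidate i."""
    N = Y_pre.shape[0]
    off_diag = 1.0 - np.eye(N)

    def unpack(x):
        W = x[:N * N].reshape(N, N) * off_diag  # zero self-weights
        return W, x[N * N:]

    def objective(x):
        W, a = unpack(x)
        # pre-period fit of every unit against its own synthetic control
        resid = Y_pre - a[:, None] - W @ Y_pre
        return np.sum(resid ** 2)

    cons = [
        # convex weights: each candidate's weights sum to one
        {"type": "eq", "fun": lambda x: unpack(x)[0].sum(axis=1) - 1.0},
        # balance: each unit is used as a control with total weight one
        # across all placebo assignments, matching its weight when treated
        {"type": "eq", "fun": lambda x: unpack(x)[0].sum(axis=0) - 1.0},
    ]
    bounds = [(0.0, 1.0)] * (N * N) + [(None, None)] * N
    x0 = np.concatenate([np.full(N * N, 1.0 / (N - 1)), np.zeros(N)])
    # fine for small N as a sketch; a dedicated QP solver scales better
    res = minimize(objective, x0, bounds=bounds,
                   constraints=cons, method="SLSQP")
    return unpack(res.x)

# with treated unit i in the last period:
# W, a = modified_sc_weights(Y[:, :-1])
# tau_hat = Y[i, -1] - a[i] - W[i] @ Y[:, -1]
```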
Let me end with some simulations for these estimators. We went back to the state-by-year data, with data for 40 years for 50 states, and we looked at a couple of different outcomes. We did these placebo experiments to see how well these estimators do in terms of root mean squared error relative to the difference-in-means estimator. We see that for log wages we do substantially better; for hours the improvement is not quite as big, but it's still substantial; and similarly, for unemployment rates we do considerably better than the difference in means. The variance estimator does very well: we end up getting very close to the variance of the estimator across the simulations.

So let me wrap things up. What we set out to do was understand whether the synthetic control estimator has a role to play if you actually have a randomized experiment. It turns out these estimators can be very useful there: they can do much better than the difference-in-means estimator in terms of root mean squared error. But the standard version does come with some problems: there can be a bias, so it doesn't come with the guarantees that the simple difference-in-means estimator comes with. We figured out a way of getting rid of that bias, and now we have an estimator that, just like the difference in means, is always going to be unbiased under randomization, but that has a much better root mean squared error than the difference in means. So whenever you're doing experiments with very few treated units, where you may not have enough data to get good comparisons for the treated units, using the synthetic control estimator is likely to give you much more precise estimates of the treatment effects.