Hi, this is Jess Nessary, and this is week 9 of PolySci 506, Bayesian Nonparametric and Computational Models. This week we're going to be talking about matching methods and their application to political science. Matching methods have really caught on in political science in the past few years: there have been lots of papers on matching published in the methodological journals, some papers using matching methods in our prominent general-interest journals, and lots of papers presented at conferences like PolMeth, APSA, and MPSA. Unfortunately, one of the things that has come along with matching is a great deal of confusion about what matching does and can do, what problems it can solve, and how it solves them. The most important thing to keep in mind as we talk about matching this week is that matching is a way of non-parametrically controlling for confounding variables and confounding causes, in the same way that entering other variables into a linear regression is a way of controlling for confounding variables. The advantage matching has is that this control is fundamentally non-parametric. What I mean by that is that it imposes less restrictive structure on the relationship between the confounds and y, and also between x, the variable of interest, and y. A model like linear regression, or logit, probit, and other GLM-family models, imposes much greater structure on those relationships: some form of linear, or I should say polynomial, functional form. The advantage we get from matching is that we can get a bit further away from some of those restrictions. It's common, however, probably because matching methods are associated with the Neyman-Rubin causal model and because the approach is sometimes referred to as "causal inference," to think that matching methods are a solution to the fundamental problem of causal inference, or that they solve all kinds of problems that they actually don't solve. One of the most common things that's talked about is endogeneity, which I define in this case to mean a relationship between the independent variable of interest x and the dependent variable of interest y where the influence flows both ways: in brief, y causes x and x causes y. Matching methods, at least as they are applied, in my understanding of them, are not a solution to the endogeneity problem of this sort. What I mean is that if there is an endogenous relationship between x and y, and what you want to do is isolate just the degree to which x influences y while filtering out the degree to which y influences x, matching methods are not going to be able to do that. We'll see that in a little Monte Carlo simulation later in today's talk. So they are not a solution to the endogeneity problem of this kind. There's another set of problems that matching is not really designed to handle well: various forms of sample selection bias and censoring. There are lots of examples of how selection bias can change your results and different ways it can factor into a model. Here's one example of how sample selection bias could play a negative role in your estimation procedure.
Suppose you're studying the relationship between education and income, you collect some data, and you find some kind of relationship between education and income, maybe a weak one. You say, well, there's a relationship there, but it's not very strong. It may be the case that your sample does not include people who have chosen not to respond to the survey or who have not entered the labor force at all, and who therefore appear to have incomes of zero; but really what they have is censored incomes, incomes that have not been observed. So you might have cases where I know people's education levels, but I haven't observed their income because they've chosen not to enter the labor force. This could cause a problem for the simple reason that these people, if their incomes were observed, might change the inference we draw. If I draw a line through the observable data, I get a reasonably flat regression line. But if I could observe the missing people, I might see that their income, were they to enter the labor force, would depart from the pattern of the people I've already seen, in such a way that my estimate would be substantially different. So the fact that these people are not observable on the dependent variable, particularly if that is systematically related to my independent or dependent variable of interest, is going to be a problem for drawing inferences about x and y. Another version of this is censoring, where, for example, we have some relationship between education and income and a full observation of the data set, but there are people whose income I do observe, only I observe it at zero because they are unemployed; I've classified them as zeros. Now, if I fit a line to that relationship, the line is being affected by the fact that, were these incomes not censored at zero, they would actually be negative. A model that accounts for the censoring fits the data much better and is certainly going to be more accurate on average, and ignoring the censoring problem is going to bias the estimate we get out of the censored data set. (I'll sketch this in a few lines of code below.) These are just two ways selection bias can cause a problem; there are others. One classic example: when scientists were studying the cause of the Challenger shuttle disaster in 1986, the initial conclusion was that there was no relationship between temperature at launch and rocket-booster failure events. But the reason they drew that conclusion was that they only looked at failure events; they neglected to look at the successful launches, the base rate of success, in other words. Once those successful launches with no problems were included in the data set, you immediately saw that cold weather caused the O-ring seals to stiffen, creating a much greater propensity for failure. What all these problems have in common is that they involve the systematic omission of information about observations, or the omission of observations writ large, from the data set, in a way that is systematically related to the dependent and/or independent variable.
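Here is a minimal sketch, not from the lecture, of the censoring problem just described. The variable names and parameter values are hypothetical; the point is just that fitting an ordinary regression to incomes censored at zero attenuates the slope:

```r
# Hypothetical simulation of censoring bias: latent income is linear in
# education, observed income is censored at zero for the unemployed.
set.seed(42)
n    <- 1000
educ <- runif(n, 8, 20)                       # years of education
inc  <- -30 + 3 * educ + rnorm(n, sd = 10)    # latent income; can be negative
inc_cens <- pmax(inc, 0)                      # observed income, censored at zero

coef(lm(inc ~ educ))       # recovers a slope near the true value of 3
coef(lm(inc_cens ~ educ))  # slope biased toward zero by ignoring the censoring
```

A Tobit estimator, which models the censoring explicitly, is the standard fix here, as the next paragraph notes.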
This is why selection bias is often characterized, in some of the references I've consulted, as a form of omitted variable bias: the selection problem is a feature of some unobservable or unobserved characteristic which is causing a systematic problem with the data. For example, in the income-and-education case, we're not observing people who are not in the labor force because there's some alternative they can pursue outside the labor force which is more remunerative to them; they're choosing whether to enter the labor force as a result of that background choice, that exit option. In the censoring example, the thing we're omitting is the variation in income below zero. We don't usually think of people paying for the privilege of working, although in the modern economy there are some perverse examples of this related to internships and apprenticeships and the like. But in principle it's possible, and as a causal matter it can change our inference, and it even makes some substantive sense, because we can think of paying for an apprenticeship education, where we go to work for someone and either get paid nothing or even pay expenses so that we can learn a trade. So selection is sometimes characterized as a form of omitted variable bias, and matching cannot correct for this sort of bias. It's not a substitute for the Tobit estimator in the censoring case; it's not a substitute for a selection model like the Heckman selection model; it's not a substitute for multiple imputation in cases where data are missing in some way. And the third thing I should mention is that it's not a correction for the more general problem of omitted variable bias; it doesn't solve that either. That's because matching, as I've labeled it here, is a way of reconstructing experimental conditions via observable contextual variables. Which is to say, it attempts to create randomization between a treatment group and a control group in the same way we would in an experiment; but the trick is that we can't randomize the way we can in an experiment when we're looking at the real world, so we have to do it statistically. It does that via observable contextual variables, by trying to make the context of the treatment group and the control group equal; and if we don't observe those contexts, we can't equalize them. Hence omitted variable bias is not corrected by matching. So the thing I want to start off with is setting reasonable expectations for what matching is. Matching is in some ways a causal inference procedure, but what that means is that under the conditions under which it's designed to work, one can in principle retrieve causal relationships. It does not mean that finding a relationship via matching implies that the relationship is causal. It does not absolve us from thinking about the problem of endogeneity in the reciprocal-causality sense, it doesn't absolve us from thinking about omitted variable bias, and it doesn't absolve us from thinking about sample selection bias.
When we get to our Monte Carlo simulations and examples, we'll take a look at all of these different things and show that matching doesn't solve them. So why do we call it a causal inference procedure? Because this non-parametric control of confounding is robust to pretty complicated relationships between y and the observable contextual variables that we don't care about, which is a very good thing. But I have to say that linear regression is also a causal inference model if the conditions of the classical linear regression model are met. Stating it simply, the five key assumptions for inference under the classical linear normal regression model (CLNRM) are sometimes trotted out; if those five assumptions are met, linear regression is a causal inference model. Of course, some of those assumptions are quite difficult to meet in practice, for example, complete and correct specification of the model: having all the right variables and entering them into the function in a way that matches the data generating process. That's a pretty tall order, particularly because if we're running a statistical model, we probably don't know much about the DGP; that's what we're trying to find out. Matching methods are a little more robust than linear regression; they don't require quite so much in the way of correct specification of the model. But we still have to have the right collection of variables. And the things they tell us are narrow in scope, in the sense that they don't tell us everything that's conceivable to know about the DGP; they just tell us about certain treatment effects, which we may find particularly useful from an inferential or policy perspective. So to recap: let's consider matching models for what they are, a very useful tool, but not one that magically solves all problems of causal inference, or the fundamental problem of causal inference. A very good thing about these models is that they are closely tied to a rather robust notion of causal inference, the Neyman-Rubin causal model, which many researchers have found very useful. So let's talk a little bit about the Neyman-Rubin model of causal inference and how matching aligns with it. As I said, matching is tied to a theoretical or conceptual model of what it means for something to cause something else, a model of causality. So I want to begin by addressing the question of what causality is. Of course I don't purport to answer this question, but rather to lay out a definitional issue: how we're going to define what causality means in this case. Consider a set of treatment conditions T1, T2, and so on, up to Tk. When a measurable quantity or behavior, some dependent variable Y, of a single object of study denoted i differs when only the treatment condition has changed, we say that the change in Y is caused by the change in treatment condition. Saying this mathematically: if we've got some object of study i and we're measuring some dependent variable Y, we want to associate each Y with a particular treatment condition Tj. So Y_i^Tj is defined as the state, in particular the state of the DV, of object of study i under treatment j.
A causal effect, denoted δ_i(k → j), is the difference in Y_i as the treatment goes from condition k to condition j. What I mean by this is simply that we define causality as the change in a dependent variable that occurs when a change in some treatment condition occurs, where only the treatment condition has changed between the two observations and the object of study is the same. The only thing that's different between the two cases is the treatment; that's the key. Only the treatment condition is different, so any difference in Y has to be ascribable to the treatment. In experiments we typically have just two treatment conditions: the treatment condition and the control condition. Sometimes there are two treatments and a control, or the like, but the canonical experiment is: we leave one group of people alone, we do something to another group, and we see how they're different, where the experiment is designed so that the only thing that's different, on average, between the two groups is the treatment condition. So the treatment effect for person i is defined as what they would have done under the treatment minus what they did, or would have done, under the control. In essence, we compare treatment conditions to control conditions and see how they're different, and that's how we determine the treatment effect. Now, that's a very easy thing to say, and it hopefully makes intuitive sense, but there are some deep reasons why we think this is a good way of getting at causality. When you think more deeply about it, you may notice that all of these effects are phrased in terms of individuals: we need to know how an individual differs as they go from the control to the treatment, or vice versa. And the fundamental problem of causal inference lies in the fact that we cannot do different things to the same person at the same time. If you think about what an ideal experiment would look like, you can imagine taking one person, person i, and somehow duplicating them, cloning them, such that the clone is identical in every way to the original: a perfect copy, a perfect clone. To one of these people you apply the treatment condition, to the other you apply the control condition, and then whatever it is you're studying about them, you measure that thing. Picture some kind of detector attached to each. In hard-science experiments these might literally be detectors, and sometimes in political science experiments we might have an fMRI machine or something, but very often these are just survey responses or behaviors in some kind of experimental game we're running on a computer. Then, as these people are exposed to the treatment and the control, we measure what they do.
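Collecting the notation just introduced in one place (this is simply a restatement of the lecture's definitions):

```latex
Y_i^{T_j} \;=\; \text{state of the DV of unit } i \text{ under treatment } T_j,
\qquad
\delta_i(k \to j) \;=\; Y_i^{T_j} - Y_i^{T_k},
\qquad
\delta_i \;=\; Y_i^{T} - Y_i^{C}.
```

The last expression is the individual treatment effect for the canonical two-condition experiment: the unit's outcome under treatment minus its outcome under control.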
Those measurements are our measurements of the key quantities, Y_i^T and Y_i^C: we can compare what we measured the person doing in the treatment versus in the control and come to a causal inference. The problem, of course, and this is the fundamental problem of causal inference, is that we can't expose one person to two treatments, at least not at the same time; we can't rerun history to see how it would have been different if something else had happened. Now you might say we could do C and then T. That's true, but if you do C and then T, you're not really doing T, you're doing C-then-T, so there's a potential ordering effect if you run them sequentially: C then T might be different from T then C. There are various ways of getting around that, but even if you can overcome the ordering effect, you're never going to overcome the fact that the conditions are run at different times, and the background conditions may have changed between them. In other words, if I do one experiment in the morning and one in the afternoon, other things may have changed between the morning and the afternoon that cause my outcomes to differ in ways not attributable to the treatment. Remember, the ideal experiment is one in which the only thing that differs is T versus C; imagine all of this taking place in identical sealed rooms, laboratories where we can filter out all the external things that might influence behavior. If we change the time of the experiment, we're no longer doing the same thing to the same person at the same time: we've changed the time, we've changed what's going on in that person's head, we've maybe even changed who that person is, in the sense that they've biologically aged by a couple of hours and are thinking about different things. You can get better at this, but you can't overcome the fundamental problem of causal inference. So what we do, of course, is say: we can expose person i to treatment T, expose another person j to control C, and then compare. What you're going to get out of that is Y_i^T minus Y_j^C: person i's response to the treatment compared to person j's response to the control. That might be a useful quantity, but it suffers from some downfalls, the most immediate one being: how do we know that the difference in outcomes is attributable to treatment differences and not to other differences between i and j? Obviously, what we want to say is that when the treatment is applied, some change happens; we need to know that the treatment effect is actually attributable to the treatment and not to something that's different about persons i and j. They may be different ages or different races; they certainly have different histories and backgrounds, different levels of education; maybe they came from different families. How do we know that how person i reacted to the treatment is attributable to the treatment and not to something about person i?
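One way to see the contamination in the two-person comparison, via a one-line identity that isn't stated explicitly in the lecture but follows from adding and subtracting the unobserved counterfactual Y_i^C:

```latex
Y_i^T - Y_j^C
\;=\;
\underbrace{\left(Y_i^T - Y_i^C\right)}_{\delta_i,\ \text{the effect we want}}
\;+\;
\underbrace{\left(Y_i^C - Y_j^C\right)}_{\text{differences between } i \text{ and } j}
```

The second term is exactly the nuisance we need to eliminate or average away.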
This gets even more complicated when you consider that treatment responses may themselves depend on who someone is, so-called contextual effects: person i's response to the treatment might be different from person j's response to the treatment. So it's not just that i and j are different people and the observed differences are potentially attributable to those differences; it may also be the case that Y_i^T minus Y_i^C is different from Y_j^T minus Y_j^C. There are significant issues to overcome here, and what we're going to do is try to make comparisons that overcome these individual differences by averaging. Let's set up a little framework. We're going to assume that Y is a function of s, where s is the state of the world and includes a whole bunch of things. For example, I can write Y = f(T, φ), where T is the treatment condition and φ is everything else that matters about the state of the world: background conditions, who the person is, everything that might factor into the outcome is all put into φ. A valid treatment effect is then the change in Y as T changes, holding φ constant: dY/dT at fixed φ. That's fine; of course, the trouble is that in real life, when we're comparing one individual to another, we don't hold φ constant. We have φ0, person 0's state, and φ1, person 1's state, and as a rule φ0 is almost never equal to φ1. Have you ever met someone whose life is exactly the same as yours in all respects? It's rather unlikely. So we've got to find some way of overcoming this, and what we're going to do is a clever combination of averaging and certain assumptions that we need to make this work. In brief, instead of focusing on individual treatment effects, we're going to focus on average treatment effects across a group, in particular the population; that's the real treatment effect we're getting at. The average treatment effect is just the expected value of any individual's δ_i, their individual treatment effect: the expected value of Y under treatment given φ minus Y under control given φ, where the expectation, as you might imagine, is over φ, all the background conditions that vary across the population. So we average across individuals, and we average over their individual characteristics φ. In the continuous case, this is the integral of f(t, φ) minus f(c, φ), weighted by g(φ), the distribution of characteristics in the population, integrated over φ. And there is a discrete analog to this if we have a small finite population: 1/n times the sum over the treatment group of f(t, φ_i), minus 1/n times the sum over the control group of f(c, φ_i). Really, all the people in the treatment group are getting the same treatment, and all the people in the control group are getting the same control; the only thing that differs across them is their φ's.
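Written out, the average treatment effect and its discrete analog (with treatment and control samples each of size n) are:

```latex
\bar{\delta}
\;=\; \mathbb{E}_{\varphi}\!\left[\, f(T,\varphi) - f(C,\varphi) \,\right]
\;=\; \int \left[\, f(T,\varphi) - f(C,\varphi) \,\right] g(\varphi)\, d\varphi
\;\approx\;
\frac{1}{n}\sum_{i \,\in\, \text{treated}} f(T,\varphi_i)
\;-\;
\frac{1}{n}\sum_{i \,\in\, \text{controls}} f(C,\varphi_i).
```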
So all we need for this to work is, as n goes to infinity, for the group of people in the treatment-group sample to be a good representation of the distribution of φ in the population, and the same for the people in the control-group sample. What we need, as n gets really, really large, is for the fraction of people with some characteristic, say φ = φ0, divided by n, to approach g(φ0). The reason is that if you break the expression above into its components, you can split the integral in two: the integral of f(t, φ) g(φ) dφ minus the integral of f(c, φ) g(φ) dφ. The discrete calculations rely on approximating these continuous calculations, and in order for that to work, the sums need, in effect, to integrate over the right values of g(φ): for a particular value φ0, the weight the treatment-group sum puts on that group of people must approximate g(φ0). Putting this into English: if we have an experiment where we get a group of people of size n, randomly assign some to a treatment and some to a control, and compare the two, for example by subtracting to find the average treatment effect, what we need is for the distribution of φ, the background characteristics and whatever else, in the treatment condition to be the same as in the control condition, and for both to be the same as the distribution of φ in the population. If that's so, then the simple averages are identical to the integral calculations for both the treatment and the control conditions, and hence we have a good calculation of the average treatment effect. Another way of putting this: we are constructing an average, and everyone who has taken some statistics knows that sample averages approach the true population mean as the sample gets larger and larger, as long as the sample is randomly selected, as a consequence of the law of large numbers. We need that logic to apply to our experiment, and in order for that to happen, we need to make sure that our sample is representative of the population and that our treatment and control samples are similar to each other. If that's the case, then we're going to be able to estimate an average treatment effect, provided we can make a very small number of reasonable assumptions. So let's talk about what those assumptions are and why we think they're reasonable. When is it reasonable to assume that we can calculate this average treatment effect without an issue? One thing we know for sure we need is the following: there must be no relationship between φ and assignment to the treatment or the control, and both groups must be representative of the distribution of φ that we're targeting, i.e., the population distribution.
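Stating the split integral and the convergence condition just described compactly:

```latex
\bar{\delta}
\;=\; \int f(T,\varphi)\, g(\varphi)\, d\varphi \;-\; \int f(C,\varphi)\, g(\varphi)\, d\varphi,
\qquad
\frac{\#\{\, i : \varphi_i = \varphi_0 \,\}}{n} \;\longrightarrow\; g(\varphi_0)
\ \text{ as } n \to \infty,
```

where the convergence is required separately within the treatment sample and within the control sample.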
Another way of saying this is that we need there to be no covariance between φ and the treatment assignment, whether you get T or C. And yet another way of saying this: the expected value of φ given T minus the expected value of φ given C must on average be zero. Background characteristics, characteristics not having to do with the treatment, should not be related to whether people are in the treatment or control condition. In other words, on average, the same kind of group of people should be exposed to the treatment and the control. If you're running an experiment, this is comparatively easy to do, because we basically assign treatment at random: people walk in and we just point them to the left or the right depending on what order they come in, and as long as our sample is well chosen, which is to say representative of the population, then our random division into a treatment and a control group will propagate that randomness through to the treatment and control conditions. Hence the population will look like the sample, which looks like the treatment group, which looks like the control group. So the key device in experiments is random assignment of T and C. Of course, that still leaves the matter of whether your sample is representative of the population, and sometimes it's not; for example, it's pretty common to use convenience samples, where you're just studying undergraduates, and those don't look like the population at large. Maybe they're close enough, or alike enough in the relevant φ characteristics; but random assignment at least gets you out of the problem of worrying about spurious correlation between the characteristics φ and the treatment condition T. In other words, whatever population is represented in your sample, with an experiment we should be able to recover the treatment effect in that population, whatever population your sample represents. So we're definitely going to need that. Another way of stating this requirement is the so-called stable unit treatment value assumption (SUTVA), which, in the terms we've introduced in this little lecture, says that the data generating process f(C, φ) if you're actually assigned to the control (let me denote assignment A, so A = C) is the same as the data generating process f(C, φ) if you were assigned to the treatment. What this says is that you should not react any differently to the control condition if you were actually assigned to the control than the people assigned to the treatment would have, on average. Another way of putting it: if I got assigned to the control, my reaction to the control should be no different from the reaction to the control condition of people who were assigned to the treatment. We need people to react the same; that is, reaction to the treatment or the control must not be a function of the assignment process. In an experiment, SUTVA is basically true by design. I suppose it is an assumption, but we've almost made it have to be true, because the assignment process is a dice-rolling process, or a computer-generated random number, something that by design has nothing to do with the treatment.
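In the lecture's notation, with A denoting the assignment, the condition just stated is the control-side equality below; the treatment-side version is its symmetric counterpart (implied, though not written out, in the lecture):

```latex
f(C, \varphi \mid A = C) \;=\; f(C, \varphi \mid A = T)
\qquad\text{and}\qquad
f(T, \varphi \mid A = T) \;=\; f(T, \varphi \mid A = C).
```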
There may be some subtle way in which getting assigned to a treatment changes your psychology; you'd really have to think deeply about it to make sure there was no systematic difference between people as a result of the assignment process. But as long as your experiment is a good experiment, if it's well designed, SUTVA is true by design. That leaves the only remaining requirement: that g(φ) among the people assigned to the control is the same as g(φ) among the people assigned to the treatment, and that both equal the distribution in the population; that's the other part of what we talked about above. In an experiment, this is just a matter of making sure your sample is balanced across the treatment and control conditions, which should happen automatically by randomization, and that your sample is representative of the population, which should be true if your experiment draws from the right target population. In an observational study, however, this assumption can be very problematic, and the reason is that the treatments in an observational study are assigned by some process we may or may not understand, which is unlikely to be random. And not only is it not likely to be random, it's likely to be related to φ. Going back to our previous example, education is a treatment: post-secondary, collegiate education in particular. That treatment is a function of who you are. You're more likely to get it if you went to a certain school or have parents with certain wealth; there are probably racial differences as well; sex differences are getting less common, though at this point there's actually a slight advantage for women in college attendance. So the fact that you received the treatment says a lot about you, in addition to saying something about the effect of the treatment. What this means is that SUTVA is not an unproblematic assumption in the case of observational data. If the data were collected appropriately, the second condition might be something we can assume: it might be the case that φ across the treatment and φ across the control is the same. But again, probably not: the treatment assignment process is probably not random, so φ probably looks different for people who received the treatment versus people who did not. You expect wealthier people to be more likely to go to college, so φ is just not the same. So, in order to make causal inferences with observational data, we have to assume that reaction to the treatment or control is not a function of the assignment process, the stable unit treatment value assumption. We make that assumption, and then the only remaining task is to make φ look the same for people who are treated versus people who are not. For example, it could be the case that people sort into education based on how well they expect to do; take their educational background in particular. It could be the case that people who did worse in high school, who have
lower high school GPAs, are less likely to sort into college; but for those few people who do go to college with poor high school backgrounds, we need their reaction to the treatment condition to be the same as what the people with poor backgrounds who didn't go to college would have had. Now, you might immediately say: maybe there's something hidden, unmeasurable, or at least unmeasured, about the people with poor educational backgrounds who didn't go to college versus those who did. That would be a problem; that would break SUTVA. The answer to that problem, hopefully, is to measure that characteristic, put it into the parts of φ that we can observe, and then reassert SUTVA. Maybe there's some kind of ambition or intelligence or wealth about those mediocre high school students that we can measure, and then we can say: the intelligent, wealthy, mediocre high school students who go to college do the same as the intelligent, wealthy, mediocre high school students who don't go to college; the effect of college on those groups is the same. Nevertheless, we still have to assert SUTVA in a way that we don't have to in experiments, because random assignment breaks the links between φ and the treatment process in a way that observational data can't. SUTVA is easy to assume in experiments and hard to assume in observational data, but we nevertheless have to assume it if we want the calculations on the previous page to work. In other words, if we want this average to work, it has to be the case that being put into T versus being put into C does not affect the effect of the treatment on you, given your φ. So we assume this, try to counteract threats to it with measurement, and then, to achieve the other condition, that φ has to be the same in both the control and treatment conditions, we use matching to make it so, if it's not already so, which it often is not. So, in short: in order to calculate the average treatment effect and take the average over φ, we need to use matching methods to compare treated and untreated groups where φ is the same. That's the essence of what matching does. Now, you'll note that it does assume something: this SUTVA condition, which is partially a function of whether we have measured all the relevant aspects of φ. And this raises the specter of omitted variable bias as a threat to SUTVA: if there are things about an observation that we have not measured that determine assignment to the treatment and simultaneously determine response to the treatment, or how well you respond to the treatment, SUTVA is broken and we have a problem. Matching will not work well; it will work, but it will give you something other than the average treatment effect, in the same way that omitting a variable from a regression causes omitted variable bias by sucking covariance with the omitted variables into the ones you did include. Matching does not solve that problem, and this gets back to the very first page, where we said matching is not a panacea and cannot solve omitted variable bias. But you'll also note that if we have all the relevant parts of φ, none of these average treatment effect calculations we've made makes any reference to what f is, the function that maps T and φ onto the dependent variable Y. We don't need to know what f looks like at all; we just need to know that it exists. That's pretty handy: we're able to abstract away
from some of the parametric things that we would ordinarily have to assume in, say, a GLM or regression context, and that's good. Of course, we're calculating average treatment effects, which is a pretty coarse-grained measure of a treatment effect; it's averaging over the population. But it's nevertheless useful if you're trying to figure out what a policy intervention will do, or how some change in the institutional environment will affect outcomes. It may not be able to tell you all the fine-grained things, like how much dY/dX changes when Z is z0; that's a little harder to assess. But it can get you the gross average measurements, which can be, and typically are, very useful. So, as we discussed, the average treatment effect is a good estimate of the causal relationship when the stable unit treatment value assumption is true, which we discussed at length, and when the sample you're using to estimate the relationship is balanced, which is a way of saying that φ needs to be the same between the treatment conditions: all the external characteristics of the sample need to be the same. So when we go to calculate treatment effects using matching, what we're going to do is use matching to construct a data set where φ is equal between the treatments, in order to allow the calculation of treatment effects. I've seen some presentations and slides online by Kosuke Imai where he talks about matching as a non-parametric data pre-processing algorithm, and what he means, I think, is that matching turns the sample into one where basic comparisons can be used to derive inferences. So let's do a little example to illustrate what we mean. Consider trying to figure out whether private or public schools contribute more to the SAT scores of high school students. What we want to look at is the difference between the expected SAT score for students in private schools (putting private schools first, since we expect the effect to be positive) and the expected SAT score for students in public schools. Now, that's a naive way of trying to figure out the difference in SAT scores between private and public schools that's attributable to the school difference, because, as we know, private and public school student bodies don't look alike. For one thing, it may be the case that private schools have wealthier students on average than public schools, and that's relevant because wealthier students have more resources at their disposal for preparation: they may not have to take a job, so they may have more time to prepare for the test, and so on. So we want to look at the difference between private and public schools without the confounding effect of student family wealth getting in the way. What we're going to do is use matching to construct a data set where wealth is equal between the treatments, allowing us to calculate treatment effects. First, we want to know the difference between a private school student who's rich and a public school student who's rich; call that δ-rich, the effect on SAT scores of a private school for a rich student. Then we want to do the same thing for poor students, breaking students somewhat
unrealistically into rich and poor; we'll relax that rather restrictive binary assumption in a moment, but for now assume there are two types of students, rich and poor, and we can break them up accordingly. The comparison of the poor students then gives us δ-poor. So we can say: the average treatment effect we get out of these two matched samples, match group one on the rich people and match group two on the poor people, is just the rich effect times the probability of being rich in the population, plus the poor effect times the probability of being poor. What that tells us is this: we used matching to create two different comparisons, rich to rich and poor to poor, of private and public schools, and that enabled us to estimate the difference in the effect of a private school on rich and on poor students. Then we use those estimates to reconstruct the effect we would expect if we took a student in the system and moved them into a private school: we reweight the effects we've estimated by the population proportions. One is the effect we expect on rich students, the other is the effect we expect on poor students, and the weighted combination is the average treatment effect we'd expect if we applied the treatment to all students equally. Now, it's not just that wealth could be continuous; it is continuous, so what we probably want to do is not break students up artificially into groups, but compare private school students at wealth w to public school students at wealth w, getting an estimate of δ(w), and then average δ(w) over the population distribution of w to get the average treatment effect across all levels of wealth. So, to recap what we're doing: we're matching like students with like students, calculating the treatment effect for each type of student separately, and then recomputing a population average using the distribution of weights in the population.
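Compactly, the reweighting just described is:

```latex
\bar{\delta} \;=\; \delta_{\text{rich}} \Pr(\text{rich}) \;+\; \delta_{\text{poor}} \Pr(\text{poor}),
\qquad\text{or, with continuous wealth } w,\qquad
\bar{\delta} \;=\; \int \delta(w)\, g(w)\, dw.
```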
Constructing this kind of sample with matching is fairly easy; however, I have to say that getting to an average treatment effect from the matching procedure involves an additional step. To explain why, it's probably useful to talk about the average treatment effect on the treated (ATT) first. Continuing our example, the ATT is the expected SAT score of the students who actually went to private school, compared to the expected SAT score those same students would have had if they had gone to a public school instead. In other words, the ATT takes the population of private school students and asks how much better they did compared to if they had gone to public school: it's how much we expect private school to help the students who are in private school. The ATT is typically much easier to get out of a matching procedure, with something as simple as a mean comparison test or a weighted regression on the matched data, and the reason is how the matching procedure typically works. You've got a bunch of treatment cases and a bunch of control cases; consider them arrayed on some dimension of characteristics φ. Often you have more control cases than treatment cases, though not always. What we do is consider each treatment case and try to match it up with the closest control case we can, which, if you have more control cases than treatment cases, means that each treatment case gets matched with a unique control case, but not every control case gets matched with a treatment case. When you create the matched data set, the matched pairs are the ones picked out by some kind of algorithm, and not every control case gets chosen; some control cases don't get matched at all. What that means is that the final data set constructed by this matching procedure is every observed treatment case together with its best-matched control case. So if there are K treatment cases and M control cases, where M is greater than K, the final matched data set in this procedure has just 2K cases, which is less than the total size of the original data set. This is ideal if what you want is to say: here's the population of treatment cases, and here's a population of control cases that looks just like the treatment sample on φ, the observable relevant characteristics. The difference between those two groups is the difference we would expect to see from moving all the people who were treated into the control category; in other words, it's the average treatment effect on the treated. So matching typically enables ATT estimation quite easily: look at the matched sample, do a mean comparison test or maybe a simple regression analysis on the matched data set, and look at the difference. That's it. However, we have to think a little more carefully about this, because the comparisons being made in this matched data set are not exactly the same as the comparisons we would need to calculate an average treatment effect. In short, the cases we've matched look like the distribution of φ in the treatment group: if φ is wealth, the distribution of wealth in the matched data looks like the distribution of wealth in the treatment group, not in the control group. If we wanted to compute an average treatment effect for the entire population, we would need to do some reweighting. In particular, if we can assume that the difference among low-wealth matched pairs is a good estimate of the treatment effect for low-wealth people, and the difference among high-wealth matched pairs is a good estimate for high-wealth people, we may need to up-weight the low-wealth cases, because more poor people are in public schools than in private schools, and down-weight the high-wealth cases, because there are comparatively fewer of them in the aggregate population than in the treatment population of private schools.
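Going back to the mechanics of the match itself, here is a minimal from-scratch sketch of the 1:1 nearest-neighbor procedure just described. The data frame columns `treat` (0/1) and `phi` are hypothetical, and the sketch assumes more controls than treated cases, as in the lecture's picture; in practice, MatchIt's `method = "nearest"` does this with many more options:

```r
# Greedy 1:1 nearest-neighbor matching without replacement on one covariate.
match_nn <- function(df) {
  treated  <- which(df$treat == 1)
  controls <- which(df$treat == 0)
  matched  <- integer(length(treated))
  for (k in seq_along(treated)) {
    d          <- abs(df$phi[controls] - df$phi[treated[k]])
    best       <- which.min(d)
    matched[k] <- controls[best]
    controls   <- controls[-best]  # without replacement: each control used once
  }
  df[c(treated, matched), ]        # final matched data set of size 2K
}
```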
So, to calculate an average treatment effect, we can still use this data set that we've created, but we're going to need to think about reweighting the treatment effect when we go to calculate it. For the average treatment effect on the treated, the procedure is much simpler, because the matching procedure has already created a sample that looks like the treatment population: all we have to do is literally compare the treatment cases to the control cases we selected by matching, and that's it. Fortunately, there is a handy way to calculate the average treatment effect, and it starts with an alternative way of calculating the average treatment effect on the treated. The ATT can be calculated with a post-matching regression technique. It makes intuitive sense, and although you don't really need it to estimate an ATT, you're going to see that if we use this to estimate the ATT, and then jigger it around a bit to estimate the average treatment effect on the control cases, we can come up with a grand average that works out to be the ATE. Imagine taking the data set we have after matching and running a regression on the matched data, either just on the matched control data or on all the data if the data set is small. In the wealth analysis, if we had tons and tons of data, we might just run a regression of SAT against wealth, but only for the public school students; the alternative, if the data set is smaller, is to run a regression of SAT against the treatment (school type) and wealth. The reason for using the control data is that we're going to create a counterfactual set of control cases to which we can compare the treatment cases. Now, use the model you just fitted to generate fitted predictions for the treatment data. If we ran the regression just on the matched control data, we take the private school students' data, plug their values in for wealth, and get estimates, in effect, of what those private school students would have achieved if they were in a public school. If you're doing it the second way, you take the private school students, plug their wealth values in, and set the treatment equal to zero, as though they were in a public school. Then you take the predictions you just generated for the treatment group and compare them to their observed values: take the SAT scores observed for the private school students and subtract the SAT scores you just predicted for them under the assumption that they were in public schools. Do this for every observation i, then take the average over the n students in the private school sample, and you get the ATT. Now, that's belaboring the issue a bit, in my opinion, because we could just as easily have gotten this from the regression with the treatment indicator by looking at the value of the treatment coefficient, which is also an estimate of the ATT; so this is a bit of overkill, but all the same, it is a way of estimating the ATT. Now, we repeat this procedure for the students in public school, except that instead of using the
control data, we use the treatment data, then use the fitted model to generate predictions on the control data, and then compare the fitted predictions to the observed values for the control cases. What have we got? What we've got, for every public school student, is a fitted estimate of the SAT score they would have had if they had gone to a private school, and we subtract from it their observed public school SAT score. If we sum these up and divide by M, where M is the number in the public school sample, we get the average treatment effect on the controls (ATC). That suggests a way of getting the ATE: take the ATT sum, add the ATC sum, and divide by M plus N. In other words, the sum over the treated of observed private school SAT minus predicted public school SAT, plus the sum over the controls of predicted private school SAT minus observed public school SAT, divided by M plus N, gives the ATE. What we're doing is just saying that the average treatment effect on the entire sample is the combination of the treatment effect on the treated and the treatment effect on the controls. Pretty simple. So it's relatively easy to get the ATE using some kind of simple regression analysis on appropriately matched data and generating the relevant computations, though it's a little more involved than getting an ATT or an ATC, which comes directly from a matched data set. One thing I should say about the data set you do this on: in this second computation, for the ATC, you would want to match treatment cases to the control cases, just as for the ATT you match control cases to the treatment cases, as we did above. In that second computation, you'd create a matched data set of size 2M, which may be bigger than the original data set, and which will give you a representative set of cases. You don't necessarily have to do that, though: you can use the same matched data set you created before and say, okay, I'll run the regression on just the treatment data in that matched set, use the fitted model to generate predictions on the control data, and then compare. But one thing you'll want to make sure of is that the control data used for prediction includes the data that was not matched. In other words, you run the regression on matched data only, but when you generate fitted predictions, you generate them for the entire sample, including data not matched. That's necessary because you need the right number of cases to get appropriate weighting when you come down to compute the average. So you can either do a complete rematching on the control data or, as I believe Gary King recommends, use the same matched data set, run the regression on the matched treatment data, and then generate fitted predictions using the entire data set. That's not a problem in the ATT computation, because all the treatment data is matched already, so there's no extra data to worry about; but in the control case there is extra unmatched data to worry about, and we need to calculate fitted predictions for those parts of the sample if we want a valid estimator.
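Here is a sketch of these computations in code, under the single-regression variant described above. The column names `sat`, `treat` (1 = private), and `wealth` are hypothetical, `matched` is the matched data set, and `full_data` is the full sample including the unmatched controls:

```r
# Fit on the matched data, including the treatment indicator.
fit <- lm(sat ~ treat + wealth, data = matched)

# ATT: counterfactual public-school predictions for the private students.
priv   <- subset(full_data, treat == 1)
y0_hat <- predict(fit, newdata = transform(priv, treat = 0))
att    <- mean(priv$sat - y0_hat)

# ATC: counterfactual private-school predictions for ALL public students,
# including controls that were never matched, to get the weighting right.
pub    <- subset(full_data, treat == 0)
y1_hat <- predict(fit, newdata = transform(pub, treat = 1))
atc    <- mean(y1_hat - pub$sat)

# ATE: pool the two sums, dividing by N + M as in the lecture.
n_t <- nrow(priv); n_c <- nrow(pub)
ate <- (n_t * att + n_c * atc) / (n_t + n_c)
```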
So now what we have to do is figure out how to perform this matching procedure. There are lots of ways to do it, and I'll discuss a few of them in a second, but before we get there I want to talk about the possibility of multiple potential confounders. We've talked in the previous examples about matching on one characteristic, like wealth, but obviously there are lots of different characteristics in the private school example that might determine your likelihood of attending a private school, including things like how many private schools are in your area, whether there are secular schools as well as religious schools, wealth as previously covered, but also your age (you might be more likely to attend a private school if you're in elementary school, if we're looking across grade levels), racial effects, the number of siblings, how religious you are, all sorts of things. And we want to match on all the things that might have a relationship that would confound inference on the independent variable of interest: in the previous example, anything that would affect SAT scores and the likelihood of attending a private school. When you have more than one confounder, one thing we can do is something called propensity score matching. The general idea of a propensity score is to estimate the probability of being in the treatment group as a function of the confounders of interest, the φ's: we map φ into a probability space rather than matching on φ directly. So we might say: I've got lots of different background confounders; in the private school example there are tons of them. What I'm going to do is estimate the probability of being assigned to the treatment condition, in this case the private school condition, as a function of all these confounders, and then match on the estimated probability of treatment (it's an estimate, hence the hat) instead of directly on φ. The reason we do this is that as the components of φ get more numerous, it's harder to find good matches. Continuing with the private school example, it may be comparatively easy to find someone similar in wealth to you if you're matching private school students to public school students, but it could be considerably harder to find someone who has the same level of wealth, is the same race, lives in the same geographic area, has the same religion, and so on. Across the entire population of private school students, it's going to be really, really hard to find good matches on all those dimensions for every single student. The propensity score idea collapses all of these factors down into a single characteristic, the probability of treatment, which enables us to compare people who are not the same, but who are the same in the way that matters: their propensity to be assigned to the treatment. Remember, the reason we're doing this matching is that these background characteristics could be related to y and simultaneously related to the probability of selecting into the treatment. If we can break the link on selection into the treatment, we can break the link of confounding: we can control, in essence, for the treatment propensity and eliminate that potential source of confounding. So propensity score matching is one way of doing this, but there are others.
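A minimal sketch of the propensity score step, with hypothetical confounder names; the commented lines show a roughly equivalent call in the MatchIt package mentioned later (whose default distance is a logistic-regression propensity score):

```r
# Estimate Pr(treatment | confounders) with a logistic regression,
# then match on that single fitted probability instead of on phi directly.
ps_fit <- glm(treat ~ wealth + age + religiosity + n_siblings,
              family = binomial, data = full_data)
full_data$pscore <- fitted(ps_fit)   # estimated propensity scores
# ...then run nearest-neighbor matching on the one-dimensional 'pscore'.

# Roughly equivalent with MatchIt:
# library(MatchIt)
# m <- matchit(treat ~ wealth + age + religiosity + n_siblings,
#              data = full_data, method = "nearest")
# matched <- match.data(m)
```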
There's a recent paper by King et al. on coarsened exact matching. Exact matching is the idea of matching case for case, which is to say finding cases that are identical on phi; but as we just discussed, when phi has many dimensions it's going to be really hard to find a case that's exactly the same on every dimension. What coarsened exact matching does instead is stratify phi on every dimension. Imagine, for example, that we've got two characteristics to compare on, phi1 and phi2, and we want to match on both, and we've got data points for the control group and data points for the treatment group. None of these points (rather sloppily drawn, I have to say) overlap. But rather than matching cases one to one, which would be rather hard, we can break each dimension into subcategories and then match cases based on whether they fall in the same category: these cases get matched together, and these, and these, and so on. You'll notice that some strata have no cases at all in them, and some strata have only control cases or only treatment cases; those strata get discarded. They're not included in the estimation, because in essence there's no good match for those cases. You can also play around with exactly how the strata get determined. For example, the MatchIt program that we're going to talk about shortly uses some automatic techniques to determine the strata. One way is just to use the histogram procedure: via methods like Sturges' rule, it determines the ideal placement of breaks for an accurate depiction of the frequency, given the number of cells you wish to specify. So in essence you're just controlling the number of cells, and you might want fewer or more depending on balance, meaning how well the treatment and control cases are matched such that phi looks the same on all dimensions for both groups. We'll talk a little about how to specify the nature of the strata when we start applying this technique in software. Like I said, there have been some papers about coarsened exact matching that compare its performance to propensity score matching and find that it results in better bias and variance characteristics, meaning lower bias and tighter variance, as compared to propensity score matching, and we'll do a little tentative exploration of the differences when we get into the software techniques.
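Here's a sketch of what that looks like using MatchIt's CEM method, with hypothetical covariates phi1 and phi2 in a data frame d. In recent MatchIt versions the cutpoints argument accepts binning rules such as "sturges"; check the documentation for the version you're running.

```r
library(MatchIt)
## Coarsen each covariate into histogram-style bins, then match exactly on
## the resulting strata; strata lacking both groups are discarded.
m.cem <- matchit(treat ~ phi1 + phi2, data = d, method = "cem",
                 cutpoints = list(phi1 = "sturges", phi2 = "sturges"))
summary(m.cem)  # balance, plus how many cases were discarded
```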
So, to pull together this discussion of the multiple-confounder matching problem: when you have lots of different confounders, there are different ways of achieving matches, and the point is to get good balance, which is to say a good match between phi, the confounding characteristics, in your treatment sample and phi in your control sample. In MatchIt, the software program we're going to use today, there are multiple techniques available, and we're going to talk about three of them. The first is nearest neighbor matching on a propensity score, which is the propensity score technique we just discussed; the "nearest neighbor" part refers to the fact that propensity scores are estimated and then each case is matched to the case whose propensity score is closest to its own. The second is the coarsened exact matching technique we've just covered. The third is matching via genetic optimization, which is discussed in a paper by Sekhon and a co-author whose name escapes me at the moment. The genetic optimization technique looks at various balance metrics and tries to change the characteristics of the matching procedure to optimize those metrics. In particular it uses genetic optimization, which is a little beyond the scope of this discussion, but in short, genetic optimization is a technique designed to find optima in very complicated spaces using what amounts to differential reproduction of candidate solutions according to their fitness, which is to say their performance on the optimization scale: how good are these balance metrics? This is implemented via the rgenoud package in R, so it may be a good automatic way of achieving balance. The nearest neighbor and coarsened exact matching techniques don't have automatic balance characteristics; we have to assess balance ourselves, whereas the genetic optimization technique assesses balance automatically. That doesn't mean it always gets it right; it just means there's an automatic procedure, which takes some of the user choice out of the process. So now, with that in mind, let's actually go into R and start implementing some of these matching techniques and test them in controlled scenarios where we can assess their performance. OK, let's take a look at some examples in R. The key package you'll need to run these examples is MatchIt, made available by King et al. to perform matching inference. I'm going to start with a couple of simple examples. The first is a data set in which units that expect better outcomes are selecting into a treatment that has no value in and of itself. You could think of this as the Spence signaling model of college selection: you don't actually learn anything in college, but smarter people go to better colleges, so the quality of your college degree signals how smart you are even though it doesn't do anything to help you. So I'm going to draw some data. w and z are covariates, and x is the treatment variable of interest; x is selected with probability positively related to z and w, and z and w are in turn positively related to y, the outcome variable of interest. If we plot w against y, you'll see a weakly positive relationship, and the same for z. You'll actually see the same for x: x is a discrete variable in this case, so it might be more useful to do a box plot, but x has a weakly positive relationship with y. That x relationship is spurious, though. It's spurious because, as you can see in the DGP, the true effect is zero; the only reason we see a difference here is that x is related to z and w, and z and w are related to y, so the association passes through. It's a back-door correlation, as Judea Pearl might call it.
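Here's a sketch of a DGP along those lines; the coefficient values are illustrative choices, not necessarily the exact ones used in the lecture's script.

```r
set.seed(1)
n <- 1000
w <- rnorm(n)
z <- runif(n)
## Selection into treatment depends on z and w (probit-style)...
x <- rbinom(n, 1, pnorm(3 * z + w - 2))
## ...but the outcome depends only on z and w: the true effect of x is zero.
y <- 10 * z + 5 * w + rnorm(n)
dat <- data.frame(y, x, w, z)
boxplot(y ~ x, data = dat)  # a spurious positive gap appears anyway
```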
So I'm going to create a data set called dat, which just incorporates y, x, w, and z, and I'm going to match my sample using matchit, matching the treatment variable x using w and z. If you pull up the help for matchit, you'll see it has a lot of options. The formula, the first argument, tells matchit what you want to use to create the matching model: what is the phi that I'm trying to balance between the treatment condition and the control condition? In this case I'm trying to balance w and z, the background variables, across the treatment groups defined by x. My data set is just dat. The method argument tells matchit, as you might expect, what method to use to do the matching: nearest neighbor propensity score matching, where the default propensity score is calculated by a logit. You can also calculate the propensity score using some other method; I've specified a probit, because the actual selection model is a probit. And finally there's the replace argument. replace controls whether the matching across treatment and control picks the best match every time, regardless of whether it's already been used, or picks a new match every time and removes that match from the control pool. I'm going to use matching with replacement, which allows the best match every time even if it's already been used; the reason, as I can show you, is that the performance of the estimator is much better with replacement. I'll save the resulting data set as match.dat. The match.data command here effectively just extracts the data that was matched; if you look at its help, the group argument says which data you want, the treatment group, the control group, or everyone. I want everyone, because I want the matched sample. I'll talk about the weights in a second; for now, just store them, because they'll be important, and I'll explain why in a moment. If I plot match1, the "All" column is the entire data set before matching and the "Matched" column is the data set after matching, and what I'm looking at are QQ plots, quantile-quantile plots. One axis is the control units and the other is the treated units, and each plot asks, for example: is the 20th percentile of the propensity score among the treated units equal to the 20th percentile among the control units? A perfectly balanced data set should produce a 45 degree line, so that the 80th percentile of the propensity score among the control units looks just like the 80th percentile among the treated units. The idea is that in a balanced data set, the distribution of phi across the treated units looks just like the distribution of phi across the control units, and that's a 45 degree line on a QQ plot. The "All" column basically shows what the data set looks like before we attempt matching, and as you can see, on the z variable the data set is quite poorly balanced. That makes sense, because when we created this data set, x was very strongly linked to z: the coefficient on z was 3 in the propensity of being assigned to the treatment, x equal to 1. It's a little unbalanced on w, because there was a relationship to w as well, but it's highly unbalanced on z.
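The sequence of calls being narrated here looks roughly like the following. One version caveat: in recent MatchIt releases the probit propensity score is requested with distance = "glm", link = "probit", whereas older versions accepted distance = "probit" directly.

```r
library(MatchIt)
match1 <- matchit(x ~ w + z, data = dat, method = "nearest",
                  distance = "glm", link = "probit", replace = TRUE)
## Extract the matched sample (both groups) along with the weights.
match.dat <- match.data(match1, group = "all")
plot(match1)  # QQ plots for w and z, "All" vs. "Matched"
```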
The second column gives us the matched units, and you can see the QQ plots look pretty good on w and a little better on z, but they're still not perfectly matched: well matched, but not perfect. Ideally we want to be inside that pair of dotted lines, which is just a rough indicator of whether the match is good enough. So these are a little imperfect, but look at what happens if we change replace to FALSE. I'll recalculate everything: not all treated units receive a match, and the matching is considerably worse, because we're forcing the procedure to choose a new control case every time rather than matching to the best case regardless of whether it's been chosen before. So in this case I'm choosing matching with replacement to get a bit better performance, but that's going to have a consequence, and it's why I have to use those weights, which I'll talk about in a second. The summary(match1) command I just printed is another way of assessing balance. What I'm looking at is whether the distributions of w and z look the same for the treated cases and the control cases. For all the data, there's a pretty big difference on this "distance" variable, which is the propensity score metric: a pretty big difference in mean propensity score, and big differences on w and z. That's not so good. After matching, the means on w and z and on the propensity score actually look quite good, pretty much the same, so we're getting good mean balance. On the other hand, the quantile plots don't look quite right, because we're not getting ideal balance at all parts of the distribution. Another way of seeing this is to look at the densities; you can just draw the histogram plot to see what I'm talking about. The means of the treated and matched control groups are similar, but the distributions within the groups are quite different. Comparing the histograms of treated and control cases by propensity score: the mean propensity score is the same, but the control cases have a much more even distribution of propensity score, whereas the treated cases tend to have a very high propensity score. So we've fixed it to a degree; we've shaped the control distribution to look much better than the raw data would, but it's still not perfect. You can see this in the jitter plot, which is another option (let me clear these plots so we get a good view; there we go). It shows the matched and unmatched treatment and control units, and you can see that all of our treatment units have been matched, because, as I mentioned earlier in the lecture, the software tends to keep all the treatment units so as to give us a good estimate of the ATT. Many of the control units remain unmatched, because we're matching with replacement. The size of each circle indicates how many times that particular case has been used as a match. This case all the way to the right has been used a ton, because it's such a good match to so many of the treatment units; these down here have been used less frequently, simply because there are more control cases available in that region and fewer treatment units to match.
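Those diagnostics correspond to MatchIt's summary and plot methods; a quick sketch:

```r
summary(match1)                # mean balance on distance (the propensity
                               # score), w, and z, before and after matching
plot(match1, type = "hist")    # propensity score distributions by group
plot(match1, type = "jitter")  # matched/unmatched units; circle size shows
                               # how often a control was reused as a match
```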
So matching with replacement has given us a situation where we're overusing the rare control units and underusing the common control units, but that's actually fine. There's nothing intrinsically wrong with it, and in fact doing so enables us to get better matches for the particular treatment units that are high on the propensity score. Now what I'm going to do is run a regression that basically just amounts to a mean comparison test on the raw data, without matching, and what you can see is that on the raw data we get a 35.35 difference in y between the treatment and control groups, even though there's no real causal effect of x. In other words, we've got a real basic case of spurious correlation, owing to the omitted variable bias of neglecting w and z. But if we use the matched data instead of the raw data, we get a difference of essentially zero, which is exactly what we should get. Now, you'll notice I've used the weights calculated by the matching procedure as regression weights. The reason I have to use them is that what we're effectively doing is comparing the mean of the control group to the mean of the treatment group; this is mentioned in Gelman and Hill's treatment of matching models, which you can take a look at. They point out that matching is much better at comparing groups to groups, because we can make phi look similar between the groups; it's not so good at comparing cases to cases, because it's pretty hard to get an exact match on a case. What this means is that it actually is OK to match with replacement, because all we're trying to do is make sure the two groups look alike. However, when we compare the mean of the treatment group to the mean of the control group, we've used a couple of those control cases a lot, and we don't want to count them as independent observations when we estimate the control group mean, quite simply because we have one case there, not 10 or 15 cases; if we count one case as 15 cases, we're going to underestimate the uncertainty in our estimate of the control group mean. The weighting scheme just downweights observations that have been repeatedly used: the more times a case is used as a match, the lower its weight gets. In short, we include those weights in the regression as a way of downweighting, so that we don't understate the uncertainty in our estimate of the control units. Now, we could have done this exact same thing without any matching, by properly specifying a model that includes w and z, correctly specified, and that works perfectly well, as long as you know how to specify the model of w and z. If you don't know how to specify that model, it may or may not work.
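Here's the mean-comparison pair from above as a sketch; the weights column comes straight out of match.data:

```r
lm(y ~ x, data = dat)  # raw data: a large, spurious "effect" of x
## Matched data: the weights downweight control cases that were reused
## under replacement, so repeats aren't counted as independent observations.
lm(y ~ x, data = match.dat, weights = weights)  # estimate near zero
```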
Let me give you an example of what I'm talking about. Suppose there's non-linearity in the response function on one of our key variables, z, a confounding variable: I've got selection on z, and just as before there's a coefficient of 2 on z in the selection relationship. If I plot x against z, you can see that as z gets bigger, the probability of selecting x also gets bigger, so that when z gets close to 1 we're almost always picking x, while when z is near 0 we get a more balanced sample. Now suppose the effect of z on y is non-linear; in fact it's quadratic. If I plot the response against z, you can see what's happening: as z gets further from 0.5 in either direction, y gets bigger. This is a typical non-linear response between y and z. Now, what we're really interested in is the treatment variable x, which is correlated with z. If I just do a raw comparison of y on x, I get an estimate of 0.637, which makes no sense; that's not right. But if I run a naive model of y on x and also put z in there, since I know z matters (and w too, so I'll put w in as well; it's not going to make a difference, but what the heck), I also get bad answers. Look at this: I'm getting an answer that still has extra correlation between x and y, even with z in the model. The reason is that the response to z is non-linear, and x is sucking up those non-linear relationships. What matching gives me, in effect, is a way of controlling for z without needing to know what the shape of the response between z and y looks like. So, just as before, I put everything in a data frame and do the matching again. I'm going to start with probit nearest neighbor distance matching, and let me put in a couple of plots, because you'll see that this actually doesn't perform all that well in this case. Look at the plot of match1: we're getting pretty bad balance in the matched data set on the z variable. We don't want that; this is bad. So I'm going to try to get better balance, which again is the whole point of matching, by using coarsened exact matching. What I've done here is construct a grid of strata for z and w using the natural breaks created by histograms. The breaks command just says: if I draw a histogram of z with 100 bins, where should I put those breaks? It calculates that on its own, using an optimal calculation for break placement. I've used 100 breaks because I know from my distance matching that we're struggling to get good matches on z, so I want a pretty fine grid on z; w I'm not having so much trouble matching on, so I'll just use a 10-break grid. Then, when I do my matching, I tell it to use cutpoints for my strata grid equal to the breaks I've created, and I tell it to use coarsened exact matching. The k2k option just means, I believe, that it returns an equal number of observations in each cell; you could set that to FALSE, and it's not going to matter a whole lot for this particular calculation. I save my data set and its weights using match.data, just as before, and now if I plot the balance, I get better balance characteristics: you can see a pretty good relationship between control and treated units on the QQ plot, much better than before. What happens if I turn k2k matching off? Let's see: it's the same, probably because the x and z variables are so evenly distributed. All right, let me save this information, and now I'm going to calculate my mean comparison, basically my very basic regression, with the matched data set.
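A sketch of this second scenario, with illustrative coefficients: z enters y quadratically, x has a true effect of 1, and the strata grid comes from histogram breaks as just described.

```r
set.seed(2)
n <- 1000
z <- runif(n)
w <- rnorm(n)
x <- rbinom(n, 1, pnorm(2 * z - 1))        # selection on z, as before
y <- x + 10 * (z - 0.5)^2 + w + rnorm(n)   # quadratic in z; true x effect = 1
dat2 <- data.frame(y, x, z, w)

## Fine grid on z (hard to balance), coarser grid on w.
zbreaks <- hist(z, breaks = 100, plot = FALSE)$breaks
wbreaks <- hist(w, breaks = 10,  plot = FALSE)$breaks
match2 <- matchit(x ~ z + w, data = dat2, method = "cem",
                  cutpoints = list(z = zbreaks, w = wbreaks), k2k = TRUE)
match.dat2 <- match.data(match2, group = "all")
```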
And what do I find? Well, I find that I'm now appropriately estimating the relationship between x and y, which was the problem before: with the naive data set I was getting an answer of 0.6, so I was actually underestimating the relationship. With the matched data I get the correct coefficient of 1; if I just naively put z and w in the regression, I get the bad coefficient of 0.65, just as before. Now, of course, if I properly specify the relationship between z and y, I also recover the 1, which is just a way of saying you don't need to do matching if your linear model is correct. But if you don't know the right way to specify the covariates, your background exogenous regressors, then it's probably a good idea to do matching. I should also say that I can run the regression on the matched data including not just the treatment variable but also the background variables, as a way of additionally correcting for any problems in the matching procedure, that is, any covariance between x and w or x and z that wasn't captured by the matching. You can see I get an estimate of 1.04 as opposed to 1.03, so I'm not improving very much in this case. I bet I could do even better yet if I put in z squared. z squared is not in match.dat, so I need to add it; I'll just square the z variable in match.dat and add it there. There we go: doing that, I'm actually getting appropriate estimates of the relationship between z and y, and still getting the correct estimate of the relationship between x and y. In the Gelman and Hill book, and in other places I've seen, it's mentioned that matching estimators of this kind are so-called "doubly robust". The reason people call them doubly robust is that, as you can see, if I get the matching model right, it doesn't really matter whether I get the regression model right, at least for estimates of the relationship between x and y; conversely, I could get the matching model wrong, and as long as I get the linear model correct, I'll still get good answers. And indeed, when I don't use matching at all, I'm in effect getting the matching model extremely wrong, yet as long as my linear model is specified correctly, I still get good answers. So one way of thinking about matching is as robustness reinforcement: do matching on your data set, and then also write down the specification of the model you think is best, and things will work out, because you've got two layers of robustness built into your analysis, two different places to catch you if you fail.
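Continuing the sketch above, here are those two layers on the matched data, including the hand-added squared term:

```r
match.dat2$z2 <- match.dat2$z^2   # add the squared term by hand
## Matching alone: a weighted mean comparison on the matched sample.
lm(y ~ x, data = match.dat2, weights = weights)
## Matching plus covariates, with the correct quadratic in z: two chances
## to get it right, which is the "doubly robust" idea.
lm(y ~ x + z + z2 + w, data = match.dat2, weights = weights)
```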
Last, I want to show an example of endogeneity, and of how matching can't resolve problems of which way the causality flows. I've designed a data set where y causes x but x does not cause y; in other words, endogeneity is set up so that the causal arrow flows exactly the opposite of the way we would like, where x causes y. There's no effect at all of x on y, but y does affect the probability of being selected into a particular x treatment. What we'd like to see from our matching procedure is absolutely no effect of x on y at all. So I use my typical matching procedure, just as before, and my match looks pretty good; you can see I'm getting good balance characteristics. If I just use the naive data, I get a strong and statistically significant impact of x on y, exactly what I don't want to see. But if I use the matched data, well, I pretty much get the same thing: a strong and statistically significant relationship between x and y. What this tells me is that matching has no way of figuring out whether the correlation between x and y flows in one direction or the other, at least not on this basis. And even if I include controls for z and w along with my matched data, I'm still getting bad results, as a consequence of the fact that matching does not solve endogeneity problems; matching can solve parametric problems, but not endogeneity problems. What about omitted variable bias problems? Suppose, for example, that x is correlated with z and w, as it is here, but w is not included in the matching model. I'm going to copy this endogeneity example and change it into an omitted variable bias example. Now y is no longer going to affect x; x depends on just z and w, and z and w in turn affect y, but there's no relationship between x and y, so this is a null model. What we're going to do is, when we go to construct our matches, leave w out of the matching model entirely. If I recall correctly, CEM does not like to perform matches with only one specified variable; actually, it seems to work fine at first, but that's because we still had w in there, and once we take w out of the match model it does complain. So we'll switch to some other method: for example, nearest neighbor matching, with a probit distance and replace set to TRUE. We do the calculation and check whether our balancing procedure is working. It's working OK, but the balance could be a bit better, so maybe we can try one of the other matching procedures, for example subclass matching. Subclass matching sort of resembles CEM, except on one dimension, and I believe we have to tell it the number of subclasses we want to use, such as 10. (I thought there was an option where we tell it how to sort, but let's just see how this works out: no replacement; and hold on, the argument isn't called class, it's something else, it's called subclass, so let me change that.) Then we run our matching procedure, and if we look at a plot of the matches, it lets us see the subclasses individually: there's the first batch of matched classes, then the second, third, fourth, fifth, sixth, seventh, eighth, ninth, and tenth, and they're all pretty balanced. So we've got pretty good balance characteristics; this is sort of similar to CEM, but on one dimension. Now, if we run our model with the matched and unmatched data, what you'll see is that we are actually getting a correlation between x and y on the matched data set, even though the true relationship between x and y is zero. The reason is that we've got an omitted variable, w, and if we don't match on w, matching has no way of fixing that. So we have a problem.
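Here's a sketch of that omitted-variable setup, with illustrative coefficients: w confounds both selection and the outcome, but the matching model only sees z. The last line previews where we're headed next, putting w back into the regression.

```r
set.seed(3)
n <- 1000
w <- rnorm(n); z <- rnorm(n)
x <- rbinom(n, 1, pnorm(z + w))   # selection depends on z AND w
y <- 2 * z + 2 * w + rnorm(n)     # the true effect of x on y is zero
d.ovb <- data.frame(y, x, z, w)

## Match on z only, leaving w out of the matching model.
m.sub <- matchit(x ~ z, data = d.ovb, method = "subclass", subclass = 10)
md <- match.data(m.sub)
lm(y ~ x, data = md, weights = weights)      # still biased: w was omitted
lm(y ~ x + w, data = md, weights = weights)  # correct model recovers x near 0
```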
On the other hand, if you put w into the regression model, so you've gotten the matching wrong but the model right, it actually works out: you can see that x is properly recovered as having no relationship, and w is recovered as having the right relationship. What's interesting about this is that it's exactly what we mean by doubly robust: if you get the model right but the matching wrong, it still works out in terms of estimating the relationship between x and y. But you have to get it right somewhere. If you omit w from the matching stage because you don't know it exists, or because you're not aware that it ought to belong in the model and in the matching procedure somewhere, then matching is not going to magically make the situation any better: you're still going to have omitted variable bias, just like you would in a model that didn't involve matching. So, in summary: don't expect matching to be a magic bullet that solves all your problems. See it as a way of making your model more robust to misspecification, but not to omitted variable bias, not to endogeneity, and not to a variety of selection problems. All right, that's it for this week. Thanks for tuning in, and I'll see you next week.