Hi, this is Dr. Justin Essary, and this is week 8 of Poli Sci 506, Bayesian Computational and Non-Parametric Models. This week we're going to talk about measuring uncertainty in statistical estimates with non-parametric approaches. One of the key features of statistical analysis is that it not only tries to model what we do know, to extract information from a data set, but it also tries to tell us what we don't know, or what we're uncertain about. Now, obviously there are limits to this. A model that could tell you everything you didn't know would be a pretty good model; if anyone ever comes up with that, they really have something. But one thing we certainly can do is say that, given a set of assumptions that we've made about a model and given the data that we're seeing, we can narrow down our knowledge to a certain range. We can say that our best estimate, given the model and the data we have, is inside this range, but we're less certain where the relationship lies inside that range. That idea, the idea of quantifying uncertainty, of not only trying to recover information about what we know from a model but also recovering the fact that we're not certain about what we know from that model, is a really interesting and important part of statistics. The focus of this week's talk is trying to take that approach while imposing less parametric structural information on the model. Now, before I get into how a parametric approach to estimating uncertainty might differ from a nonparametric approach, I just want to talk a little bit about what's going on here: what are the sources of our uncertainty in a statistical model? Like I was just saying, we know already that any statistical model is going to give results that are imperfect. By imperfect, I mean they could be structurally mistaken, but even if they're structurally correct, they are also going to be somewhat uncertain; there's going to be some error in the prediction, most likely. Our goal in a model is to try to minimize those imperfections as much as we can, but it may be the case that even in very ideal circumstances there's an extent to which we cannot resolve uncertainty. There are ultimately unresolvable sources of uncertainty or variability in our estimates. Insofar as that exists, we want to capture and recognize it in our modeling procedure. The first question I want to start by asking is: where does error come from? Why are our models uncertain? Some of these sources of error are resolvable, in the sense that we can maybe limit them or narrow them down by doing something, but some are not. For example, one source of error in a model is sampling variation. Right at the time I'm recording this lecture, there's a presidential election in the United States underway in 2012. There's a lot of attention in the media focused on polling data, trying to figure out how people are going to vote, what people are thinking about right now politically. Do they prefer Mitt Romney or Barack Obama for the presidency? Quite obviously, no polling firm, or even the government, has the resources to ask every voting American what they think, and particularly not every week. We have to take a sample out of the population of American voters and ask them what they think, maybe a thousand, two thousand, three thousand people, something like that, and then use that information to extrapolate what the rest of the population thinks given that sample. But obviously, samples are not all alike.
Even in a perfectly implemented sample, even in a perfectly representative sampling design, there's going to be some extent to which you don't exactly mirror the underlying population distribution of opinion. So sampling variation is the variation that we see because our samples are not perfectly representative: the error that accrues to results because even properly implemented sampling procedures do not perfectly mirror the underlying population, owing to random chance in who is selected for the sample. Now, this is not a lecture about sampling theory or sampling procedures and survey estimates, which in itself would be a quite involved topic. There are lots of ways of doing it, and all of these ways make a best effort to recover a sample that looks like the underlying population, and looks like it in ways that are relevant to the thing you're asking about. So, for example, if we're asking about presidential election opinion, we want a sample of voters, because those are the people whose opinions matter if we're trying to project an election outcome. And we also want a sample that's representative of the underlying distribution of opinion in the population. Given that we know that opinion varies by race, sex, age, education level, ideology, geography, all those things, we try to design sampling procedures that give an equal chance of selecting all of those different groups in the population. Now, the simplest way of doing this is simple random sampling, where you assign every single person in the United States a number from 1 to 300 million, draw all those numbers out of a big urn or something, pick the ones whose numbers are called, and ask them what they think. It's a little hard to actually implement, but that would be a way of doing this. But even in that ideal procedure, where simple random sampling really does give every single person, every voting person if we can narrow it down, an equal chance of being selected, just because in expectation the sample looks like the population doesn't mean every single sample looks like the population. Everyone knows that, for example, even a fair coin, a coin that has a 50% chance of coming up heads and a 50% chance of coming up tails, sometimes comes up heads six times in a row or eight times in a row. That happens from time to time, and it happens in random sampling procedures as well. You have some variation due to random chance in who gets selected, who gets into that sample. That randomness in who gets selected becomes a source of variation in the opinion estimates that we derive from that sample, and so it's a source of error in estimates that come out of a model. There's also another interpretation of this that I'll delay a little bit until we discuss the second source of error. Now, there are ways of minimizing sampling variation non-methodologically. The easiest, and I say that in quotation marks, is just to make the sample bigger. The bigger the sample, the more likely it is to be representative of the population, assuming that your sampling procedures are valid. Of course, it costs money and takes time to put together a bigger sample, and so it's easy conceptually but hard to do in practice.
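To make the idea of sampling variation concrete, here is a minimal sketch in R. Everything in it, the "true" support level, the poll size, and the number of repeated polls, is invented purely for illustration; it just shows how poll estimates scatter around the population value because of who happens to be drawn.

```r
# Illustrating sampling variation: repeated polls from a known population.
# The true support level of 0.52 and the poll size of 1000 are invented
# purely for illustration.
set.seed(42)
true.support <- 0.52
poll.size    <- 1000

# Simulate 500 independently drawn polls and record the estimated support in each.
poll.estimates <- replicate(500, mean(rbinom(poll.size, 1, true.support)))

summary(poll.estimates)   # estimates scatter around 0.52
sd(poll.estimates)        # roughly sqrt(0.52 * 0.48 / 1000), about 0.016

# A bigger poll shrinks, but never eliminates, that scatter.
big.polls <- replicate(500, mean(rbinom(5000, 1, true.support)))
sd(big.polls)
```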
So there's a sense in which this is a resolvable source of error in our estimates, but there's also a sense in which it's unresolvable, given the reality that our samples are always going to be limited. So we can see this as a fundamental source of uncertainty in our models that we're simply going to have to deal with. Another source of potential error is noise in the data-generating process. What I mean by that is, suppose that we actually observe the entire population. This is common, for example, in IR data sets, international relations data sets, where we're studying countries' behavior over time. It may be the case that we have every single country for a given time period that we'd like to study, and so we are observing the population; there is no sampling variation. Nevertheless, it may be the case that the outcomes themselves are randomly generated, which is to say that the procedure or data-generating process that we're studying is itself irreducibly random. Let me give you an example of this. There is a theory popular in the IR community that war is, at least in part, a fundamentally random outcome in the sense that it occurs due to factors which we normally attribute to the error term. What I mean by that is that, generally speaking, a peaceful resolution of a dispute would be less costly to the actors involved, and so the fact that a dispute arises must be because they disagree about the fundamentals of the dispute: who can win, their likelihood of winning, or something else about what they think about the dispute. That disagreement generates the potential for conflict outcomes. But that disagreement may be driven in part by uncertainty, uncertainty about information, and the degree of uncertainty means there could be noise in what people think. Which is to say, the analysis that a leader makes of the situation may itself be random: the assessment they make one day may be different from the assessment they make six months later, owing nothing to underlying fundamentals but having to do with random variation in what kind of intelligence they've got, or even such things as their psychological mood. And so the process which generates conflict outcomes is itself noisy. Now, in the natural world we see this all the time: there are lots of data-generating processes that are not completely deterministic. They include error or noise. The easiest examples to bring up are quantum phenomena, where the randomness seems literally intrinsic to the process. But even macroscopic events, like weather patterns, behave this way. Weather patterns we think of as being largely deterministic, but there are so many small differences that can drastically change, in the long run, the path of a hurricane, that predicting the path of that hurricane beyond a few days becomes very difficult. Very minor differences, almost attributable to noise, in wind currents and temperature on day one can make a big difference about where that storm ends up on day three, and those differences are too small for us to measure. So, just to write down a version of this, suppose the data-generating process actually looks like the one below.
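The transcript doesn't capture the equation written on the board at this point; a standard version, consistent with the linear-model discussion later in the lecture, would be a DGP with an additive noise term, where the error is often (as the next passage says) modeled as normal:

$$ y_i = f(x_i) + \varepsilon_i, \qquad E[\varepsilon_i] = 0 $$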
That means that there is a noise term in the data-generating process that we cannot make go away. It is random; it has some kind of distribution. Often we model it as being normal. That doesn't mean it actually is normal, but we say it's normal, or close enough to normal. Sometimes people say, well, the error term includes things that we don't know yet, or that we can't explain, or that we're neglecting for ease of interpretation. That's true: things that you neglect will end up in the error term. What I'm saying is that even if we had a completely perfect model of the world, at least for some things we're studying, there would still be randomness, there would still be noise, and we'd have to deal with that. So these two aspects of randomness or error in estimates are irreducible in some way, and we're just trying to deal with them and to recognize that they make our estimates hard to nail down. It's hard to say, for example, who's going to win the presidential election next week, for the reason that the outcome may not be determined. We may know there's a 70% or 80% chance that Obama's going to win. That doesn't mean he's definitely going to win; it means there's going to be a roll of the dice on that day, and there's a 70% chance it comes up Obama and a 30% chance it comes up Romney. But there is a third source of uncertainty, and that's the uncertainty that comes in from model mis-specification, which is to say that we're not necessarily sure that the model we've written down is the right one. This is something that's actually somewhat hard to capture methodologically or econometrically, because it's hard to get a sense of: how do we know this model is wrong, and how wrong is it? How good is our approximation to what we're doing? Part of what I want to get at today is that nonparametric approaches to the estimation of uncertainty in a statistical model are designed to minimize this source of uncertainty, but that means they may be less efficient at minimizing the other two sources of uncertainty. So, for example, if I know the exact process by which uncertainty is generated in a model, and it has a parametric form and I know it, I can probably build that into a model and very efficiently tell you the uncertainty bounds that are going to come out of that data-generating process. But if I'm not entirely sure where the uncertainty in the data-generating process comes from and what it looks like, then there's an additional source of uncertainty present in the fact that I'm not exactly sure what this model is supposed to look like, and building a parametric model might or might not be a good enough approximation. So the bootstrap and other nonparametric approaches to uncertainty estimation are designed to remove model mis-specification uncertainty, or, here's a better way of putting it: they're a way of making model mis-specification uncertainty a non-issue, by using an error-recovery mechanism that is, we hope, robust to many different kinds of error-generating process. The price we're going to pay for that is wider confidence intervals and more uncertainty. A little bit inaccurately, but somewhat intuitively, we can interpret this as the uncertainty price we pay for being uncertain about what our model looks like. Of course, there are all sorts of other reasons to use nonparametric models of uncertainty.
In some cases it's because we actually can't write down what the parametric model looks like. But these are issues that maybe we can delay a little bit until we talk about what a parametric approach to measuring uncertainty would look like and where it might fall down. So let's go to that right now. A parametric approach to measuring uncertainty effectively says: we know a structure within which the data-generating process is happening. That structure includes some integrated uncertainty, an error term in the DGP, for example, or some kind of sampling error. And we're going to extract the degree of uncertainty out of the model using those assumptions and the information that's present in the data. So, for example, in an ordinary least squares regression, what we're interested in is the degree of uncertainty in some kind of estimated relationship, like a marginal effect. I want to know whether, when something goes up, something else tends to go up or down. I want to know: when the economy gets better, does the presidential incumbent tend to do better in the election or worse? The way I might measure that is by running a regression of past election results on economic performance and seeing what the relationship is. If I run a linear regression, what I've said is, well, the relationship between those two things must be linear, and there's a parameter that measures the marginal relationship between them, the coefficient beta. And so my interest is in the uncertainty in that parameter. So uncertainty in the estimated effect is often, although certainly not always, rooted in uncertainty in parameter estimates, and in turn, uncertainty in the parameter estimates is defined by the model. So, in part, our estimate of uncertainty in the parameter depends on the quality of the underlying model. And by quality of the model, I mean: how well does that model approximate the data-generating process? How much do its assumptions distort, or fail to reflect, what's happening in the DGP? So in order for these things to work, we need a perhaps not totally accurate, but sufficiently accurate, specification of the model and of whatever variance estimates are going in, such as the variance-covariance matrix. Let me give you a very simple example of a parametric estimate of uncertainty. When we do a two-sample t-test, we calculate the difference of means between two samples, and we need to figure out: is that difference something we would consider statistically significant, or is it attributable to sampling variation or noise in the data-generating process? Comparing the means of the two samples is easy: we just subtract the two means and see whether the difference is zero. But then we've got to see whether that difference is statistically distinguishable from noise. Well, under the null we have that the difference is equal to zero, so that the two means are the same. The variances or standard deviations of the two populations, one and two, are unknown but estimable, and we're not sure whether these two distributions are equally spread out, so we're going to allow them to be different.
Then the test statistic that we would use is the estimated difference between the two sample means divided by our estimate of the standard error of that difference. We calculate the standard error of the difference as the estimated variance of the first sample divided by the size of that sample, plus the estimated variance of the second sample divided by its size, and then take the square root to get back to the scale of a standard deviation:

$$ t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}}} $$

Incidentally, the reason my notation is shifting is that I'm taking this straight from the Stata manual; this is a very, very common test to run, and they use X and Y notation instead of one and two. Now, is that a reasonable estimate of the standard error of the difference between the means of those two distributions? Well, maybe, maybe not. There are a few assumptions that go into this. For example, we have to assume that there exist finite variances of the two distributions that we're comparing, which may seem like a trivial thing to assume, and it usually is, but there are some distributions that are not especially well behaved, and there are some circumstances where this finite-variance assumption won't apply. So you can fall down on that point. The one thing that's nice is that, regardless in some ways of the underlying distributions of the samples, for the distribution of the difference in their means we can appeal to central limit theorems and say that it is distributed t, or in very large samples approximately normally. So we don't make too many assumptions here, which is actually a good thing. But a little later I'm going to show you an example where even this very mild set of assumptions creates a test which won't work, at least in some cases, whereas a nonparametric test that doesn't make those assumptions will work better. Now, a much easier example in terms of laying out the parametric assumptions is how we construct the variance-covariance matrix of beta hat. As you know, an ordinary least squares regression model looks for some kind of linear relationship between a bunch of x's, which we can label x1, x2, and so on up to xk, so that our estimate of y is

$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \cdots + \hat{\beta}_k x_{ki} $$

Now, we know that these parameter estimates are uncertain, and so we can construct a matrix of the variances and covariances of the beta hats, the VCV. Under the usual OLS formulation, the VCV is given by our estimate of the error variance times X-transpose-X inverse, where X-transpose-X is the square matrix formed by multiplying the matrix of regressors, transposed, by itself:

$$ \widehat{\text{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X^{\top}X)^{-1} $$

And of course there are various ways of getting this estimate of the variance of the regression error. If I recall correctly, the usual way of doing it is just 1 over n minus k times the sum of squared residuals, or, if we want to stick to matrix notation, u-hat transpose u-hat over n minus k, where k is the number of regressors:

$$ \hat{\sigma}^2 = \frac{1}{n-k} \sum_{i=1}^{n} \hat{u}_i^2 = \frac{\hat{u}^{\top}\hat{u}}{n-k} $$

So that's fine; it works under many circumstances. But obviously a lot of assumptions are built into this construction of the variance.
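Just to convince yourself that this formula really is all that sits behind the standard errors your software reports, here is a minimal sketch in R. The simulated data, sample size, and coefficients are made up for illustration; it builds the VCV by hand and checks it against vcov() from lm().

```r
# Hand-building the OLS variance-covariance matrix and checking it against lm().
# The data-generating process here is invented purely for illustration.
set.seed(101)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)

X      <- model.matrix(fit)                 # n x k regressor matrix (with constant)
u.hat  <- residuals(fit)
k      <- ncol(X)
sigma2 <- sum(u.hat^2) / (n - k)            # sigma-hat squared = u'u / (n - k)

vcv.by.hand <- sigma2 * solve(t(X) %*% X)   # sigma-hat^2 * (X'X)^{-1}

round(vcv.by.hand - vcov(fit), 10)          # differences are essentially zero
sqrt(diag(vcv.by.hand))                     # the usual reported standard errors
```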
You've probably already discussed at some point, in one of your various classes, what happens if the error term epsilon is not homoscedastic. As you know, in the underlying data-generating process, $y = X\beta + \varepsilon$, so this is the true value of y, and there is this epsilon term. If epsilon does not have a constant variance, then the assumption that we can estimate the VCV using sigma hat squared is going to break down. There are various fixes we can apply to that: White's heteroscedasticity-robust standard errors, or Efron's variant, or any number of other techniques. But nevertheless, it's important to know that there's a parametric assumption being leveraged in order for us to say that this variance-covariance matrix of the betas makes sense. There are other assumptions as well. A sort of obvious one is that the data-generating process actually looks like this linear model. If it doesn't, there's no way to get to this VCV in its convenient form, and there's no guarantee that approximating a nonlinear DGP with a line is going to properly recover the necessary level of uncertainty in our fitted values, y hat. What I'm saying to you here is that if we're using OLS as an approximation tool, which in my opinion we almost always are, then our variance-covariance matrix may or may not be a good summary of the true degree of uncertainty we ought to have, given the fact that our model is an approximation, and given that the betas, the best set of betas that would approximate the DGP, are themselves uncertain. So linearity, or another way of talking about this, proper model specification: if we don't have that, there's no guarantee that our uncertainty estimates make sense. The same goes for maximum likelihood estimates. I don't want to go into great detail about how those uncertainty estimates are recovered; in short, the so-called information matrix, the matrix of second derivatives of the likelihood (the Hessian) that comes out of a maximum likelihood estimate, is how we typically measure the uncertainty in an estimated model under an MLE framework. But the same ideas apply, in the sense that if the likelihood function you're using to approximate the data-generating process is not a perfect match to the actual data-generating process, then there's no guarantee that the uncertainty in the beta hat estimates you produce from maximum likelihood is the best statement of uncertainty about those beta hats. In other words, it may be the case that these beta hats are actually more uncertain, possibly less uncertain, but probably more uncertain, than you realize, and the curvature of the likelihood function, which as you know depends on the sample at hand, may or may not fully tell you that. That source of uncertainty may not filter through fully to the standard errors. So parametric models of uncertainty are useful and good, but they are approximations. And much as we used a kernel regression approach to relax assumptions and get a better estimate of the mean, we're going to use a non-parametric approach to uncertainty to relax those assumptions and perhaps get a better estimate of the uncertainty. So let's do that.
To give you a sense of how one might non-parametrically assess variance, I want to start off with a really simple example: the Mann-Whitney, or Wilcoxon, rank sum test, which is described in many, many places, including many statistical textbooks. If you have a copy of Stata 11, you can also check out the Stata 11 manual page for this test, which gives you a sense of how it works and comes included with your Stata software. Basically, in the Mann-Whitney-Wilcoxon rank sum test, we're trying to decide whether two samples that we've gathered come from equivalent distributions. So we've got a sample of something and a sample of something else; these might be the response variables for two experimental groups that have been subjected to different treatments, for example. What we want to know is: do these things come from the same distribution or not? Is the distribution of X1 the same as the distribution of X2? In particular, we're going to test whether the medians of these two samples are equivalent. So the question here is: are the medians of X1 and X2 the same? And the null hypothesis is that they are. Now, you might immediately say, hey, we've got another, more parametric tool to do this, the two-sample t-test that we just described, and that's correct. The two-sample t-test, a mean-comparison test, is aimed at the same goal: trying to assess whether two distributions are the same, and in particular whether they have the same mean or not. But the Mann-Whitney-Wilcoxon rank sum test is less parametric than a two-sample t-test because it makes no distributional assumptions. There are actually a couple of different ways to talk about this test, in other words multiple ways to get to it, but the easiest way to describe it is probably the following. Each one of these samples has a bunch of different observations in it. I'm just going to denote the samples X and Z to make it a little easier, so we want to know whether the medians of X and Z are the same. We've got a bunch of observations X1, X2, X3, all the way up to Xn, and a bunch of observations Z1, Z2, and so on up to Zm, and m does not necessarily equal n. What a rank sum test does is say: let's compare every possible pair of observations. So we'll start by comparing X1 to Z1, and we'll just see whether X1 is bigger than Z1; if so, we'll code that comparison as a 1, and if not, as a 0. We compute this indicator for every possible pairing of observations in the two samples. Then we add up all of those indicators, and we'll call the total U. This is the Mann-Whitney U statistic: it is the number of pairs for which Xi is bigger than Zj, summed over i and then again over j. And it can be shown, basically by combinatorial arguments, that if the two distributions are equal, we expect this statistic to follow a certain distribution. So what we're going to do is ask: does the U that we've computed look like a draw from that distribution or not?
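Here is a minimal sketch in R, with invented data, that computes U exactly as just described, by brute force over all pairs, and checks it against the built-in wilcox.test(), whose reported W statistic is the same quantity when there are no ties.

```r
# Brute-force Mann-Whitney U: count the pairs (x_i, z_j) with x_i > z_j,
# and compare with wilcox.test(). The data here are invented for illustration.
set.seed(7)
x <- rnorm(15, mean = 0.5)
z <- rnorm(20, mean = 0.0)

# outer() builds the full n x m matrix of pairwise comparisons.
u <- sum(outer(x, z, FUN = ">"))
u

# With continuous data (no ties), this matches the W statistic that
# wilcox.test() reports; the test's p-value comes from the combinatorial
# null distribution of that statistic.
wilcox.test(x, z)
```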
Is it the case that we can distinguish our observed value from what we would expect under the null? U has a distribution under the null, but that distribution is given by combinatorics only, and all we need to do is see whether the U we compute is well inside that distribution or not. Now, you may say, wait a minute, there's a distribution, that kind of sounds parametric to me. Well, note that nowhere in here have I specified anything about the characteristics of the X and Z distributions, and nowhere in here have I computed means or anything like that; I'm just comparing ranks. If the distributions are the same, no matter what those distributions are, we expect to see a certain distribution of U. So it's nonparametric in the sense that the distribution of U under the null does not flow from any distributional assumptions regarding X, Z, or the difference between them. In essence, what we say is: under the null, we expect to see some distribution of U. We calculate our particular U statistic, U0, and we see whether a U0 like ours would occur under the null a lot or only rarely. The area shaded to the right would be the one-tailed p-value for the first sample being bigger than the second sample, because that would mean the sum of ranks is higher than expected. So it's a very simple and very old test that dates back to the 1940s, and it's commonly used, for example, in testing experimental hypotheses about treatment effects where we don't want to assume any kind of distribution for the test statistic. Now, we can show by performing this test in R that under some sample conditions it actually does a bit better than a t-test under the same conditions. So what I've got here is just a little demonstration. I'm going to start off by clearing my console, and I'm also going to clear my memory; that's generally a good idea before you start an analysis. I should actually add that to the lecture file here so that everyone clears their memory before we get started, to make sure there's nothing hanging around from previous sessions. I'm going to set a seed so that we all have the same random variables in memory, and I'm going to draw x and z, two variables, from the Cauchy distribution, with different location parameters. If you plot the nonparametric densities of x and z, well, actually it's probably best if we do this via box plot. You can see that we've got very high variance in our samples here, which is natural because the Cauchy density has very, very fat tails, so fat that its variance is undefined. You can see that in the fact that our distributions are pretty weird-looking here. But if we just look inside the limits between y equals 5 and y equals negative 5, you'll see that the second sample, as we specified in the location parameter, is a little bit bigger than the first sample. But the variance of this distribution is undefined. So if we run a t-test comparing x and z for equality of means, what you'll see is that the p-value for the difference test is 0.3767, which is not statistically significant. It's telling us that there's no difference in these two distributions: one is not larger than the other, does not have a bigger mean than the other.
But if we do a Wilcoxon rank sum test, we get a very, very small p-value, which tells us that the location shift is not equal to 0, in other words that one of these distributions is bigger than the other, and by looking at the box plot we can immediately see that it's the z distribution that's bigger. So the Wilcoxon rank sum test detects a statistically significant difference in this Cauchy case where the t-test doesn't. And if we do this over and over again, you'll see that we get similar answers over and over again: a two-sample t-test p-value of 0.6 versus a Wilcoxon p-value of essentially 0 again. So again, the t-test is not finding a difference and the Wilcoxon rank sum test is. Same thing, same thing; the t-test is missing almost every time. The t-test finally did get a statistically significant result there, but the Wilcoxon is nailing it much more consistently. So there are some simple cases where relaxing the distributional assumptions about the underlying dependent variable can be useful for detecting differences where a more parametric approach falls apart because its assumptions are not supported. In this case, it's difficult to estimate the variance for a t-test because the variance of a Cauchy-distributed variable is not defined. Now, you might be asking yourself, how often is this really going to happen in applied data? Well, not that often. Probably in most situations a t-test has reasonable enough assumptions that it will do a good job of catching statistically significant differences where they exist and ignoring them where they don't. But the Wilcoxon rank sum test is, as we've just demonstrated, at least in some situations a little more robust. So in very generic data analysis situations it's likely that a t-test will suffice. But if you have a situation that's a little different, where your data are not exactly well behaved, a little bit non-normal, something that might lead you to question the assumptions of the t-test, whether the central limit theorem applies or whatever, it might be a good idea to do some Monte Carlo simulations to see whether the Wilcoxon rank sum test behaves better under the circumstances of your particular data set than an ordinary t-test. And in my experience, many researchers that I work with on experiments just default to using Wilcoxon rank sum tests because they're more free of assumptions, and so they feel like they're letting the data speak for themselves more than we might by imposing the normality or t-distribution structure that we do with a t-test. Now again, central limit theorems typically apply for a difference-of-means test, so we can usually make it work asymptotically. But it's something to think about. That's a very simple, controlled example. Most of you, obviously, are going to use much more complicated models, which are going to have much more complicated assumption structures and, as a consequence, probably more fragile assumption structures. So now let's take a look at a more complicated example where we relax assumptions and look at the consequences for estimates of uncertainty. So now I want to talk to you about a generalized procedure for handling uncertainty in a non-parametric way called bootstrapping.
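Before moving on to bootstrapping, here is a sketch of the kind of demonstration just described. The seed, sample sizes, and location parameters used in the recorded demo aren't in the transcript, so these values are invented and the numbers will differ from the ones quoted above, but the pattern should be the same.

```r
# A sketch of the demonstration described above: Cauchy samples with different
# location parameters, compared with t.test() and wilcox.test().
rm(list = ls())
set.seed(123)

n <- 100
x <- rcauchy(n, location = 0)
z <- rcauchy(n, location = 1)

boxplot(x, z, names = c("x", "z"), ylim = c(-5, 5))

t.test(x, z)        # often fails to detect the shift: the Cauchy has no finite variance
wilcox.test(x, z)   # the rank-based test usually does detect it

# Repeat many times to see how often each test rejects at the 0.05 level.
reps <- 500
rejects <- replicate(reps, {
  x <- rcauchy(n, location = 0)
  z <- rcauchy(n, location = 1)
  c(t = t.test(x, z)$p.value < 0.05, w = wilcox.test(x, z)$p.value < 0.05)
})
rowMeans(rejects)   # rejection rates; the Wilcoxon row should be much higher
```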
We could in fact apply bootstrapping, or bootstrapping-like procedures, to the two-sample mean-comparison problem that we just looked at, as an alternative to the Mann-Whitney-Wilcoxon rank sum test. But of course the advantage of bootstrapping is that it's not limited to these narrow scenarios; it's applicable in a wide variety of scenarios. The idea behind bootstrapping is pretty simple, although you can certainly go as deep into the subject as you want. Some of the readings for this week come from a book about bootstrap methods and their application by Davison and Hinkley; it's a thick book and a very technically involved one. It's not the only book on the subject, there are several, including a somewhat famous book by Efron and Tibshirani about the bootstrap as well. So the idea is simple, but there are lots of complexities you can quickly get into if you want to start thinking really seriously about this. The basic idea is the following. We know that our estimates are variable and uncertain. But rather than try to do something with estimated error terms to capture that uncertainty, instead of estimating those error terms, we're going to try to simulate the process of sampling. Sampling is quite obviously a source of uncertainty in a regression, or in any kind of statistical analysis, and so maybe we can figure out something about the uncertainty in our estimates by reenacting the sampling process and seeing how our results change across different samples. Now, that leads you to ask the question: where are all these new samples coming from? We'll deal with that in a second. We have to find some way of sampling repeatedly, and as it turns out, what we're going to do is treat the sample that we have as though it were the population, and then repeatedly sample out of it with replacement. That's how we're going to simulate the process of sampling. So we're going to get these samples somehow, we're going to estimate the model on each simulated sample, and we're going to recover whatever quantity we're trying to estimate from each sample. Once we've done this a lot of times and we've got a lot of different samples, we can look at how our estimated effects or parameters change from sample to sample. What we're doing is recreating the sampling distribution of a statistic without taking repeated samples from the population; we're taking repeated samples instead from our own data set, from our sample, which sounds a little strange. But it's intuitive in the sense that if sampling repeatedly from the population works, maybe sampling repeatedly from the sample will work too. And as it turns out, it works quite well. So let's take a look at a couple of different ways we can implement bootstrapping. I'm going to start with an example of parametric bootstrapping that you might be familiar with if you've taken some previous courses in political methodology or in statistics. Parametric bootstrapping, in general, means using the distributional assumptions of a parametric model as the source of resampling. What we're going to do is sample out of the distribution that is provided for us by a fitted model. It's called parametric bootstrapping because, obviously, it relies on the parametric assumptions of the model that create the distribution you're sampling out of.
The benefit of doing this, however, is that it allows us to estimate uncertainty in quantities for which we cannot solve for the parametric distribution directly, but which we can construct from quantities whose parametric distribution of uncertainty we do know. A very simple example of this is estimating marginal effects out of regression models with interaction terms. So I've got a little model here with X, Z, and X times Z being used to estimate Y, with an error term as usual. We estimate this model on a data set, and what we want to do is plot dy/dx against z. If you just take the derivative, dy/dx for this model is beta 1 plus beta p times z, where beta p is the coefficient on the product term. So if we want to estimate this marginal effect, how much Y changes with X, we need to know the value of Z, and if we want to know the uncertainty around that, we need to calculate the uncertainty of this quantity. Now, in some cases, for example in ordinary least squares regression, there is a formula for calculating this exactly. We just need the variance of beta 1 plus beta p times z, and the identity for the variance of a sum of random quantities gives us the variance of beta 1, plus the variance of beta p times z, plus 2 times the covariance between beta 1 and beta p times z. Because z is a constant, it comes out of the second term as z squared times the variance of beta p, and it comes out of the covariance term only once, since that term is a covariance rather than a squared term:

$$ \text{Var}(\hat{\beta}_1 + \hat{\beta}_p z) = \text{Var}(\hat{\beta}_1) + z^2\,\text{Var}(\hat{\beta}_p) + 2z\,\text{Cov}(\hat{\beta}_1, \hat{\beta}_p) $$

And we have derived the exact formula for computing the uncertainty around a marginal effect estimate in a linear least squares regression. This won't always be possible; there are going to be some quantities for which we don't know the exact distribution. For example, when we simulate quantities of interest out of nonlinear models like a logit or probit, it's going to be much harder to come up with exact statements of the uncertainty around those estimates. So we're going to have to do something else, and we can do that something else in the case of linear regression models as well. What's the something else? Well, it's this: I know how dy/dx is constructed. It's constructed out of beta 1, beta p, and z. So what I want to do is determine the uncertainty around this quantity, and the uncertain pieces in this equation are beta 1 and beta p; I'm going to consider z fixed and certain for this analysis. That means I could try saying, okay, I'm going to simulate a lot of estimates of beta 1 and beta p out of my distribution for these two things. Now, I should say that the true quantities don't have distributions; in practice, we're doing this with estimates. We've estimated all these quantities out of a model, and as a consequence we don't know the distribution of the betas either; we've estimated it somehow. There are actually two different ways of looking at this, but we can talk about there being distributions of the estimates. What's the distribution of the estimates going to be? Well, by various central limit theorems that we can invoke, we can say that this distribution is going to be normal, with means given by the estimated beta hat quantities, which are our best guess anyway, and with variances and covariances given by the estimated VCV matrix.
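Setting the simulation idea aside for one moment, here is a minimal sketch in R of that exact calculation. The data and coefficient values are invented for illustration; the lecture's own simulated data set will differ.

```r
# Exact (analytic) confidence interval for the marginal effect dy/dx = b1 + bp*z
# in an interaction model, using the variance formula above.
set.seed(2012)
n <- 500
x <- rnorm(n); z <- runif(n, -10, 10)
y <- 1 + 2.5 * x + 1.0 * z - 0.2 * x * z + rnorm(n, sd = 3)

fit <- lm(y ~ x + z + I(x * z))
b   <- coef(fit)
vcv <- vcov(fit)

z.fits <- seq(-10, 10, by = 0.5)
dydx   <- b["x"] + b["I(x * z)"] * z.fits

# Var(b1 + bp*z) = Var(b1) + z^2 Var(bp) + 2 z Cov(b1, bp)
se.dydx <- sqrt(vcv["x", "x"] +
                z.fits^2 * vcv["I(x * z)", "I(x * z)"] +
                2 * z.fits * vcv["x", "I(x * z)"])

plot(z.fits, dydx, type = "l", xlab = "z", ylab = "dy/dx")
lines(z.fits, dydx + 1.96 * se.dydx, lty = 2)
lines(z.fits, dydx - 1.96 * se.dydx, lty = 2)
```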
Now, the actual distribution of beta hat is governed by the true values of beta and sigma. If we knew the true state of the world, we would say that in the limit of a very large sample, these estimates will be distributed around beta with variance-covariance sigma, where those are the true mean vector and variance-covariance matrix. But we, of course, only have estimates of these things, constructed from a data set. So we're going to plug in our estimates of the VCV and beta and say: those are our best guesses for the population parameters. We know that in the limit of a, technically, infinitely large sample, the beta hats are distributed normally with a mean vector of beta hat and a VCV of sigma hat, where sigma hat is the VCV from the regression and beta hat is just the vector of fitted OLS or MLE coefficients. The normal-distribution part of this comes from a central limit theorem that says these coefficients become normally distributed in large samples. So why don't we do this: why don't we pull a lot of draws of beta 1 hat and beta p hat out of this asymptotically normal distribution for beta hat, and for each one of these draws compute dy hat/dx? Then, once we've done this a whole lot of times, gotten a whole lot of draws, and computed a whole lot of dy hat/dx values, we have, in effect, a simulated density of dy hat/dx. So: recover the OLS or MLE coefficients and the VCV; plug the estimated coefficients and the VCV into the asymptotic distribution of beta hat; simulate many draws of beta hat from that asymptotic distribution; and then compute dy hat/dx for each draw. The resulting collection of dy hat/dx values forms our estimated, or simulated, distribution for dy hat/dx. This is, in short, what the Clarify package of King, Tomz, and Wittenberg, published more than a few years ago now, is designed to do. It's designed to take a fitted model, use some assumptions about the asymptotic distribution of its parameters, simulate draws from that asymptotic distribution using the fitted coefficients, and then construct out of those draws more complicated quantities that we can compute over and over for each draw, and therefore get a distribution of the complex quantity. So that's what we're doing; we're doing, in short, those simulations. And this is parametric bootstrapping, again, because it relies on our assumptions about the model and also on our assumptions about the asymptotic distribution of the coefficients. How is this applied in practice? Well, this approach is often used, like I said, for computing the uncertainty around these interaction-term marginal effects, and we can repeat that process in R to give you a sense of how it's done and what the product looks like. So let's do that now. All right, so here is my fake-data script that creates a model with an interaction. You can see I've got two variables, X and Z, their product X times Z, and a constant term. Here are the beta coefficients on the constant, X, Z, and X times Z, in that order. What I'm going to do is use a bit of matrix algebra to construct the model, which is Y equals X beta plus a normally distributed error with a specified standard deviation. That's it. Then I'm going to output that data as a Stata data file and clear my memory.
So I've got nothing left in memory, and I'm going to try to recover the data-generating process using just the data. I read back in the data that I just created, so it's as though I had just loaded the data set. Here's my linear regression model: you can see here's X, here's Z, and here's X times Z. And if you look back at the beta coefficients, we're doing a decent job of recovering the coefficients that we specified. But what we want to do is see what the relationship between X and Y is. You might say, well, here's the beta coefficient on X, it's 2.6, so every one-unit increase in X causes a 2.6-unit increase in Y. But that's not true. That marginal effect is going to be different at every value of Z, and for some values of Z it may even be statistically insignificant, because you can see this product-term coefficient is negative, which means that as Z goes up, the marginal effect of X on Y is going to get smaller and smaller and may even converge to zero. So what we do is compute dy/dx using the formula we laid out earlier, and we compute the standard error of that quantity. I actually don't need to save the EPS file, so I'm just going to get rid of that. And I'm going to start off by using the exact calculation, which we noted in the notes. So given that this is the true model, what's going on here? What's going on is that I forgot to save the coefficients and the VCV first. Here we go. So there's the marginal effect of X on Y at all sorts of different values of Z, and I've got the 95% confidence interval plotted around it. Because this model is a linear model, not an approximation, this is the exact replication of the underlying data-generating process, and those 95% confidence intervals are built from the best linear unbiased estimates; they're BLUE. So these are perfectly fine; they'll work perfectly. But in practice we may not always know that that approximation is good, and for other quantities, or for interaction terms in nonlinear models, we may not know that formula; we may not be able to solve for it. So now what I'm going to do is demonstrate the simulation approach. You can see right here, first I'm going to start by drawing the line again, and then I'm going to take a thousand draws out of the multivariate normal distribution with mean equal to the estimated beta coefficients and sigma equal to the estimated VCV. So now I've got a thousand draws of beta. And if you just look at the head of beta.draws, you'll see for each coefficient I've got tons and tons of drawn beta coefficients out of the posterior... well, it's not a posterior density; it's only a posterior density if the prior is flat. But if you do a density plot, for example, of the beta draws for X, that would be all rows in column X, there we go, that's what the distribution of those draws looks like. It's normal, as we would expect, and it's centered around our coefficient estimate, as we would expect. So what I'm going to do is use those beta draws to compute dy/dx at values of z, z.fits here, from negative 10 to 10. So that's my matrix of dy/dx draws. And if you just look at the head of the dy/dx draws, I've got a ton of different estimates. Actually, I should probably just look at the first column of these: all rows, first column. This is the dy/dx when z is negative 10, that's the first column. This will be for when... oh, actually it might be... so it's actually z.fits in rows and replicates in columns.
So that's for when z equals negative 10, this is for when z equals negative 9, and if we go all the way to 0, there's that. Anyway, so there's our matrix of draws. Now what we want to do is recover the 2.5th and 97.5th percentiles of those distributions. So for each row, we're calculating the 2.5th percentile and the 97.5th percentile of our draws. In other words, if I plot the density of the dy/dx draws for the first row, there it is: that's our estimate of dy/dx when z is negative 10, at its smallest. If I want a confidence interval around that, I say, okay, we'll take the 2.5th quantile on the left-hand side and the 97.5th quantile on the right-hand side, and that forms our estimate of the uncertainty of dy/dx at that point. Then we just repeat that for every single value of z. So if I repeat it for some other value of z, there you go: you can see that the effect has gotten smaller, and smaller again, and smaller again. As the value of z increases, my estimate of dy/dx goes down, and the uncertainty around it can get wider or narrower as well. The easiest way to show this is just to draw lines. Whoops, I need to redraw my original plot here; there's my dy/dx line. So we draw lines around the central line showing what the 2.5th and 97.5th quantiles look like at each one of these locations. You can see this line is a bit shaky. Why is it a bit shaky? Well, this is a simulation-based process, so we're getting minor variations due to, in effect, simulation variability. But you can see that we're pretty clearly tracing out the 2.5th and 97.5th percentile bounds around this central line, and that it's pretty close to the exact calculations that we constructed before. I can just draw in those exact calculations again, and you can see they pass right through; they're almost indistinguishable from the simulated intervals. Now, we could go back through and smooth out these simulated confidence intervals using, for example, a loess regression, or the kernel regression we just learned; we could draw a line through all these slightly variable points and get a simulated 95% confidence interval that looks really nice and smooth. If we were going to make this pretty, that's exactly what we would do. But for illustrative purposes it's actually kind of nice to see the variation, because you can see we are simulating different values for each value of Z, so you get a little bit of difference due to simulation variability. In general, though, we are recovering the right values, and we're getting the 95% confidence interval that we intended to. So that's a nice illustration of how parametric bootstrapping works: you repeatedly draw samples out of some target distribution, in this case the distribution of the betas, and that allows you to recover some quantity of interest, in this case a 95% confidence interval for an interaction marginal effect, dy/dx. In this case we could verify that it was correct by checking the exact calculation, which is nice to know and good for this illustrative example. But of course, in practice, what you're going to use this for is calculating the uncertainty around effects for which you don't know the exact formula, and there are all sorts of situations you might run into where that happens.
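For reference, here is a condensed sketch of the workflow just described, using the same kind of invented interaction data as the earlier sketch. The object names (beta.draws, z.fits) mirror the ones mentioned in the lecture, but the data, seed, and values are illustrative, not the lecture's own files.

```r
# A condensed sketch of the simulation ("parametric bootstrap") approach:
# draw coefficients from their asymptotic normal, compute dy/dx for each draw,
# and take percentiles of the draws.
library(MASS)   # for mvrnorm()

set.seed(506)
n <- 500
x <- rnorm(n); z <- runif(n, -10, 10)
y <- 1 + 2.5 * x + 1.0 * z - 0.2 * x * z + rnorm(n, sd = 3)
fit <- lm(y ~ x + z + I(x * z))

# 1. Draw coefficient vectors from N(beta-hat, VCV-hat).
beta.draws <- mvrnorm(1000, mu = coef(fit), Sigma = vcov(fit))

# 2. Compute dy/dx = b1 + bp * z for each draw, at each value of z.
z.fits     <- seq(-10, 10, by = 0.5)
dydx.draws <- sapply(1:nrow(beta.draws), function(i)
  beta.draws[i, "x"] + beta.draws[i, "I(x * z)"] * z.fits)
# rows index z.fits, columns index simulation replicates

# 3. Percentiles of the simulated draws give the confidence bounds.
dydx.mid <- apply(dydx.draws, 1, median)
dydx.lo  <- apply(dydx.draws, 1, quantile, probs = 0.025)
dydx.hi  <- apply(dydx.draws, 1, quantile, probs = 0.975)

plot(z.fits, dydx.mid, type = "l", xlab = "z", ylab = "dy/dx")
lines(z.fits, dydx.lo, lty = 2)
lines(z.fits, dydx.hi, lty = 2)
```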
And it's nice to have bootstrapping in your back pocket as a way of doing that, which is why that Clarify package by King, Tomz, and Wittenberg is so important and valuable: it encapsulates that tool, makes it easy to use for a wide variety of consumers, and allows them to assess the uncertainty of quantities they might not otherwise be able to. So that's an example of parametric bootstrapping. Parametric bootstrapping still relies in some ways on the parametric nature of the model. We might take an approach that's even less dependent on parametric models, which might be necessary if there isn't one, or if we can't write down the exact nature of the parametric model. So, for example, a loess model has some sort of parametric form, and we can actually calculate standard errors for it in analytic ways, but it might be easier to do it some other way; or we might not know how to write down the parametric form of the standard errors for some particularly complicated model. We might also think about doing this if there's something about the parametric formula for uncertainty that we think is questionably applicable to our situation. There are lots of ways that might happen. We might, for example, worry about certain forms of heteroscedasticity; typically there are easier ways to correct for that, such as just using White's heteroscedasticity-robust standard errors. But there are other reasons why we might doubt the parametric formula for uncertainty. For example, we might recognize that the model we're using is just an approximation, and so, given that approximation, there's no guarantee that the derivations of the VCV matrix are valid and truly represent our uncertainty in the estimates. And the biggest reason, at least in my book, is that very often this is simply easier than parametric approaches or parametric bootstrapping when we don't know what the parametric model looks like: it may be easier to bootstrap the answer than to sit down, figure out a proof, and write down some variance estimate formula. So what nonparametric bootstrapping typically involves is sampling out of the data set, with replacement, to create what's called a bootstrap data set. You've got some kind of sample of observations, X1, X2, X3, and so on all the way up to Xn; this is the actual sample data set. What you want to do is create bootstrap samples, B1, B2, and so on; you want to create a lot of them, let's say K of them. And what you do to create each bootstrap data set is sample out of your sample with replacement. So suppose, to keep it simple, there are only three observations in my sample. What I want is an equal chance of selecting any one of those three observations. You might roll a three-sided die, some kind of die that gives an equal probability of selecting each observation. So, for example, the first observation I pick might be X1, but then I might pick X1 again, and then I might pick X3. The next time I might pick X3, then X2, then X3 again. Or I might pick X2, and X2, and X2; I might pick the same observation three times. And I'm going to do this a whole lot of times.
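Mechanically, that resampling loop looks like the minimal sketch below, written as a plain loop (rather than a package) just to show the moving parts. The regression data set is invented for illustration.

```r
# A minimal sketch of the nonparametric bootstrap loop just described:
# resample rows of the data with replacement, re-estimate the model, and
# collect the coefficient of interest.
set.seed(99)
n   <- 150
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 0.6 * dat$x + rnorm(n)

K <- 1000                      # number of bootstrap data sets
beta.boot <- numeric(K)

for (k in 1:K) {
  rows     <- sample(1:n, size = n, replace = TRUE)   # the "die roll", n-sided here
  dat.boot <- dat[rows, ]
  beta.boot[k] <- coef(lm(y ~ x, data = dat.boot))["x"]
}

# The collection of K estimates approximates the sampling distribution of beta-hat.
hist(beta.boot, breaks = 40)
median(beta.boot)
```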
And the reason I'm doing this a whole lot of times is that I'm mimicking the process of what would happen if I sampled repeatedly out of the true distribution of X that's out there in the population. Of course, the rub is that there is a population out there, and this sample is a sample out of it, but we don't know what it is. Our best guess of the distribution of X is the so-called empirical distribution found in our sample. So we're using the empirical distribution of the target variables in our sample as our best estimate of the true distribution in the population. This procedure is laid out here in steps one through six, the procedure I just described. What you do with each one of these bootstrap data sets is estimate some model on it and save the relevant model quantities out of that estimate. You repeat this for every bootstrap data set, so that each one of these bootstrap data sets gives you some kind of estimated quantity you're interested in. So maybe we're interested in some parameter estimate: I've got an estimate corresponding to each of my bootstrap data sets, K of them. What I can then do is say, okay, I've learned something about beta, or beta hat, from the fact that I repeated the estimation of beta hat in each one of these samples. A simple thing to do might be to say, okay, I want to know something about the distribution of beta hat, so I'll create a frequency distribution: the frequency of beta hat i, where i denotes a bootstrap replicate. I can construct a density estimate or a histogram from that and say, okay, this is my nonparametric bootstrap simulation of the density of beta hat. In fact, if I believe that beta actually exists in the world, what I'm doing via this procedure is estimating the density of beta itself. So whatever the median of this estimated bootstrap density is, that's my estimate of the median of beta, and I might put a bar on that; actually, I think the convention is a double bar for the median and a single bar for the mean, I always forget these conventions, so let's put a double bar on it, beta double bar, and call that our estimate of the median of beta. Now, what we're doing is appealing to properties of sampling distributions in order to justify what we've done here, and we're going to be able to do multiple things with the products of bootstrapping. So I want to talk a little bit about what we can do with those products. One thing we could do is directly construct 95% confidence intervals by selecting the 2.5th and 97.5th quantiles of the simulated quantity. So, going back to the previous page, we simulated lots of draws of beta hat. If we want to construct what's called a percentile confidence interval for beta hat, we just select the 2.5th and 97.5th quantiles of all of our simulated values for beta hat, and that becomes our confidence interval. So, recreating that graph from earlier, here's our f of beta hat: we construct an estimate of the density of beta hat using our bootstrapped estimates, and here's the beta hat right here.
Now, what we're doing is appealing to properties of sampling distributions in order to justify what we've done here, and we can do multiple things with the products of bootstrapping, so I want to talk a little bit about what those are. One thing we can do is directly construct 95% confidence intervals by selecting the 2.5th and 97.5th quantiles of the simulated quantity. Going back to the previous page, we simulated lots of draws of beta hat. If we want to construct what's called a percentile confidence interval for beta hat, we just select the 2.5th and 97.5th quantiles of all of our simulated values for beta hat, and that becomes our confidence interval. Recreating that graph from earlier, here's our f of beta hat: we construct an estimate of the density of beta hat using our bootstrapped estimates, and here's beta hat right here. Then we just say, here's the 2.5th quantile, here's the 97.5th quantile, and there's our bootstrap estimate of the 95% confidence interval. That's a very simple way of using bootstrap estimates to construct confidence intervals, and of course we can conduct hypothesis tests in a related way. If, for example, we want to test whether beta is statistically distinguishable from zero, we look and see whether that 95% confidence interval overlaps zero; if not, we reject the null hypothesis that beta is zero at the 0.05 level, two-tailed, the usual way of applying confidence intervals to hypothesis tests. You can also, incidentally, construct bootstrap percentile p-values in a similar way. For example, if this is zero right here and our test is for beta being greater than zero, we can compute the p-value as the proportion of the bootstrap samples that fall to the left of zero, and that forms a one-tailed p-value for whether beta is bigger than zero. So it's a pretty standard procedure. But there are other things we could do as well. For example, instead of bootstrapping beta hat directly, we could bootstrap some construct of beta hat. Maybe I'll bootstrap the variance of the estimate: I'll get a bootstrap estimate of the variance of beta hat, assume that beta hat is distributed normally, and use my bootstrap estimate of the variance in the usual confidence interval formula. As you may know, the ordinary formula for a 95% confidence interval, 0.05 two-tailed, is beta hat plus or minus 1.96 times the standard error of beta hat. All we're doing here is saying, instead of using a standard error computed from a parametric formula, estimate it via bootstrapping by computing the variance of all of the bootstrap estimates we recovered earlier. We get a bunch of beta hats out of our K many bootstrap replicates, we compute the variance of that collection, and that becomes our estimate of the variance. But once we have that bootstrap variance, we plug it back into a formula that assumes a normal distribution of beta hat, namely this one. As you can see, the assumption of normality comes in through the t-statistic multiplier here, the 1.96; that multiplier obviously assumes that beta hat is distributed t, or in the limit normally. So that's another way of using bootstrap estimates that's a little more parametric than the pure percentile approach, but it allows for a little bit of bad behavior, in a dirty data set for example, that might make the parametric variance formula a bad approximation for the true variance of beta hat.
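Continuing the hypothetical beta.star vector and dat data frame from the sketch above, the percentile interval, a one-tailed percentile p-value, and the normal-approximation interval might be computed along these lines; this is a sketch of the logic, not the lecture's own code:

quantile(beta.star, probs = c(0.025, 0.975))      # percentile 95% confidence interval
mean(beta.star < 0)                               # one-tailed percentile p-value for the test that beta > 0
beta.hat <- coef(lm(y ~ x * z, data = dat))["x"]  # original point estimate from the full sample
beta.hat + c(-1.96, 1.96) * sd(beta.star)         # normal-approximation interval with the bootstrap SE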
There's one more thing we could do: we could bootstrap some relevant statistic, like the t-statistic, instead of beta hat itself. We calculate the t-value for each bootstrap estimate of beta hat and use the critical t-values we recover from bootstrapping to construct a confidence interval. So you run K many replicates, you estimate K many beta hats, one from each data set, and you compute the t-statistic for each beta hat replicate, taking the originally estimated beta hat as the true value. Then you use those bootstrapped t-values as the multiplier in the interval formula. In short, I'm using bootstrapping to determine the multiplier that goes in front of the confidence interval formula: instead of assuming it's 1.96, which would assume normality, I'm going to compute what limits I should be using given repeated bootstrap samples. What you'll see here is that we've mixed various degrees of parametric and non-parametric techniques to create these confidence intervals. The reason we do that is that the more we can lean on parametric assumptions, the more efficient an answer we'll get and the narrower the confidence intervals will be, and that's good. The question is whether those assumptions are good ones: is adding this parametric information to the model a good idea? There is no universal answer. Monte Carlo simulations can help you determine, in various candidate scenarios, whether a given degree of parametric approximation is justified. A generally good rule is to see whether all three approaches give you a statistically significant result; if they do, maybe use the most conservative of them just to be safe, and if they give different answers, then really start to delve into your particular problem and see whether the assumptions you're using to construct these confidence intervals are justified. This is easiest to see if you actually do it, so let's go into R and work through it, to give you a better sense of what we're doing here and how the approaches differ. All right, so let's do some non-parametric bootstrapping. Just as before, I'm going to make a data set that comes out of an interaction-term linear model. There we go: I've created this data set in a data frame called dat, which labels the variables y, x, and z. And what I'm going to do is create a percentile bootstrap estimator. To do this I'll use the bootstrap library, which you can see I've loaded up here; the bootstrap library just implements some easy ways to loop over and sample with replacement out of a sample data set. But to make this work, you have to tell it what you want to compute repeatedly; this is the so-called theta argument of the bootstrap command. If you call up the help file for the bootstrap command, you'll see that you have to give it a theta, which is the function to be bootstrapped. We're going to give it a function called boot.funk, and the bootstrapping is actually going to take place over an index value: I'm going to repeatedly draw indices for the data set, that is to say, rows of the data set to sample. Then dat.in is going to be the data set we're actually sampling out of. The very first line of this function just says, select the rows of the data set picked out by the index you're given and call that dat.boot. The reason we have to do it that way is that the bootstrap command likes to bootstrap out of one-dimensional vectors, so if you've got a multi-dimensional matrix to bootstrap out of, like any data set typically would be, you bootstrap out of its indexes from one to n instead of bootstrapping out of it directly. That's the only trick there; a sketch of what such a function might look like follows.
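The lecture's own code isn't shown in this transcript, but a boot.funk-style theta function and the bootstrap call might look roughly like this; the interaction model, the grid of z values, and the argument names are my guesses at what's on screen:

library(bootstrap)                            # provides the bootstrap() function used here

zplot <- seq(-10, 10, by = 1)                 # grid of z values at which to evaluate dy/dx
boot.funk <- function(index, dat.in) {
  dat.boot <- dat.in[index, ]                 # the rows of the data selected by the bootstrap
  b <- coef(lm(y ~ x * z, data = dat.boot))   # re-estimate the interaction model
  b["x"] + b["x:z"] * zplot                   # the dy/dx line over the grid of z values
}
boot.out <- bootstrap(1:nrow(dat), nboot = 500, theta = boot.funk, dat)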
So out of this bootstrap data set, the boot.funk function recovers the coefficients from a linear model and then constructs, just as before, the dy/dx line for a bunch of different values of z; this is the dy/dx line for values of z from negative 10 to 10, and it spits that line back out. If I run boot.funk directly and give it a bunch of index values here (the length of the data set is 50, so I can't give it indices over 50, or it will crash) along with the data set dat, it returns a line: it took the indexes I fed it, ran a regression, and recovered the estimated line out of that resampled data set. I'm just going to do this over and over again, and the bootstrap command is the thing that supplies the lists of indexes. It hands boot.funk a whole bunch of index vectors of length 50, over and over again, and those might, in fact almost certainly will, include some repeated observations, because we're sampling with replacement. So that's what I'm doing in this bootstrap command: I'm bootstrapping boot.funk, doing it 500 times, using the data set dat. If I run it once and show you the overall output, you'll see it produces a whole bunch of stuff. The first thing it produces is the actual bootstrap replicates, and we've got a lot of them: 500 columns, which correspond to the replicates, and 21 rows, which correspond to all the different points of z at which we're bootstrapping dy/dx. It also produces func.thetastar, jack.boot.val, jack.boot.se, and the call; these correspond to optional arguments you can use to have the bootstrap command directly compute a function of the replicates and its so-called jackknife variance estimate. Jackknifing is a related non-parametric technique that I'm not covering in this lecture; if I were going to do it, I probably would not do it internally to the bootstrap command. I would use the separate jackknife command, or write my own jackknife routine. But just so you know, it's there. For this exercise we can ignore those additional outputs and focus on the thetastar output right here, which holds the bootstrap values of theta. So I'm going to store in boot.samp the only piece of this I want, which is the bootstrap replicates. Now I'm going to construct bootstrap confidence intervals: for each one of these rows I ask for the 2.5th, the 50th, and the 97.5th quantiles of my bootstrap replications, save that in boot.ci, and then plot the result, making sure that zplot, the grid of z values (not xplot), is loaded in memory so there's something to plot against. So there's our simulated dy/dx confidence interval, and the central line is simulated as well; we got it via bootstrapping too. And we bootstrapped non-parametrically out of the sample data set, instead of bootstrapping out of the limiting distribution of beta as we did in the parametric case.
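Put together, the steps just described might look like this sketch, continuing the hypothetical boot.out and zplot objects from above; the plotting details are illustrative:

boot.samp <- boot.out$thetastar                   # 21 x 500 matrix of bootstrapped dy/dx lines
boot.ci <- apply(boot.samp, 1, quantile,
                 probs = c(0.025, 0.5, 0.975))    # lower bound, median, upper bound at each z
plot(zplot, boot.ci[2, ], type = "l", ylim = range(boot.ci),
     xlab = "z", ylab = "dy/dx")                  # bootstrapped central line
lines(zplot, boot.ci[1, ], lty = 2)               # lower 95% bound
lines(zplot, boot.ci[3, ], lty = 2)               # upper 95% bound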
So, to give you a sense of how this non-parametric bootstrap compares to the exact calculation of the confidence intervals, what I'm going to do is add in the confidence intervals that we calculated previously via the exact method, and plot them in blue so you can see them distinctly. I just come in here and add the blue. If those objects are still in memory, let's see... I bet they're not, so we need to reconstruct them; that's okay, it only takes a second. There we go. So there are the bootstrap confidence intervals plotted against the blue exact confidence intervals, and you can see they're pretty close. The blue confidence intervals are a little bit narrower. You can also see that when you get to the right side of the graph, where zplot is near 10, the bootstrap confidence intervals actually go further up than the exactly calculated ones. But they're pretty close, and the exactly calculated confidence intervals are actually inside of the bootstrap ones over here on the left. That's not surprising, for two reasons. First, the exact ones are narrower because they take advantage of parametric information and thus are more efficient; and given that the parametric assumptions are completely accurate in this case, because we generated the DGP and know the data come out of the right model, they're going to be a little bit better. Second, the bootstrap confidence intervals sit in a slightly different location, which is also to be expected, because the percentile estimator is not bound to any kind of linearity assumption; it's not forced to make things balance in a certain way. So it's not surprising that our confidence intervals look a little different in terms of their location. In this case that's not actually advantageous, because we know the parametric estimates are probably going to be better; but in cases where we don't know what the DGP is, which is to say every real data set, it might well be better. So that's one way of doing non-parametric bootstrapping. Now suppose I impose the t distribution like I did before, but use bootstrap bias-corrected mean estimates instead. What am I doing here? I'm estimating the linear model and the coefficients just as I did before. Then I'm bootstrapping, and from my bootstrap samples I'm calculating the standard error of the bootstrap samples; that's the estimated standard error of dy/dx for each value of z I've calculated it for, so I've estimated the spread of the estimates at each point. I can also calculate how far the average bootstrap sample estimate is from the original estimate that we recovered via the model; that's what's called the bootstrap sample bias. All I'm doing is creating a matrix of the original estimated dy/dx's from the sample, subtracting each of my bootstrap calculations of the same quantity from it, and computing the mean to get the bootstrap bias.
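A minimal sketch of that bias-corrected, normal-approximation construction, continuing the hypothetical dat, zplot, and boot.samp objects from above; note it uses the standard sign convention for bootstrap bias (bootstrap mean minus original estimate), so the labels are mine rather than the lecture's:

b <- coef(lm(y ~ x * z, data = dat))           # original model on the full sample
dydx.hat <- b["x"] + b["x:z"] * zplot          # original dy/dx line at each z
boot.se <- apply(boot.samp, 1, sd)             # bootstrap standard error at each z
boot.bias <- rowMeans(boot.samp) - dydx.hat    # bootstrap estimate of the bias
center <- dydx.hat - boot.bias                 # bias-corrected central line
upper <- center + 1.96 * boot.se               # normal-approximation bounds using the bootstrap SE
lower <- center - 1.96 * boot.se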
So all this bootstrap bias vector tells us is, on average, how far the line estimated from the original sample sits from the average estimate across the bootstrap samples. In other words, we're allowing for the possibility that the original estimate is biased in some way and assuming that bootstrap resampling lets us recover that bias, so we can adjust our original estimate accordingly. That's what I've done here: I've calculated the main line by subtracting the so-called bootstrap bias, and then, to get the bounds, I've added plus or minus 1.96 times the bootstrap standard error to that main line. If I plot what I've just created, there we are: there's our semi-parametric estimate of the confidence interval. Semi-parametric because I've used bootstrapping non-parametrically to determine the standard error and the bias of the line, so to speak, but I've used the parametric assumption that the distribution of beta is normal when I plug those into the usual confidence interval formula. Here's how this compares to the percentile bootstrap estimates: again, they're a little different, in the same ballpark but a little bit different. We would expect the semi-parametric estimates, leveraging the t formula, to be a little more efficient. It actually looks like you don't get a lot of efficiency out of the procedure in this case, but in general we might expect it to be a bit more efficient because a bit more parametric information is included. So that's a way you can use non-parametric bootstrapping to inform a parametric confidence interval result. The third thing is bootstrap-t confidence intervals. Here, instead of saving the bootstrap estimates directly, we calculate the t statistic for each bootstrap sample. What I've done is create a matrix called boot.t, and a thousand times I'm going to draw a sample with replacement from the data set, run a model on that sample, recover the coefficients, calculate dy/dx using the interaction-term formula, recover the variance-covariance matrix from the model, and calculate the standard error of dy/dx at each point z in that bootstrap replicate. Then finally, and this is the piece I store, I calculate the t statistic: the distance from the original line calculated from my sample to the line I just calculated from the bootstrap sample, divided by the standard error calculated from the bootstrap sample. Note that in this case se.boot is not computed by doing a thousand estimates and then taking the standard deviation across them; we're doing something different this time. We're recovering the estimated standard errors from the model estimated on each resampled data set. So se.boot, as you can see, is calculated using that exact formula: the variance of beta 1, plus z squared times the variance of beta 2, plus 2 times z times the covariance between beta 1 and beta 2. We're using the exact formula; the trick is that we're using bootstrap replicates of the model to calculate bootstrap replicates of the standard error, which gives us bootstrap replicates of t. A sketch of that loop, and of how the resulting critical values get used, follows.
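Here is a rough sketch of the bootstrap-t construction, continuing the hypothetical dat, zplot, and dydx.hat objects from above; the thousand replicates and the coefficient names are assumptions for illustration:

n <- nrow(dat)
boot.t <- matrix(NA, nrow = 1000, ncol = length(zplot))
for (k in 1:1000) {
  idx <- sample(1:n, size = n, replace = TRUE)       # resample rows with replacement
  fit <- lm(y ~ x * z, data = dat[idx, ])            # re-estimate the interaction model
  b.k <- coef(fit)
  V.k <- vcov(fit)                                   # variance-covariance matrix of this replicate
  dydx.k <- b.k["x"] + b.k["x:z"] * zplot            # dy/dx line from this replicate
  se.k <- sqrt(V.k["x", "x"] + zplot^2 * V.k["x:z", "x:z"] +
               2 * zplot * V.k["x", "x:z"])          # exact-formula SE from this replicate
  boot.t[k, ] <- (dydx.k - dydx.hat) / se.k          # t statistic against the original line
}
# Bootstrapped critical values, combined with the SE from the original model:
fit0 <- lm(y ~ x * z, data = dat)
V0 <- vcov(fit0)
se.hat <- sqrt(V0["x", "x"] + zplot^2 * V0["x:z", "x:z"] + 2 * zplot * V0["x", "x:z"])
t.vals <- apply(boot.t, 2, quantile, probs = c(0.025, 0.975))
lower.t <- dydx.hat + t.vals[1, ] * se.hat           # the 2.5th t quantile is typically negative
upper.t <- dydx.hat + t.vals[2, ] * se.hat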
So I'm going to do this a thousand times, and on a fast computer it only takes a second to calculate these thousand bootstrap estimates. Already done. Now, if you look at the head of boot.t, what I've got here, for all 21 points of z from negative 10 to 10 that I'm plotting, are t statistics corresponding to how, quote unquote, statistically significant the difference was between the estimated dy/dx in that bootstrap sample and my original estimate. So what I'm going to do is figure out the 2.5th and 97.5th percentiles of those t values at each point z. If you look at t.vals right here, these are the bootstrapped t limits I should use to calculate a 95% confidence interval if I'm relying on the bootstrap estimates. Normally I would go from negative 1.96 to 1.96 or thereabouts; this is a small sample of only 50, so with the usual t-distribution adjustment I might pump that multiplier up a bit, into the low twos. What you can see here is that my bootstrap estimates of the critical t statistic, because that's what I'm bootstrapping, the critical t statistic, are a little bigger still: they're in the 2.3 range for small values of z, close to negative 10, and they drift around, up toward 2.5, at various other values of z. So now I'm going to use those t values as the multipliers to construct my confidence interval. Instead of 1.96 times the standard error, I'm using these t values that I created; and remember, the standard error is coming from the original model on the original sample, not the bootstrap. It's the critical t multiplier that comes out of the bootstrap. If I plot that, I get confidence intervals again. How do they look compared to the bootstrap confidence intervals from the percentile method? They're a little different, a little bit further down, but in the same ballpark again. So in this case we've leveraged a little bit of parametric information, because we're assuming that the distribution of beta is normal, but we're not assuming that it necessarily follows the usual asymptotics of the normal; it might be that the critical t statistics differ at each point z depending on local qualities of the sample. We've bootstrapped something, but it's something different. To sum up what we've done in these cases: we're always bootstrapping, always drawing repeated samples out of the data to construct repeated measurements of some quantity of interest. What changes is what we compute and store, and how we use it to construct the final estimates. In pure percentile bootstrapping we're just saving the betas, and this is usually what Clarify does. In the bootstrap-t confidence interval case, and in the semi-parametric, bias- and standard-error-corrected case, we're saving either t statistics or standard errors and bias statistics from the bootstrap. So we're always bootstrapping something, but what we're bootstrapping can differ from case to case, and that might get you a little more efficiency in some cases. There's a lot of literature written on this, and I'm not an expert on it by any means; you can go much deeper into figuring out when each of these approaches is optimal in different circumstances.
What I recommend, as usual, is to read the literature for your particular application and try to get a sense of which of these techniques might be the most appropriate; if all else fails and there's no literature on your specific application, which is very common, try doing some exploratory Monte Carlos to see which bootstrap approach will be the most efficient and give you the best answers on average. For the last bit today, I want to talk a little bit about some new ideas that are coming into political science on the non-parametric front. One of those ideas is laid out in Kelly Rader's 2010 manuscript, where she addresses an outstanding problem in political methodology: how to deal with data that is intrinsically clustered, where we expect complex correlations among the independent variables, and between the independent variables and the dependent variable, inside each cluster. A canonical example is a laboratory experiment where you run R sessions: we can't have all the subjects in the lab at the same time, so we run, say, five or six sessions for each of two or three treatments, and the total number of subjects is K times R, where K is the number of people in each session and R is the number of sessions. The problem is that the unmodeled portion of the variance, the error term in whatever we're trying to model, might be very complicated, because the subjects' errors inside a session might be correlated with one another in unpredictable ways. The idea is that when you put a bunch of people in a room and have them interact as part of a laboratory experiment, some kind of economic experiment for example, the fact that they're interacting with each other on a particular day, at a particular time, with a particular crowd of people might have some influence on what they do, net of the effect of the treatment. And we're not measuring that, because, first, we have no way, or only very poor ways, of recovering the necessary covariates about who those people are, what they're doing, and how they interact with one another; and second, we may not have much leverage, because we're only running four, five, or six sessions per treatment, so with four sessions per treatment and three treatments, in some sense we only have twelve observations at the session level. The further problem is that these treatment conditions are typically constant within a session, so different sessions are exposed to different treatments, and we want to be able to recover the effect of the treatment without having it commingled with whatever session-level clustering is going on in the error term. If we don't model the within-session, inter-subject correlation, it can interfere with our estimation of the variance of the treatment effect, and usually the problem is that it pushes that variance estimate to be too small. It could be that there's greater variance in the treatment effect than we realize, because inter-subject correlation in the error terms that is positive within a session tends to lump the error terms together, which de-emphasizes the variation in the treatment effect: the clustering pulls all the observations in a cluster closer together.
Now, there are lots of ways of dealing with this. One of the standard ways, popular in econometrics and particularly in the experimental economics literature, is clustered robust standard errors. Without going into great depth about what CRSEs are or what they do, it suffices to say that they can perform poorly when the number of sessions is small. We expect clustering at the session level in this kind of experiment, and if you have a small number of sessions, cluster robust standard errors can actually make the variance deflation problem worse: what often happens is that the estimated variation in the treatment effect gets shrunk smaller than it should be, even beyond what it would be with vanilla standard errors. And to make matters more complicated, there are some rare cases where the estimated variance of the treatment effect is too big and cluster robust standard errors shrink the standard errors appropriately; it's very difficult, though not impossible, to tell in any particular data set whether that's what's happening. But there have been some fairly recent Monte Carlo studies, published in Political Analysis and elsewhere, and in Angrist and Pischke's book for example, indicating that on average, on balance, you should not use cluster robust standard errors when you have a small number of clusters. That raises the question of what to do instead. There are some suggestions in the Angrist and Pischke book about how to deal with it, but Kelly Rader has a suggestion in her article, where she proposes applying what's called a randomization test. It's a generalization of the non-parametric principles we've been applying throughout this lecture. The idea in this particular application is to estimate a model on a sample where the data generating process is something like y = beta0 + beta1*x + beta2*z, and since that's the DGP I should also write in an error term, so y_ij = beta0 + beta1*x_ij + beta2*z_j + epsilon_ij.
There's our example DGP, where the j index denotes the group, that is, the session, and the i index denotes the individual. Each session is associated with a group-level treatment, so all the people in a session get the same treatment. You estimate the model on the sample data set, but then, to calculate the variance of beta 2, what you do is shuffle the values of z among the different session groups while ensuring that each member of a group receives the same value of z. So you take the z observations, which are constant inside a group, take each group's z away, and replace it with another group's z; you don't replace each individual's z with a randomly selected z, you replace the entire group's. What you're doing, in effect, is a form of resampling by reassignment, permutation testing is actually what it's called, except that you're shuffling the values of the variable at the group level rather than at the individual level. You do that a number of times, say a thousand, you compute whatever t statistic you're interested in for that particular variable in each shuffled replicate, and that gives you a distribution of the test statistic. Then you compare the value of the test statistic calculated from the original sample, a t for example, to the distribution you constructed via this permutation process, and you compute the p-value by looking at how far out in that permutation distribution your original t statistic lies. So the idea is this: we've got some kind of test statistic that we've calculated from a data set, here's t right here, but we don't know what its distribution is. If the null is zero, it might be wrong to use the standard t distribution; that might be a bad choice, because intra-class clustering may make the distribution of the statistic no longer what we would ordinarily expect it to be under the usual classical linear normal regression model assumptions. So we run this permutation, or randomization, process to build up a whole bunch of t statistics, one for each permuted sample, and use them to construct a new test statistic distribution. We expect that distribution to be wider than the usual one, although it doesn't necessarily have to be, and the effect is to deflate the apparent significance of the t statistic we calculated, because the distribution of the test statistic now accounts for the intra-class correlation. She has some reasonably preliminary Monte Carlo simulations to back up this procedure, and it's also supported by some previous literature, which she cites. If you're following along with the homework assignments, one of the assignments for this week is to redo the Monte Carlo simulations she did to justify this procedure, and then to relax some of the assumptions she imposed, as she suggests on the final page of the paper, to figure out whether the testing procedure is robust to those relaxations.
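To fix ideas, here is a minimal sketch of such a group-level randomization test in R, assuming a hypothetical data frame dat with an outcome y, an individual-level covariate x, a session identifier session, and a session-level treatment z; the thousand permutations and the use of the t statistic on z are illustrative choices, not Rader's own code:

sessions <- unique(dat$session)
z.by.session <- dat$z[match(sessions, dat$session)]   # the single treatment value attached to each session
perm.t <- numeric(1000)
for (k in 1:1000) {
  z.perm <- sample(z.by.session)                      # shuffle treatments across whole sessions
  dat$z.star <- z.perm[match(dat$session, sessions)]  # every member of a session gets that session's shuffled value
  perm.t[k] <- summary(lm(y ~ x + z.star, data = dat))$coefficients["z.star", "t value"]
}
t.obs <- summary(lm(y ~ x + z, data = dat))$coefficients["z", "t value"]
mean(abs(perm.t) >= abs(t.obs))                       # two-tailed permutation p-value for the treatment effect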
In particular, going back to the original model in her paper, she assumed that x and z were uncorrelated, and very often it's the case that they are quite correlated: the things going on inside a group are the same not only in the unmodeled stuff; some of the stuff you're trying to measure is also the same. This procedure would apply to experimental data sets, as I've mentioned, but it would also apply to lots of other cases, for example survey data sets organized by country, where you expect clustering by country. Anyway, this is an interesting idea and an interesting application of non-parametric techniques, and it might eventually lead us to solutions to some outstanding problems in political science. Well, that's it for this week. Thanks for listening, and I'll see you next time.