It's really quite an honor to be here, and a real pleasure for me; it's my first time professionally visiting not only Aarhus but Denmark as a whole. As you can tell from both my name and my region, I obviously have some sort of Scandinavian connection: if you're not familiar, Wisconsin was settled by Scandinavians back in the day, after they kicked the natives out. But today the great center of time series and econometric research in Scandinavia is right here in Aarhus, with the CREATES group, so it really is quite a pleasure to be here.

So what am I talking about today? Averaging, because I'm an average type of guy. This is some work I've been doing for a while on model averaging; here is the overview. The idea is that as an economist or econometrician you don't have just one model; you may have multiple models, and this arises for a variety of reasons. You could have multiple competing theories that you want to think about using. You could have different specifications of the model: it could have to do with nonlinearities in the variables, in time series it can have to do with the lag structure, you could have different regressors that you may or may not include in the model, different polynomial orders or series expansions. Then the question naturally arises: which model should I use? That's the way people typically see it. In fact, if you look at a standard economics publication, it's a real rarity to find a paper that acknowledges only one specification, estimates only one specification, and never discusses the issue. Most papers have tables with different columns, each corresponding to a different specification, and then some chatty talk about why the different columns are interesting or different or useful. So people typically do acknowledge that there are multiple specifications, but the discussion of which model to use is often informal or, if formal, based on testing. Testing addresses scientific questions, can I accept or reject a hypothesis; it doesn't tell you what is a good model to use in practice.

Oftentimes people have this metaphor that I should be using the "true model"; a lot of people keep using that phrase, let's find the true model. I think that's silly; truth has nothing to do with it, because none of these models are true. This goes back to my former colleague George Box, who said that all models are approximations, but some are more useful than others. The concept in practice is that all the models are lenses through which we view the world, and we want to think about which one is a good approximation.

One way to see this is a really, really stylized, simple example. Take the simple location model: there is an unknown mean for y, and you can consider two estimators of that mean. One is the sample mean, the standard estimator, the MLE under normality. The other is the silly estimator: zero. It sounds silly, but it's actually not an unreasonable estimate in many contexts, just to use the best guess you have; and zero is not essential, it could be any number. The mean squared errors of those two estimators are simply σ²/n and μ². That tells you to pick the sample mean when μ² is bigger than σ²/n, and the silly estimator otherwise. But that's kind of silly: you don't know what μ² is.
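In display form, the two mean squared errors just mentioned are the standard calculation:

```latex
\mathrm{MSE}(\hat\mu) = \mathrm{E}(\bar{y} - \mu)^2 = \frac{\sigma^2}{n},
\qquad
\mathrm{MSE}(\tilde\mu) = (0 - \mu)^2 = \mu^2 .
```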
So the comparison still doesn't tell you which is the best estimator; it depends on the unknown truth. If you're trying to minimize mean squared error, you'd want to pick one or the other based on this unknown quantity. Of course you could say that's kind of silly, the simple estimator is not very useful; but let's think about another estimator, an averaging estimator. I'm going to average the sample mean μ̂ and the silly estimator μ̃, where the weight on the sample mean is one minus (p − 2), p being the number of parameters I'm estimating, divided by a chi-square-type statistic for the mean built from ȳ′ȳ. That estimator is better than either of the other two. It's better than the silly estimator in the sense that it has bounded risk, and it's better than the standard sample mean in that it has uniformly smaller mean squared error, regardless of the true value of μ. It is simply a better estimator in terms of mean squared error. That is not an original idea; it goes back to James and Stein (1961). So you can do better than a sample mean by taking averages of standard estimators and other estimators. And μ̃ is not really that crazy an estimator; it's simply a restricted estimator. We always think about models with restrictions, and this happens to be the model with the restriction that the mean is zero.
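As a minimal sketch of this point (my own toy code, not from the talk; the dimension, sample size, and true mean are arbitrary choices), a small Monte Carlo confirms that the James-Stein-type average has smaller mean squared error than the sample mean:

```python
# A minimal sketch, assuming the standard James-Stein form with known sigma^2:
# average the sample mean ybar and the "silly" estimator 0 with a data-driven weight.
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma2, reps = 8, 50, 1.0, 20_000
mu = np.full(p, 0.5)                  # true mean vector (any value works)

mse_mean = mse_js = 0.0
for _ in range(reps):
    y = rng.normal(mu, np.sqrt(sigma2), size=(n, p))
    ybar = y.mean(axis=0)             # the usual sample mean
    # weight on the sample mean: 1 - (p - 2) * sigma^2 / (n * ybar'ybar)
    w = 1.0 - (p - 2) * sigma2 / (n * ybar @ ybar)
    mu_js = w * ybar                  # weighted average of ybar and 0
    mse_mean += np.sum((ybar - mu) ** 2)
    mse_js += np.sum((mu_js - mu) ** 2)

print("MSE of sample mean: ", mse_mean / reps)   # about p*sigma2/n = 0.16
print("MSE of James-Stein: ", mse_js / reps)     # strictly smaller, for any mu
```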
This also relates to the concept of forecast combination. Oftentimes people get a little uncomfortable with model combination for general estimation problems, but it seems very natural when it comes to forecasting, because there we don't care so much about the model that gave rise to the forecast; we care about the forecast itself. I think the original articulation of the combination concept is due to Bates and Granger, in a famous paper from 1969. Their idea: suppose you have a collection of forecasts or forecasting models. It's the same question as before: should you pick one or the other? What should you report? They said no, don't think of it that way; think about combination. You have all these forecasts; don't throw them away, use them all. But how? Take the average. And should it be a simple average or a weighted average? Well, a weighted average is the more general idea, so take a weighted average of the forecasts. They pointed out that it's pretty obvious a weighted forecast can do better than any individual forecast; the only trick is how to pick the weights. So that is the first great intellectual breakthrough: a weighted average has the potential to do better. The second breakthrough was the way to think about the problem: mean squared error. The mean squared error, or risk, of the estimator cuts through the complicated question of how to pick the weights; the methodology is to find the weights that minimize the mean squared error. Their third intellectual contribution was a particular algorithm for picking the weights: assume the forecasts are uncorrelated, in which case it's easy to compute that the optimal weights are the inverses of the variances of the individual forecasts. That gives a particular weighting rule, the Bates-Granger forecast combination rule. The third contribution is not as great as the first two, because it's kind of silly to assume the forecasts are uncorrelated; but the first two ideas were the great intellectual contributions. First, combination gives you the potential to do better. Second, mean squared error is the guiding light by which we can solve the problem.
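As a minimal sketch of the Bates-Granger rule just described (my own illustration; the error variances are made up), the optimal weights under uncorrelated forecasts are normalized inverse variances:

```python
# A minimal sketch of the Bates-Granger (1969) rule: if forecast errors are
# assumed uncorrelated, the MSE-minimizing weights are inverse to the
# individual forecast error variances, normalized to sum to one.
import numpy as np

def bates_granger_weights(error_variances):
    """Inverse-variance weights on the unit simplex."""
    inv = 1.0 / np.asarray(error_variances, dtype=float)
    return inv / inv.sum()

# e.g. three forecasts with error variances 1.0, 2.0, 4.0:
print(bates_granger_weights([1.0, 2.0, 4.0]))
# [0.571 0.286 0.143] -- the more precise forecast gets the larger weight
```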
I almost had a title about big data; apparently for me they switched it. Everyone talks these days about big data; no one talks about small data anymore. Connected to that is something called machine learning. I have no idea what machine learning is, but I keep hearing people talk about it. What is machine learning? I think the way econometricians like to think of it is the way Hal White told us to think about neural nets: it's a fancy new buzzword for fancy new nonparametric methods. Essentially you're trying to estimate something like a conditional mean with lots and lots of regressors; you throw it into some algorithm, you have no idea what it is, and it spits out forecasts. So we can think of it as high-dimensional, computer-intensive pattern matching; it's a prediction-oriented methodology, suited to cross-section rather than time series data. But these machine learning methods, like nonparametric methods in general, depend on tuning parameters, bandwidths, those kinds of things. How do you select them in practice? In the statistics literature: cross-validation. That's what is typically recommended for picking the tuning parameters of a lot of machine learning techniques, and it links them up with nonparametrics and standard methods. A particular modern machine learning method that's making the rounds is called ensemble methods, and ensemble methods are weighted averages of other machine learning methods; that is precisely forecast combination. Ensemble methods in machine learning are what we called forecast combination in econometrics back in 1969. Of course you have to pick the ensemble weights, and the modern technique is cross-validation, which is something that Jeff Racine and I called jackknife model averaging, or cross-validation weight selection, years ago. So it's all wrapped together: forecast combination, nonparametrics, cross-validation, machine learning; all these things are similar.

What I want to talk about for the rest of the time is something more specific than the broad concept of combination: a research project on vector autoregressions, linking in with the time series people at CREATES. It's a particular paper I've been working on called "Stein Combination Shrinkage for VARs"; it's still work in progress and the title may well change. The paper focuses on two objects, impulse responses and forecasts, and I'm going to concentrate on impulse responses, because that's where you get the more exciting results. A vector autoregression in the standard format has M variables and P lags; it's a workhorse model in applied economics, used for multi-step forecasting and impulse responses. But the main idea here isn't really specific to impulse responses; it applies any time the parameter you're trying to estimate is complicated.

One of the problems that arises in vector autoregressions is that they are easily over-parameterized. With US-type data we all have quarterly data for about 50 years; that's about 200 observations, not too many. If M is big, then M times P is the number of things on the right-hand side, and it's easy to see that least squares estimation of a big model is not going to be well behaved. So in the modern language of econometrics we say we have to regularize the system; regularize means tighten it up somehow. In the early literature, where vector autoregressions were promoted by Chris Sims, he essentially suggested keeping the dimensionality of the system small. He didn't articulate this as regularization, but I think that's what was in the back of his mind: a small-dimensional model is going to be a good approximation, and that's good enough; you don't need a big model. Particularly popular in that literature, also promoted by Chris Sims, is the Bayesian vector autoregression (BVAR) methodology, which instead of estimating by least squares estimates by a Bayesian method that shrinks towards, effectively, the Minnesota prior. They don't call it the Wisconsin prior or the Danish prior but the Minnesota prior, because Chris used to work at the University of Minnesota. It's the random walk with drift model, and the BVAR shrinks towards that. This is widely used in empirical practice; central banks and everyone use it. But there's really no theory, even though it's been around for God knows how many years; by theory I mean, from the frequentist point of view, an understanding of the sampling properties of the estimators. The methods have continued to evolve; there's a recent contribution in the Review of Economics and Statistics that is promoted as the state of the art of the BVAR universe, and it provides all the tuning parameters and MATLAB code. You can go download it and apply it to whatever data set you want, and it does extremely well in out-of-sample comparisons.

The goal of my project is to focus on combination methods, model averaging methods, that minimize mean squared error, as Bates and Granger suggested; and we're focusing on the mean squared error of impulse responses. The estimators will take Stein-type forms, as in the Stein estimator I mentioned before; you can also think of them as frequentist model averaging estimators in the style of Hjort and Claeskens. Effectively, they shrink the unrestricted least squares estimates towards parsimonious models. To specify some math so you can see it: the equation of interest relates y to its lags, and the error has a covariance matrix. Typically you want to look at a structural shock, so you decompose the variance of the error into H times H transpose.
That decomposition, I assume, is being done by the researcher; I'm not focusing on a particular structural model. In the empirical work I'll use the triangular Cholesky decomposition, but everything would work with any other decomposition of the shocks. The ε are the shocks to the system, and the impulse responses trace out the responses of the variables y. Once again, to repeat for people who don't work with impulse responses on a daily basis: the impulse response is the change in y in the future, h steps ahead, from a shock today; h is the impulse response horizon. The kind of application I'm looking at is the effect of monetary shocks. In the United States we worry that the Federal Reserve Board might raise interest rates; in Europe you worry about the central bank, the ECB. What is the effect of a monetary shock: will prices go up, will output go down, will wages go down, and at which horizons? Impulse responses are calculated over quite long periods, looking at long-horizon effects. They are effectively nonlinear functions of the estimated parameters, and therefore their statistical properties are governed by the underlying estimated parameters and by these complicated nonlinear transformations.

I started by saying that you have a bunch of models. In the context of the vector autoregression: do I have one lag, two lags, ten lags? Which variables am I including? If I'm looking at the impact of a monetary shock on GDP, should I include investment and consumption, should I include prices, should I include the number of Twitter comments made by the President? What variables should be on the right-hand side? There are just many different specifications you could use, so I'm going to focus on combining across different variable sets and lag structures.

Again, to fix the math a bit: y is the vector for which we're trying to calculate the impulse response function. On the right-hand side are all the coefficients; call them the coefficient matrices B, and take the vector of the coefficients in B: that's θ, a big coefficient vector that you estimate by least squares. The parameter of interest, the impulse response β, is some transformation g of that least squares estimate θ. So again: B is the coefficients on the right-hand side, θ is the full list, and β is the parameter of interest, some transformation of θ. You estimate the model by least squares: run the least squares regressions, get B, list it all as θ, take the transformation, and get the impulse response β. In practice you never actually work out what this function g is; you just go to Stata and type something like "var y x", and behind the scenes the software makes that transformation of the least squares estimates for you.

To do combination you have to have a bunch of different models. I'm going to focus on models which are vector autoregressions with fewer lags, and models with fewer variables. In particular, to keep things simple, my models with fewer variables are just autoregressions, because in the forecasting environment simple autoregressions often do extraordinarily well out of sample. So my submodels are vector autoregressions with one lag, two lags, three lags, up through P lags, and autoregressions with lags one through P.
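In symbols, the setup I've just described looks roughly like this (my reconstruction of the notation, not verbatim from the slides):

```latex
y_t = B_1 y_{t-1} + \cdots + B_P y_{t-P} + e_t ,
\qquad \mathrm{E}[e_t e_t'] = \Sigma = H H' ,
\qquad \varepsilon_t = H^{-1} e_t ,

\theta = \mathrm{vec}(B_1, \ldots, B_P) ,
\qquad \beta_h = g_h(\theta) ,
```

where ε_t are the structural shocks and g_h is the nonlinear map from the VAR coefficients to the response of y_{t+h} to a shock at time t.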
So once again: you start with a list of variables for which you want to estimate a vector autoregression, and a maximum number of lags, and I then use all the models with fewer lags and all the autoregressions with fewer lags. In the application P is five, so that gives me ten models. I could easily include a hundred models or a thousand models; in the algebra it's not really a big deal. The question is which submodels you want to focus on as potentially interesting for your impulse response analysis, and in the application I just haven't worked all that out yet; it's a matter of working out the algebra of how to impose the constraints for the submodels. For now it's simple, just ten models, but the hope is to generalize to other subsets.

What's important for the math is that each submodel can be written as a linear restriction on the parameters. So θ is the list of all the parameters of the full model, and some linear restriction on those generates a submodel: these vector autoregressions with fewer lags and these autoregressions are zero constraints on parameters, so each submodel is a different set of zeros. The reason that's useful is that for the combination theory I want explicit expressions. What's interesting, if you understand least squares methodology, is this: θ̂ is the least squares estimator, and θ̂_r is the estimate from submodel r. One way of estimating a submodel, for example an autoregression, is just to type "regress y on lagged y"; another way is to take the big vector autoregression estimate and apply a linear transformation to it. So from the estimate of the big model I can get a submodel estimate just by doing a rotation. The reason that's useful is that once I have a distribution theory for the big vector θ̂, I get the distribution of the submodel estimates just by linear projection; and then the impulse responses are nonlinear transformations of those.

So I'm doing forecast combination, or impulse response combination: I've estimated 10 models, and each one is going to get a weight. I put weight 1 on the big model, weight 2 on the next model, weight 3, and so on.
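To write that rotation down concretely, here is one standard way to do it; this is a sketch of the idea, and the paper's exact formula may differ. Submodel m imposes zero restrictions R_m′θ = 0, and its estimate can be written as a linear function of the full least squares estimator, for example the projected (minimum distance) estimator:

```latex
\hat\theta_m = \hat\theta - \hat{V} R_m \left( R_m' \hat{V} R_m \right)^{-1} R_m' \hat\theta ,
```

where V̂ estimates the covariance of θ̂. Because each θ̂_m is linear in θ̂, the joint distribution of all the submodel estimates follows from the distribution of the single full-model estimator.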
So I have 10 weights, and my combination estimator is the weighted average of these estimators; the weights are positive and sum to 1. Now we want a way of picking the weights. To pick the weights I'm going to use a distribution theory approach: come up with an asymptotic distribution for this weighted average, and then pick the weights which minimize its mean squared error, the mean squared error of the asymptotic approximation.

Now, there's a problem in asymptotic theory: when you impose restrictions that are false, you get omitted variable bias, and that bias dominates the asymptotics. The way around this in the distribution theory is to assume the constraints are almost true, that they lie in a root-n neighborhood of the truth. That allows the bias and the variance to be of the same order, so neither dominates the other asymptotically. People outside econometrics often worry about whether this is metaphysics or psychology or something; it's a mathematical trick which keeps everything relevant in the distribution theory, and that's what's important. Mathematically, the restrictions R′θ are within a root-n neighborhood of being true, with the deviations collected in a vector δ; all the restrictions are close to true. Then we get a distribution theory.

The first line says that the least squares estimator is asymptotically normal. We know that; we teach it to our students in Econometrics 101. In fact I'm going to go further: I'm going to say the least squares estimator is asymptotically a random variable Z, which is normal, and everything that follows is going to be a function of that same Z. The second line says that the impulse response estimate, which is a nonlinear function of the least squares estimates, is asymptotically a linear function of the same Z. When we teach the delta method, maybe in Econometrics 102 rather than 101, we say that nonlinear functions are also asymptotically normal; here we have the stronger statement that the estimate is asymptotically a linear function of the same normal random variable. The third line says that the combination estimator has an asymptotic distribution which can be written as a linear function of the same random variable Z plus a bias term, where the linear map and the bias term are complicated functions of the weights. The actual formula is not so important; what matters is the idea. As you put weight on the big model, the bias decreases but the variance increases; as you put weight on the small models, you get lots of bias but small variance. If I contrast estimating an impulse response with a five-lag vector autoregression versus a one-lag autoregression, the autoregression is much more precise but biased, and the big VAR is much less precise but unbiased; this expression quantifies that trade-off.

From this expression I can calculate the mean squared error of the combination estimator: I define the squared error of the combination estimator for a given set of weights and then take its expectation, and that is the approximate mean squared error of the combination estimator of the impulse response. Once again, I have 10 models, or 100 models; I use each to estimate the impulse response, I take a weighted average of the impulse responses, and I'm interested in the mean squared error of that estimator.
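Written out, the combination estimator and the three lines of distribution theory look roughly like this; this is my reconstruction of the notation, with A(w) and b(w) standing in for the weight-dependent linear map and bias term:

```latex
\hat\beta(w) = \sum_{m=1}^{M} w_m \hat\beta_m ,
\qquad w_m \ge 0 , \qquad \sum_{m=1}^{M} w_m = 1 ,

\sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d} Z \sim \mathrm{N}(0, V) ,
\qquad \text{with } R'\theta = \delta/\sqrt{n} ,

\sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} G' Z ,
\qquad G = \frac{\partial}{\partial \theta}\, g(\theta)' ,

\sqrt{n}\,(\hat\beta(w) - \beta) \xrightarrow{d} A(w)' Z + b(w)' \delta .
```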
That mean squared error is this expression, which looks pretty nasty, but it has three parts, and you can understand what they mean by thinking about it a bit. The first component, the one that begins and ends with δ, is the squared bias; again, the mean squared error of an estimator is typically squared bias plus variance. The last term, the trace of a bunch of stuff, is the variance. So I have a squared bias and a variance, and then I also have a term which is minus 2 times a weighted average of the K's. The K down here is a funky-looking expression, but it effectively has to do with the number of estimated parameters: that term is like a penalty, minus 2 times a weighted average of, effectively, the number of estimated parameters. It's very similar to what goes on in the Akaike information criterion or the Mallows criterion: whenever you work out mean squared errors you end up with a penalty due to the number of estimated parameters. In the context of model selection the penalty is of course 2 times the number of estimated parameters; in this context it's related to the average number of estimated parameters across the models in your averaging estimator.

Now, I want to estimate this mean squared error, and I propose an estimator which involves an estimate of the penalty. This is a completely feasible estimator with no tuning parameters, proposed as an estimate of the mean squared error; it's kind of like a Mallows criterion, and I show that it's an unbiased estimator of the mean squared error. This criterion can be written as a quadratic function of the weights. I have 10 different impulse response estimates, I take a weighted average for fixed weights, and the estimated mean squared error for a given set of weights is a quadratic function of those weights. I know how to minimize quadratics; I probably teach that in high school, although here the argument is a vector, so you need the matrix calculus version of minimizing a quadratic. The only challenge is that the weights must be positive and sum to 1, so you don't use the standard matrix algebra solution; instead we use quadratic programming methods, MATLAB's optimization routines, which have quadratic programming at their heart. We simply minimize the quadratic subject to the restriction that all the weights are positive and sum to 1. So effectively, once again: I've estimated 10 models, or 100; I take a weighted average; I've figured out that the estimated mean squared error of the weighted average is a particular quadratic function; and I find the set of weights, positive and summing to 1, which minimizes that estimated mean squared error.

When you estimate a bunch of weights this way, it turns out that most models don't get positive weight; only a few do. The algebraic reason is that you're minimizing a quadratic subject to the restriction that the weights lie on the unit simplex, which is a pointy object. When you minimize a quadratic subject to being on a pointy object, you often end up on one of the points, and the points and edges of the simplex are weight vectors where particular models are zeroed out. So typically the solution is a weighted combination of a subset of models.
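As a minimal sketch of this weight-selection step (illustrative only: Q is a stand-in for the estimated quadratic form of the criterion, and I use a generic solver rather than the paper's MATLAB quadratic programming routine):

```python
# Minimize the estimated-MSE criterion w'Qw over the unit simplex
# (w >= 0, sum(w) = 1); Q here is a placeholder positive definite matrix.
import numpy as np
from scipy.optimize import minimize

def combination_weights(Q):
    """Weights minimizing w'Qw subject to the simplex constraint."""
    M = Q.shape[0]
    w0 = np.full(M, 1.0 / M)                     # start from equal weights
    res = minimize(
        lambda w: w @ Q @ w,
        w0,
        jac=lambda w: 2.0 * Q @ w,               # gradient of the quadratic
        bounds=[(0.0, 1.0)] * M,                 # w >= 0 (and <= 1)
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x

# toy example with three "models"; low-risk models get more weight, and
# corner solutions on the simplex can zero some models out entirely
Q = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.5]])
print(combination_weights(Q))   # roughly [0.21, 0.47, 0.32]
```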
I estimated 10 models; maybe the VAR(3) gets positive weight, maybe the AR(4) gets positive weight, but not necessarily all models. Given the weights, you estimate the parameter of interest, in this case the impulse response, and it's effectively a Stein estimator.

I don't have a good optimality theory for this estimator in the impulse response context, but we do in other contexts. In a paper a decade ago I showed that, under homoskedasticity in linear regression models, this kind of averaging is asymptotically equivalent to the infeasible oracle that uses the best combination. A few years later Jeff Racine and I generalized the technique to a cross-validation method, so that if you select the weights by minimizing a cross-validation criterion you can avoid the homoskedasticity assumption, and the averaging estimator is again asymptotically equivalent to the oracle estimator that takes the best combination. In a paper in an Oxford handbook I focus on IV regressions, showing that if you use the cross-validation method you get IV estimates that are asymptotically equivalent to the oracle best-combination estimator. In a paper in Quantitative Economics I use the local asymptotic theory I discussed earlier and show that this model combination dominates the unrestricted estimator uniformly in the parameter space when the models are nested and separated by groups of four; it's a complicated statement, but it's the Stein effect: to get a Stein improvement you have to shrink at least three coefficients. I also have a paper in the Journal of Econometrics which looks at the parametric context, and there I show that combination dominates the maximum likelihood estimator asymptotically. A lot of people are taught, call it Econometrics 3 if Econometrics 1 and 2 are what I mentioned before, that maximum likelihood is asymptotically optimal in some sense. Well, this theorem says that's wrong: you can do better with shrinkage estimators. You may ask how one statement can be right and the other wrong; it all depends on how you set up the asymptotic experiment. But this theorem shows that the combination estimator uniformly dominates maximum likelihood, and it achieves a local minimax efficiency bound, which is a stronger statement than the minimax efficiency bound used to justify maximum likelihood. In 2008 I focused on forecasting and showed that using these methods for forecasting beats least squares. And in a paper published a couple of years ago with Xu Cheng, we focus on multi-step forecasting, using multi-step cross-validation methods where you leave out blocks of data.

So I have some theory; now let's see if it actually works, in simulations. Like most of the VAR literature I focus on 7-variable systems, 200 observations, 5 lags. I compare three methods: least squares; the default BVAR, using the MATLAB code from the website of the current state-of-the-art paper; and the third estimator is mine. I look at three designs and different impulse response horizons, and I record the mean squared error relative to that of least squares. Numbers less than 1 are better than least squares; numbers bigger than 1 are pretty bad, because they're worse than least squares, and this is a heavily parameterized model, 200 observations and 36 regressors, so you'd hope to be able to beat least squares. In the first design the data are truly generated by an autoregression. That seems kind of silly for vector autoregressions, but the design focuses on the serial correlation properties.
Serial correlation is where all the action is here, because as you get closer to the unit root you get into the territory where the BVAR method is designed to work well. There are a lot of numbers here, but let me walk through the table. The columns are the different impulse response horizons, from horizon 1 through horizon 20; the rows are the degree of persistence in the regressors. At the top the data are very stationary, AR coefficient 0.5; at the bottom the coefficient is 0.98, which with 200 observations is essentially a unit root. The theory in the paper, by the way, assumes stationarity; that's why I don't have a coefficient of 1 listed at all.

Look at the first row, which compares the BVAR and the Stein method; numbers less than 1 are beating least squares, and both methods beat least squares. At horizon 1 the BVAR is a third better than least squares, and the combination method gets half the root mean squared error of least squares; pretty good. But if I go to the unit root row, you get the flipped result: the BVAR has half the root mean squared error of least squares, and the Stein method is a third better. That's because the BVAR exploits its shrinkage towards the unit root; it likes the unit root. Now look at the long horizons in the first row: you get huge benefits from shrinkage at long horizons; in fact the Stein method has negligible mean squared error relative to least squares at the very long horizons. At the unit root, in contrast, nothing much changes at long horizons. If you only focused on those two rows, you'd say this is not a big deal. But something very interesting happens in between, at 0.7. So the autoregressive coefficient is 0.7, sample size 200; those of you who do time series have a feeling for what kind of data that is: persistent, but not hugely persistent. What happens here is that the BVAR method has a root mean squared error which is worse than least squares, while the Stein method is doing better. You say: huh, that's weird. And if I go to the long horizons, out to 20 steps ahead, the BVAR is 20 times worse than least squares, which is really bad, while the Stein method is 20 times better than least squares. How can it be 20 times worse? Isn't that a computer bug? Well, it turns out that if you look at the individual simulation draws, half the time one thing happens and half the time another: half the time the results look fine, and half the time the BVAR looks weird. What do I mean by that? The BVAR estimate is the mode of the posterior, and half the time the posterior mode is sitting right at the prior, the multivariate unit root. Now, what is the true impulse response at 20 steps ahead? It's 0.7 raised to the power 20, about 0.0008, very close to 0. Under the unit root model, the 20-step impulse response is 1 raised to the power 20, which is 1. And 1 and 0 are very different. So what's happening is that half the time the BVAR produces estimates so close to the unit root that it's very bad at long horizons when the unit root is not true; essentially the method is too aggressive in shrinking towards the unit root.

If I change the design to a vector autoregression with more complicated correlations, the same pattern appears, though not as dramatically; the same kinds of things happen, and in general the Stein method does the best everywhere except for the near-unit-root design. Next, I'm going to use actual data.
What I do is take the application used in a lot of the recent papers on vector autoregressions: these particular 7 variables for the US, estimated on quarterly data. I use the estimated coefficients, pretend they are the true DGP, simulate it, and do the same exercise. Here I look at the impulse responses clustered by variable, and what we see is not too different from before: in general the Stein method does best, and the BVAR method can do significantly worse than least squares; but in general the Stein method is also not doing that much better than least squares. I think that's because the data look very much like a random walk, so it's hard to beat least squares with the Stein method; I was surprised how badly the BVAR method does.

In any case, finally, let me show you some actual numbers. When we do econometrics we can evaluate the quality of methods by theory, but most often those theories involve approximations; we can evaluate methods in Monte Carlo simulations, where we control the truth; but when I show you applications, I can't tell you what's better and what's worse. All I can say is that the estimates are different, and we have to be honest about that. Here are the variables. I'm going to look at impulse responses to a monetary shock. This is a 7-variable system, so there are 49 impulse responses; I'm not going to show you all 49, only 3 or 4, since a lot of them have similar shapes. So I'm looking at impulse responses to a monetary shock, meaning a change in the Fed funds rate; I'm not going to look at point forecasts here.

This first plot is the impulse response of US real GDP to a Fed funds shock. What's a Fed funds shock? One day the Fed, the Board of Governors, the Open Market Committee, gets excited; they wake up, hold a conference call, and say: hey, let's raise interest rates, the markets aren't expecting it, wouldn't that be cool? And presumably they saw some need to do it. So they raise interest rates, and what happens immediately in the economy is traced out in quarters: four quarters is one year, then two years, three years, four years, five years. What's the effect on US GDP?
It falls; that's what we teach in basic intermediate macro, right? What you see here are plots of three estimates: the least squares estimate is the circles, the BVAR estimates are the crosses, and the stars are my new method. The distinction between the methods: at short horizons they line up very nicely, and this is pretty generic; they are very similar at short horizons and differ typically at the longer horizons. The second thing you notice is that my Stein method typically lies in between the other two in terms of the point estimates. The third thing you notice is that the BVAR method tends to flatten out; that's a unit root, and a unit root means that shocks have permanent effects. The least squares method tends to be stationary, and the Stein method tends to shrink towards, to be similar to, the stationary estimate. So that's GDP: once again, interest rates go up, GDP falls, and we can also see how the effect is traced out over time.

Second, the effect on the price level. Janet Yellen raises the interest rate, I go out to Starbucks to buy a coffee, and they say it's another 5 cents; prices have gone up. What are the differences across the estimators? At short horizons, practically up to a year and a half, very similar estimates of the impulse response; the differences are at the long horizons. Here the least squares estimate is the most different, and the Stein estimator is closer to the BVAR estimate.

My third impulse response is investment: interest rates go up, investment falls. We know that, but what's more important is the magnitude by which it falls. Notice that the numbers are much larger than those for GDP: the impact of interest rates on investment is much more substantial than on GDP, not unlike what's taught in macro classes. Again, at the short horizons the estimates are very similar, at long horizons very different; here the BVAR method is telling you there's a long-run impact of interest rate changes, while the other models are telling you it dies off in the long run.

This final picture is the response of hours worked, which is a measure of employment in the economy. Once again the estimates are very similar at the short horizons and very different at the long horizons: a permanent effect according to the BVAR method, more of a temporary effect according to the Stein estimate.

Another interesting thing about combination methods is that you can actually look at the weights and ask which models were useful. I estimated 10 models, and what I'm plotting here are the weights by impulse response horizon; I didn't emphasize this before, but I estimate the weights separately for each impulse response horizon. What you see is that one model, the box, gets about 70% of the weight at all impulse response horizons, and that's the kitchen sink, the VAR with five lags. Impulse response estimation likes putting lots of weight on the VAR(5); impulse responses don't like bias, so that's what it wants to do. It also puts the second most weight on the VAR(3), but at short horizons it likes the autoregressive models. So this tells you which models the method finds useful.

I also do the same calculation for forecast combination; the first was impulse responses, and this is forecast combination, done separately by variable. This is for forecasting GDP: suppose I want to forecast GDP 1 year out, 2 years out, 3 years out; what models would I use?
At the very short horizons, the diamond model, the VAR(2), is the best thing to use; put 60% of the weight on that model. But at 8 steps ahead, 2 years out, don't use it: there you want to put 20% on the VAR(4) and 20% on an autoregressive model. And at the very long horizon, forecasting 3 years out, it says just use 2 models, the autoregression and the VAR(3), with essentially equal weights. So forecast combination leans towards the small models, relative to the impulse responses: for impulse responses the criterion cares more about bias and puts the weight on the big model, for forecasting it puts the weight on the small models, and it depends on the forecast horizon.

Just to conclude, what I want to emphasize is that combination, or averaging, is underutilized in economics. There's kind of a growth industry in the statistics literature on the theory, but it's underutilized in practical economics. Averages, an idea going back to portfolio optimization theory, have smaller risk than single estimators. What we do know right now: we have a fairly decent theory about how to select the weights to reduce MSE, some asymptotic optimality results, and some distributional results. What we don't understand is a bunch of other stuff. For example, we don't really understand much about inference after model combination; it's hard to do inference in general. How do we go beyond point estimation? How can we combine distributions? Forecast combination and model averaging focus on trying to get the best point estimate, but is there actually information in the heterogeneity, in the fact that the estimates may be very different, in the spread of the forecasts? Could we use that in some way to get a measure of uncertainty? These are things which we don't really understand, or at least I don't understand, so come back in a few years' time.