So, I am going to talk about regression models, and about the basic problem that most people do not know how to do regression analysis correctly. Furthermore, the textbooks do not teach it in the right way, so it is not really their fault. Let me start by saying that one of the basic ideas required for a correct analysis of a regression model is that the assumptions are valid. We might call this the axiom of correct specification, and this is going to be my topic for this lecture: a regression model is valid only if all of the assumptions of the regression model are satisfied. Now, everybody who takes a regression course studies the assumptions of the regression model, but nobody knows why they are studied, because after you study them not much mention is made of them. One of the key assumptions is that you have the right model: you have all of the right regressors, you have the right functional form, all of the regressors are exogenous, the model remains the same in every period, and the parameters do not change from time to time. So there are a huge number of assumptions. A small number of these assumptions are sometimes tested, but most often people do not even test those, and they simply assume that the regression is valid. For example, we have a paper that has been submitted to our PDR. In the paper the author says that he wants to study the relationship between FDI and economic growth, so he runs a model in which economic growth is on the left-hand side and FDI is one of the variables on the right-hand side. Then he says literacy also matters, so he puts in another variable which is called literacy, and he also takes an interaction term: he takes literacy and multiplies it by FDI. Then he runs the regression, he gets a high R squared, and the coefficients are significant, so he concludes that this is a good regression. Then he looks at the coefficient on FDI
and he says, okay, if I increase FDI by so much, the impact on economic growth will be so much. Now, this is what people are taught. We were having interviews for lecturers recently, and nearly everybody responded to this question in the same way. Suppose I ask you to determine the effect of health on economic growth. What will you do? If you look at the WDI, the World Development Indicators dataset, you will find variables related to health: there will be a variable on hospitals, a variable on access to clean water supply, a variable on number of doctors per population, and a few other variables. So you take all of those variables and put them on the right-hand side, you take economic growth on the left-hand side, you run a regression, and you're off: you have your thesis. So now the question is, can both of these people be right? The first person is telling us that here is the equation which says that economic growth depends on foreign direct investment and on literacy, the other person is running the regression of economic growth on health, and they're both coming up with their conclusions. So what does the axiom of correct specification have to say about this? Can anybody answer this question? Can both of those authors be right? Yes? No? How many votes for yes? How many votes for no? Okay, so why can't both people be right? Yes, please. "Sir, in the model you said that the marginal effect of health is quite related to productivity and development, so in a sense we can say that FDI is more related to... or rather, that health is quite related to development." No. Okay, all of these are good intuitive understandings, but you see, the axiom of correct specification says that your model is valid only if it is true. So can both of these be true models? It is impossible. The FDI model says that health does not matter, because the health variables are not included.
The health model says that FDI does not matter, because FDI is not included. So either the FDI model is true, or the health model is true, or both are false, but it is impossible that both can be right. There is only one true model for economic growth. If you have it, then your interpretation is valid. If you don't have it, then your interpretation is wrong. This is the very serious problem which the textbooks do not mention. What is written there implies what I am saying, but the point is never made, because if you made this point you would stop doing econometrics. If you have only 20 variables and you look at all possible combinations of regressors, that is 2 raised to the power of 20, which is more than a million. So you have more than a million possible regression models, only one of which is correct. What is the chance of your getting the correct model? And if you don't get the correct model, your interpretation is wrong, and this is assuming all models are linear. If you have log forms and non-linearities and other functional forms, then this is an ocean, and your chances are like the chances of finding one particular drop of water in the ocean or one speck of dirt in the desert. So you see, you cannot run a regression model like this, which is what everybody does. Suppose I want to find the effect of health: well, this is how people do it. I'm not making this example up. This is one of the theses that was presented at our interviews; somebody had written a thesis on health and this is what they had done. Our own students and faculty have written papers like this, and we have published papers like this in the PDR.
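The count of candidate specifications mentioned above can be checked directly. The following is a minimal sketch, assuming that a "model" means nothing more than a choice of which candidate regressors to include in a linear equation; the variable names are hypothetical. With k candidates there are 2 to the power k such choices, so 10 candidates already give 1,024 models and 20 candidates give more than a million.

```python
from itertools import combinations

def count_linear_models(k: int) -> int:
    """Number of distinct subsets of k candidate regressors (empty set included)."""
    return 2 ** k

def enumerate_models(names):
    """Yield every subset of the candidate regressors as a tuple."""
    for r in range(len(names) + 1):
        yield from combinations(names, r)

# How fast the number of possible specifications grows.
for k in (10, 20):
    print(k, count_linear_models(k))

# Sanity check on a tiny case: 3 candidates give 2**3 = 8 specifications.
candidates = ["fdi", "literacy", "health"]  # hypothetical variable names
models = list(enumerate_models(candidates))
print(len(models))
```

This counts only which regressors are included; allowing logs, interactions, or alternative measures of the same concept multiplies the count further.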
Now suppose you run this model, with GDP on a lot of health variables. If this model is right, then there are a thousand papers which people have written on GDP growth (literally, this is true, more than a thousand) and none of them has included health variables, so all of those papers are wrong. If my paper is right, then all of those papers are wrong, because they have ignored the health variables. If those papers are right, then my paper is wrong. So there is a very serious problem which is simply not mentioned in textbooks. The question is, what can be done about this? There is an approach called the encompassing methodology, which is due to David Hendry, and I will talk a little bit about it, but I will also show you some examples to explain why this is such a serious problem, which people don't understand. All right, so this is a World Bank dataset. I am going to take this household consumption series; this is the Algerian household consumption series. This is the Australian GDP series. Now, Algeria is in Africa, Australia is somewhere else. So if I run a regression on these series there should be no relationship. I asked this question of one of the people in the econometrics interview: I take two variables that have no relation, say Nigerian GDP and Bolivia's consumption, or Nepal's GDP and Chile's consumption. Will there be any relationship? Obviously not. These two variables are unrelated. Our regression theory says that if you have two variables which are not related, then the regression will show no relation. Now we can run regressions here. I am going to run a regression. Y range, X range. The Y range is A5 to A54, that's Algerian consumption, and the X range is B5 to B54, that's Australian GDP. I think we are all right now. Output range. There it is. Now, the R squared is 71 percent. You see, the X variable coefficient is 0.047. So the consumption in Algeria is about 5 percent of the GDP in Australia.
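What the spreadsheet shows can be imitated with simulated data. The sketch below is only an illustration under stated assumptions, not the World Bank series: it generates two trending series independently of each other (random walks with drift, with arbitrary parameters chosen here) and regresses one on the other. The function names and parameter values are hypothetical.

```python
import random

def ols_simple(x, y):
    """OLS of y on a constant and x. Returns (slope, std_error, t_stat, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5  # conventional OLS standard error of the slope
    return slope, se, slope / se, 1 - rss / syy

def trending_series(n, drift, noise_sd, rng):
    """Random walk with positive drift: it grows over time, as most GDP series do."""
    level, out = 100.0, []
    for _ in range(n):
        level += drift + rng.gauss(0.0, noise_sd)
        out.append(level)
    return out

rng = random.Random(0)
# Two series generated with no connection to each other (hypothetical stand-ins).
series_a = trending_series(50, drift=2.0, noise_sd=1.0, rng=rng)
series_b = trending_series(50, drift=5.0, noise_sd=2.0, rng=rng)

slope, se, t_stat, r2 = ols_simple(series_b, series_a)
print(f"slope={slope:.3f}  se={se:.3f}  t={t_stat:.1f}  R^2={r2:.2f}")
```

Because both series trend upward, the regression reports a high R squared and a large t statistic even though neither series has anything to do with the other by construction.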
The coefficient is highly significant: the t statistic is 10.9 and the standard error is 0.004. So if you look at plus or minus two standard errors around this coefficient, zero is very far away. So now we conclude that people think the consumption in Algeria is determined by the GDP of Algeria, but it is not true. What is actually true is that however much GDP is produced in Australia, 5 percent of that is consumed by the people in Algeria. The R squared is good; 70 percent is pretty strong. And you can do this with any two countries. So what does the axiom of correct specification have to say about this? What is the problem? How's that? Theoretical background? You see, this is a good point, but it is fairly deep. Theoretical background is an important missing element which is not taught in your textbooks. Suppose I go to this health person and say, okay, you're saying that the health variables are causing GDP. How does it happen? This is the real issue, which is mentioned nowhere in the textbooks, but it is one of the most fundamental issues in understanding regression analysis. And that is why a crucial component of any good regression analysis must be a historical and qualitative analysis, which nobody is doing, except that I tell my students to do this, because you have to establish the causal chain. The causal chain is not established by numbers. You have to go and say, okay, when people are healthy they work harder, and so you have to look at missing hours of work and how they are related to health. You have to think about how the health variable has its effect. Similarly for FDI, you can't just run a regression and say, here's my regression, R squared is high, the coefficient is significant, therefore FDI affects GDP. But I think the main point people are still not getting is the axiom of correct specification, which underlies everything. This regression interpretation is true only if the regression is valid.
That the regression is valid means that I have all of the right regressors and none of the wrong regressors. Now, are there any missing regressors in this equation, the regression of Algerian consumption on Australian GDP? Are there any missing regressors? Yes. So how do you know that it should not be there? Well, you see, if I say that Australian GDP has an effect because trade is linked, then you cannot exclude it just like that, can you? There you go. There are good, strong arguments and then there are weak ones, so you have to learn how to make strong arguments. When you say that something is a mistake, you should be able to say something which will be convincing. Now, okay, I have put the Australian consumption in here. Oh, that is not what I wanted, actually. What I wanted was something else. Okay, now I'm going to put in the Algerian GDP. There is a very important, fundamental missing variable in this regression. What is the missing variable in the regression of Algerian consumption on Australian GDP? Obviously, yes: theory tells us that consumption is a function of the GDP of the nation. Once the most important variable is missing, every regression which excludes it is wrong. This is called mis-specification analysis. Everybody has studied this theory, but somehow the theory is never used, and students are never taught to use it. This is the very strange thing about the econometrics textbooks. They tell you the right things in bits and pieces, but when it comes to application everything is forgotten, and that's why our students produce these hopeless theses and answer these questions in a way which is completely wrong, and they have no idea, even though these things are there in the books. So if you have a missing variable, then mis-specification analysis tells you that because of the missing variable the equation is mis-specified.
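For reference, the textbook result behind mis-specification analysis can be written down for this example. The notation below is an assumption made for illustration and does not come from the lecture: C is Algerian consumption, Y is Algerian GDP (the omitted regressor), and Z is Australian GDP (the included one). Hats denote OLS estimates from the regression that includes both Y and Z; the last line is an exact algebraic identity in any sample.

```latex
% Assumed notation: C = Algerian consumption, Y = Algerian GDP, Z = Australian GDP.
\begin{align*}
\text{Regression with both variables:}\quad
  & C_t = \hat{\alpha} + \hat{\beta}\, Y_t + \hat{\gamma}\, Z_t + \hat{\varepsilon}_t \\
\text{Regression with } Y \text{ omitted:}\quad
  & C_t = \hat{a} + \hat{g}\, Z_t + \hat{u}_t \\
\text{Auxiliary regression of } Y \text{ on } Z\text{:}\quad
  & Y_t = \hat{c} + \hat{\delta}\, Z_t + \hat{v}_t \\
\text{Omitted-variable identity:}\quad
  & \hat{g} = \hat{\gamma} + \hat{\beta}\,\hat{\delta}
\end{align*}
```

Read this way, the coefficient on the included variable is its own coefficient plus the omitted variable's coefficient times the slope with which the omitted variable tracks the included one. If Australian GDP truly has no effect, so that the first term is near zero, the short regression still reports roughly the consumption response to own GDP scaled by how closely the two GDP series move together.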
Now, there is a practical aspect to this also: if the missing variables are minor and have a small effect, then excluding them has only a small effect on the regression, but if you have major missing variables then everything is completely wrong. Here we have a major missing variable, because the main determinant has not been included. Now, according to Keynesian theory, GDP growth is determined primarily by investment. Investment in the country is the exogenous variable, and the sentiments of the investors drive the economy. If the investors are confident and bullish and are putting in a lot of money, then that will cause growth. If the investors are not investing, then there will be no growth. Now, what do you think of the health and the FDI regressions? These were actual papers: one was a thesis that was passed and approved, and somebody got their MPhil; the other was a paper that was submitted to PDR, and the author appealed to me, saying, my perfectly good paper has been rejected by PDR, please help me get it published. So what do you think is the problem with these two papers, the health and the FDI? Yes, please. What is the specification error? Which is the missing variable? GDP is the dependent variable. Huh? Which other variable? What is the primary determinant of GDP growth? Investment, according to Keynesian theory. If investment is not present in the equation, then the equation is majorly misspecified. Now let me show you what happens if I run the regression. Okay, so that's A5 to A54. In Excel, the regressors have to be right next to each other; it cannot handle them otherwise. First we ran Algerian consumption on Australian GDP, and now I want to run it with the missing variable included. So I want to put in Australian GDP and Nigerian GDP; no, Algerian.
Okay, so now I have Algerian consumption, and then Australian and Algerian GDP, and now I'm going to run the regression. I've set the regressors from B5 to C54, which is the next two columns. What that means is that I've got Australian consumption, no, Algerian consumption, on both Australian and Algerian GDP. X range, X range; there was an X range problem. Okay, so now the R squared has gone up to 88 percent. Now, it's very interesting: the first X variable is Australian GNP and the second is Algerian GNP. So now we run Algerian consumption and we find that Algerian GNP is very powerful and important, and there's a 50 percent marginal propensity to consume. But Australian GNP is not irrelevant, as it should have been. It frequently is irrelevant in these equations, but here Australia is actually significant and it has a negative effect, negative 2 percent. The standard error is very low, 0.008, and the t statistic is minus 2.5. If you look at the p-value, it's 0.015, which means that it's significant not quite at the 1 percent level but at 1.5 percent, so it is significant at 5 percent, for example. So Australian GNP is still significant. Again, this is most likely just a wrong result, and the wrong result arises because we have excluded an important variable. Now, you have to use your brains; you have to understand why this is happening intuitively, without any formulas. Why did we get a strong positive coefficient when we ran Australian GNP directly, and a negative one when we put it in the right way? If we just run Australian GNP alone, the effect is 5 percent: a positive coefficient, meaning that if Australian GNP increases, Algerian consumption will increase by 5 percent of that. Now the second regression is saying that if Australian GNP increases, Algerian consumption will go down by 2 percent. So which of these is right? Now you have to understand. Yes, okay.
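The change in the coefficient when the missing variable is added can be illustrated with simulated data. This is a minimal sketch under stated assumptions, not a reproduction of the Excel output: consumption is constructed to depend only on the country's own GDP, with a propensity of 0.5 chosen to echo the figure above, while the "foreign" GDP series is generated independently. All names and parameter values are hypothetical.

```python
import random

def trending_series(n, drift, noise_sd, rng):
    """Random walk with positive drift, a crude stand-in for a growing GDP series."""
    level, out = 100.0, []
    for _ in range(n):
        level += drift + rng.gauss(0.0, noise_sd)
        out.append(level)
    return out

def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slope_one(x, y):
    """Slope from OLS of y on a constant and x."""
    dx, dy = demean(x), demean(y)
    return dot(dx, dy) / dot(dx, dx)

def slopes_two(x1, x2, y):
    """Slopes (b1, b2) from OLS of y on a constant, x1 and x2, via the normal equations."""
    d1, d2, dy = demean(x1), demean(x2), demean(y)
    s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
    s1y, s2y = dot(d1, dy), dot(d2, dy)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(1)
own_gdp = trending_series(50, drift=2.0, noise_sd=1.0, rng=rng)      # hypothetical
foreign_gdp = trending_series(50, drift=5.0, noise_sd=2.0, rng=rng)  # hypothetical, independent
# By construction, consumption depends ONLY on own GDP (true foreign coefficient is 0).
consumption = [10.0 + 0.5 * g + rng.gauss(0.0, 0.5) for g in own_gdp]

b_short = slope_one(foreign_gdp, consumption)                  # own GDP omitted
b_own, b_foreign = slopes_two(own_gdp, foreign_gdp, consumption)
proxy_slope = slope_one(foreign_gdp, own_gdp)                  # how foreign GDP tracks own GDP

print(f"short regression, foreign GDP coefficient: {b_short:.3f}")
print(f"long regression, own GDP: {b_own:.3f}, foreign GDP: {b_foreign:.3f}")
print(f"b_foreign + b_own * proxy_slope = {b_foreign + b_own * proxy_slope:.3f}")
```

In this artificial setting nothing else is missing, so once own GDP is included the foreign coefficient shrinks toward zero. With real data there are further omitted determinants, which is one way a small, apparently significant coefficient of either sign can survive in the second regression.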
So, the first question is why we have a positive coefficient here; that is something you can answer. Why we have a negative one over there is not so easy. But notice the implication of this. Suppose that investment is the key determinant of GDP growth and it is not present in your FDI equation, which it is not. Then can we even say that the sign is right for FDI? After you have omitted the most important variable, nothing can be said about whether it's a positive contribution or a negative contribution or no contribution. This is what the axiom of correct specification says: unless you have the right equation, all your interpretation is wrong. So if you are missing a primary variable then your regression is wrong, and so this paper is completely wrong. And nearly every paper that we see makes this type of mistake; nearly every paper that is submitted makes this kind of mistake, that the primary variables that are important are missing. Now, one very important thing that you can understand here is why Australian GNP is coming out positive and significant in the first equation. Why? Trade between them? No, that's wrong; there is in fact very little trade between Australia and Algeria. What is the main determinant of Algerian consumption according to economic theory? Excuse me? Income, exactly: Algerian income. Excuse me? No, foreign aid is not the main determinant of consumption. It is Algerian GDP. If you look at your microeconomic theory, consumption is a function of income; that is the primary, most important determinant. So when that variable is taken out, everything else acts as a proxy for it, because that is the missing variable, the main determinant. So what do you think is the relationship between Australian GNP and Algerian GNP? Both are increasing over time. Basically GDP is growing in all the countries, so all of them will have
positive correlation. So this Australian GNP is acting as a proxy for the missing variable, which is Algerian GNP, since the two are strongly positively correlated. We can actually check it easily: equals CORREL of B5 to B54 and C5 to C54. There is a 95 percent correlation between Australian GDP and Algerian GDP, and this will always be the case for any two GNP series, because they are all strongly increasing. So that is what happens if you have a missing variable and that missing variable is the primary variable. If you are going to do real regression analysis, you have to recognize that there is a very important difference between big, primary variables, which are strongly determining, and small variables, which have minor effects. Again, this is something which people don't understand. If I want to determine the effect of health on GDP, and health is a minor factor, say it has a 5 percent impact on GDP, then if I'm running a regression on 20 to 25 data points for Pakistan, for example, it is impossible to find this effect, because there is too little data to pick up a small effect. This is just intuitive: if you want to pick up a small effect, you have to have a lot of data. If you want to pick up a big effect, if you want to ask what is the effect of Pakistan's GNP on Pakistan's consumption, you can do it with 20 to 25 points. But if you want to find out the effect of doctors on GNP, you cannot do it with 20 to 25 points of annual data; there is not enough data. You might be able to do it in some other way, for example by assessing the effect of labour force hours on GDP. Labour force hours will be a major determinant, because the production function is F(K, L). So first you find out the effect of labour force hours on GDP, and then you ask how much the medical factor affects labour force hours. Again, this is a strong effect, so you can trace it with a small amount of data. But if you try to go directly from health to GDP, you cannot find it. So by breaking
it up into the causal chain (this is very important, nobody knows how to do it, and it is not taught in the textbooks) we can find the effect that we want. This is actually related to Maria's question about what the real effect is. You have to go behind the numbers. You cannot just look at the numbers; you have to look at the real-world causal sequence. What is the mechanism by which health factors will affect GNP? You have to trace the causal chain. Okay, one important effect of health will be that people are not able to go to work. So you can get data on sick leave in government organizations, how many times people are applying for sick leave. You can get data in hospitals: you can take a survey and find out how many of the people in the hospital are working and are off from their jobs. So there are various ways you can get an estimate of how health affects hours worked, and that should be a relatively easy, clean relationship with a short causal chain. Then you can do the estimation, and then you can go on to the other step, which would be how much hours worked affect GNP. Even if health has only something like a 5 or 10 percent effect on labor hours, once you have the impact of labor hours on GDP you can use that to get your estimate. But you cannot go directly, because the intermediate variable is missing. One of the thesis proposals that was submitted was that we want to assess the effect of migration from Azad Kashmir on crop yields. Now, this is the kind of nonsensical idea that students are coming up with for their thesis proposals: I take two variables, nobody has ever done this before, so now this is a thesis topic. What nonsense. This is not how it is done. As I said, if you have only 20 variables you have one million regressions, so one million students can write theses, because there will be one million regressions which nobody has done before. There has to be some sense in the regression. Now, what do crop
yields depend on? Crop yields depend on how much seed you put in, whether you run the tractor, whether you put in fertilizer; these are your primary variables. If you take out all of those variables and you run a regression, you will still get something, because when you take out the significant variables, the insignificant variables take their place, and so do the coefficients. This is exactly what you are taught in the mathematics, but nobody explains the meaning. If you take a good course in econometrics, they go through the formulas for the bias when you have a mis-specification. If you take out one of the variables, or put in the wrong variable, you can calculate what the bias will be. It's a very complicated formula and it seems meaningless, but if you interpret the mathematics, it is saying that the included variable is trying to fill the place of the variable which is missing, and so its coefficient will depend on how it relates to the missing variable. Here we have a very simple situation: there is only one missing variable, let's say, the Algerian GNP. So the coefficient will depend on how the included variable correlates with this missing variable. But after you put in the missing variable, there are still other determinants. We've got 90 percent of the way, but there's 10 percent left, and there will be other variables which are important. Now, why is this coming out negative? Well, probably because after you account for Algerian GNP there is some other variable; maybe it's the exports of Australia. Again, you see, in order to find out we have to study very carefully what is actually going on in the real-world situation. For example, suppose that Australia and Algeria both produce oil (suppose; I'm not sure), and when Australian GNP is strong they are exporting more oil, and this cuts into the oil
exports of Algeria. Then that would explain the negative coefficient, but that doesn't mean that this is a good regression. We would need to put in that causal chain in order to understand what is going on. So you have to understand the real world in order to understand how to do regression analysis correctly, and this is unfortunately never explained in courses. So the axiom of correct specification is a very dangerous axiom. It says that you can interpret the equation validly only if you have the right equation. The chances of your having the right equation are one in a million, so you're dead from the start. And that is why your econometrics textbooks don't even mention this problem: the problem is so great that we could simply dump all of the articles that we see into the garbage without even looking at them. The author runs a regression and says, okay, this, this, and that, and blah blah blah, but everything depends on the assumption that he has got the right model, and the chances of getting the right model are not much more than zero. So what can you do to avoid this? Well, this is what the Hendry methodology addresses. What we normally do when we run regression analysis is start by taking some simple model, then we test this model, and if the tests work then we stop. That is, if you have a strong R squared and the coefficients are good, then we say, okay, we have found the model. Now, the thing is that, as I have said, everybody can find a different model, and everybody does, but only one of those regressions can be correct. So I have one paper in which health is the explanation of GNP, another paper which has FDI, and there are hundreds of papers: one of them is taking literacy, one of them is taking religious background as the source of GNP growth. So everybody can write a different paper, but only one of these
can possibly be correct, and most likely all of them are wrong. In fact, if we take the simple Keynesian theory that investment is the primary determinant of GNP, then any paper in which investment is not used as a variable is automatically wrong; you can throw it in the garbage. And that means nearly everything that people are doing with GNP growth, nearly everywhere. A huge number of theses are written on what explains GNP growth, because today GNP growth is God; this is the thing that everybody is pursuing, even though it is actually very harmful to go for GNP growth. So here is an FDI growth paper, and then there is another paper on intellectual property rights and growth. This is a very well written paper, and it takes these variables: GDP; IPR, the intellectual property rights index; FDI; EFW, the economic freedom index; the trade to GDP ratio; population growth; secondary years of education; and gross domestic investment. Now, here he has put in the investment variable, so at least the one criticism, that the major variable has not been included, is not applicable to this equation. And here is the analysis: GDP equals 121 times IPR, plus 0.06 times FDI, plus 33 times economic freedom, 0.01 times the trade ratio, 131.9 times population growth, and so on. Then he says the analysis clearly shows that the enforcement of IPR by one unit would significantly increase GDP by 121 units, because the coefficient is 121. So it means that in Pakistan we should start enforcing intellectual property rights. Now, can we rely on this regression? You see, this is the thing: you are being treated like children. You are being told that these are games you are playing, but actually this is not true. You are in the top one percent of Pakistan. Tomorrow you are going to be heading the analysis and policy think tanks, and you are going to be in the ministries. Actually, just recently the commerce minister was called to explain
why Pakistan's export performance is not very good. He did exactly this. He said, okay, let's look at it, and he assigned the task to one of his staff, and they ran a regression of export growth on four or five variables and said, look, the problem is that we have too many tariffs and we have this and that. By running a regression like this he came to a conclusion. Now, the question that I'm asking you is, is this reliable? Once I have this regression, can I go out and say, okay, we have so many CDs that are being pirated, and nearly every one of you is running Word and Excel on unlicensed copies, so before you go out, please pay the seven hundred dollars that it takes to buy Microsoft Word and Microsoft Excel, so that the GDP of Pakistan will increase? What do you think? Are we jokers, that we should go and run these regressions and make these policy recommendations, like everybody is doing? People are running these regressions. In the health equation, the girl who wrote the thesis said that GDP will increase by 10 percent if we increase health, so we should put so much money into doctors and hospitals and so on. Is this a reliable recommendation? Is this valid? Can we trust this result? Yes, what do you think, can we trust this result? This author has done a lot; this is much more sophisticated than the others. He has done the integration analysis and the cointegration analysis and the Johansen test. Yes? "This is my point of view..." Yes. Okay, what you are saying is that we can think about the problem of whether intellectual property rights cause an increase in GDP or not, and we can apply our brains, and I agree with that fully. But the question I'm asking is, is this regression result reliable? Okay, let me ask a second question. Suppose that I gave you an assignment to produce something which is almost exactly parallel to this regression, that is, something which uses the same
methodology, the same tactics, the same level of rigor. Could you get another equation in which this number would be minus 100? Would it be possible, instead of plus 100, to get minus 100 here? Yes? Yes. "We have the main investment variable, GDI, gross domestic investment..." How is that? Model specification, oh yes. Currently we have this linear model. Suppose I made this into a log-linear model. Suppose I put in quadratic terms. Suppose I take exactly these variables but I put in interaction terms. Suppose I put in some other measure for the same variable, like economic freedom. So instead of taking the economic freedom index, suppose I look at the number of elections which are being held, or the number of transitions; there are many other measures that people have thought of. For trade liberalization there are at least 10 different measures which people have devised as free trade indexes; one depends on the difference between the black market rate and the official exchange rate. There are hundreds, so for each of these variables I can find alternatives. What do you think would happen to this coefficient? Will it remain the same at 120 as I change these variables, or will it change? How much will it change? What I am trying to show you is that this is not an issue of point of view; this is an issue of reality. If I give this same exercise to another person and he picks up a different series and a different dataset, he will get a completely different result. This result is completely unreliable. If this result is completely unreliable, we cannot go around making policy recommendations on it. And not only is this folly, but this is what students are being taught to do. And once you understand what is going on, you can get the result that you want. For example, one of my students, Zaidah, I asked him to do the Granger causality
test. So we experimented a little. The dataset goes from 1960 to 2010 or 2015, so that's about 50 years of data. If you take out the last five years, or take out the first five years, you get completely different results. If you change the specification from log-log to log-linear or to direct levels, everything gives a different result. Ultimately, if you learn about what's happening, you can get all four results: X causes Y, Y causes X, bidirectional causality, and no causality. And it is not just that he showed this; if you go into the literature and look at papers, you can find all four results for any one country. In fact, in one of my papers I have collected, I think, 15 or 16 papers written on, I think it's export-led growth. Some people say yes, exports cause growth; some people say no, growth causes exports; some people say that there is bidirectional causality; and some people say there is no relation. You can find published papers with all four. If there were something valid in this methodology, it should be reflected in the stability of the results. It's not a question of point of view; it's a question of what is out there. So what can we do about this? There was a paper that was written; this is the paper: "Econometric Modelling of the Aggregate Time-Series Relationship Between Consumers' Expenditure and Income in the United Kingdom". This is simply a model of the consumption function, consumption equals alpha plus beta Y. This is the most basic relationship, very strongly supported theoretically: yes, your consumption should be some proportion of your income. In this paper he formulates the Hendry methodology, as it's called; this is the beginning, the first paper. My view is that if you don't understand this paper you know nothing about econometrics, and if you understand this paper then you have the beginnings of an
understanding of econometrics. This paper starts by saying that the Keynesian consumption function is one of the most thoroughly researched topics, but there is no consensus: everybody who estimates the consumption function comes to a different result. These different results are very important because they determine the investment multiplier and the marginal propensity to consume; they determine the short-run and the long-run impacts of government policy. Everything depends on the consumption function, because in Keynesian economics the consumption function plays a major role. If you have unemployment and the government wants to eliminate it, then it has to stimulate aggregate demand. What is aggregate demand? Aggregate demand is the consumption function: how much do the people want to consume? That is their demand. If the demand is too low there will be unemployment; if the demand is too high there will be inflation. So it's important to have a good idea of what the consumption function is; if you don't, you will make policy mistakes. This is the Keynesian theory: the government should try to find the level of expenditure at which full employment will be reached. There are two ways to do this, monetary policy and fiscal policy; money supply is one way. If you set the monetary policy too high and overshoot the target, you will get inflation; if you undershoot, you will get unemployment. Both of these are policy mistakes. So you have to have a good idea of what the consumption function is. But if you look at the consumption functions which have been estimated in the literature, and he cites many of them, one, two, three, four, five, six, seven different studies, all of them have different estimates. If you use one study you will come to one number; use a second study and you'll come to other numbers. No
convergence, and very big differences, not small differences. So what is the problem? First of all, why does everybody get a different result? What Hendry says is that this is because we are using the wrong methodology. The most important thing for a methodology is that it should be cumulative. Cumulative means that knowledge should build: if one person puts down a brick, then the second person should build on top of it or beside it. If everybody starts putting a brick in a different place, we will get a scatter of bricks and there will be no coherence. What this means is that there is a collection of models on the ground: person X has done one study, person Y has done another. Like we have here: Byron has done one study, Deaton another, Hendry another, and Bispham, Shepherd, Wall, Townend and Bean; all of these people have done studies. Now I can add an eighth study. And this is how people are working: everybody has done growth, and now I'm going to add another paper on growth. People have thought about the effect of health and so on, but nobody has thought about the effect of haircuts. So now I'm going to go to the barber and get the haircut variable, and I'm going to run a regression on the number of haircuts. It will be very strong and positive, actually, because as the population increases the number of haircuts will increase, so haircuts and GDP will be very strongly correlated. So we will have a new theory of GDP: if everybody gets more haircuts, we will get more GDP. Then we will make the policy recommendation that the government should go and increase the number of barber shops. This is unfortunately not a joke; this is what's going on today in policy making.

So what is the solution? One of the keys to Hendry is that when I put up a model, I must also say that my model is better than anything currently in the field. So I must test my
model against all existing models and prove that my model is better than every other model. This is the key idea of encompassing. And this is why, when you do a regression model, you have to do a literature survey. Suppose I am writing a paper on the consumption function, or on GDP growth, or on anything else: I have to look at all existing work. In fact my student Rahman did his study on the determinants of inflation. So I said, okay, find all of the papers that have been written on inflation in Pakistan. There are not that many, about 10 or 12, many of them coming from PIDE also. So we take all of those 12 regression models, and our goal is to produce an inflation model which is superior to all 12. When I produce this model, I will have the best regression model for Pakistan among all existing models. This is an easier problem than finding the one true regression model among the million; that cannot be solved. But at least I can solve this problem: I can beat everybody else. That doesn't mean my model is true. The next student to come along will find a model that beats mine, but at least we will be making progress: my model is better than all others, and his model is better than mine, so there is some continuity, some progress. That is the key idea of the Hendry methodology.

Then there are methods of model comparison, and this now becomes key: given two models, how do we compare them? Just look at two models for the moment. I have a model, which is the H model, and another person has a model, which is called the W model. I must show that the H model is better than the W model if I'm going to implement this methodology. How can you do that? There are two methods. The one used by Hendry is the method of nested model testing. In the nested approach, we make a big model which includes
both of the other models as special cases. Consider, for example, the health model and the FDI model; both are explanations for economic growth. One guy is saying that economic growth is a function of FDI and literacy and the cross product, and the IPR model is putting in some other variables, let's say just IPR and investment for simplicity; and let's put in the third one also. So there is one model where growth is a function of literacy and FDI, another where growth is a function of health, and another where growth is a function of intellectual property rights. Now I build a big model. In this model I must put in all key theoretical variables; if I don't, the model is dead from the start. So one part of the literature review is that you have to go to the theoretical papers, the ones which have not necessarily run regressions, where the theory explains how GDP growth occurs. One of the theories is Keynesian theory, which says that GDP growth depends on investment, but there are other theories of GDP growth as well. So you look through the literature; you actually comb the literature, search the literature.

Now, people who are doing a literature review often have no idea what a literature review is or why they are doing it. What they do is say, okay, here's this article and here's what it says, and here's that article and this is what it says, and they just make a list of unrelated pieces of articles. They treat this as the rasm, the ritual by which we start the ceremonies: we just cut and paste descriptions of 20 different articles, like the opening recitation of the Quran; we have to do this, and then we start the thesis. There is no relation between those 20 papers and anything that is written in the thesis. No: when you are doing your literature review, you are looking for very specific things. You are looking for regression
equations, because you have to beat every regression equation on a similar topic in order to make your thesis. You are also looking at theories to find out which variables the theory says are important, because if theory says a variable is important you have to put it in, regardless of whether anybody else has. In this case Keynesian theory says investment is important, so you have to put it in, because if an important variable is missing your regression is wrong.

So encompassing says you make a big regression model which includes everything as a special case. We put in a model for GDP growth in which we have investment, then the health variables, then FDI and literacy, then intellectual property rights; all of those variables are put in. In this big model every model is a special case, in the sense that the FDI model says the coefficient of IPR is zero. If the FDI model is right, then, since it has excluded IPR, it is saying that IPR doesn't matter; otherwise IPR should have been in the model. Similarly, the health model is saying that IPR does not matter, and the IPR model is saying that health and FDI do not matter. So every model is a specification of zeros: a model says that these variables have zero coefficients. Now, all of you, if you have had your regression analysis course, know how to test the null hypothesis that some variables have zero coefficients. There is an F test for this, and all regression packages have an implementation. So you can test all three hypotheses; each model is one hypothesis. Sometimes it will happen that one of the models is accepted and the other two are rejected. This doesn't happen often, but it can happen. Most often, all three models will get rejected. So if all three models are rejected, then all
three models are wrong. Then what happens is that within your big model you look for a simple model which is not rejected by the data. Maybe the model that works is the one which has investment and FDI but no IPR and no health; then this is your model. So when you present it, you say: here is my model, GDP growth is determined by investment and FDI but not by IPR and health. This is not just that you took those two variables, ran a regression, and came up with the results. You have actually looked at all of the million models, because you put all of the variables in there, so all of the million models are special cases of this model. You have found a model which is equal in power to the big model, equal in power to all of the million regressions, and it is not rejected by the data. This is the best you can do, in the sense that you can never find the true model.

This is one of the famous issues. Have you heard of Karl Popper? Karl Popper is a philosopher of science. Basically, what happened in Western philosophy of science is that they had the idea that science leads to true results and that all true statements can be discovered by science. So they tried to develop a philosophy to prove this, and they tried and they tried, and they came up with a philosophy called logical positivism. If logical positivism is true, then science is the only way to get to true knowledge; everything else is just guesswork. For a long time logical positivism was believed to be true, because Westerners are very ideologically attached to the idea of science, that science must be the only producer of truth. This philosophy proved that science was the sole route to truth, so they believed in it, even though the philosophy itself was not very well established. Ultimately this philosophy was studied by philosophers and
examined in detail, and it was found to be false and ultimately rejected. Many people showed the many flaws in this philosophy, and one of them was Karl Popper. Popper showed that you can never prove anything true by the scientific method; you can only prove things to be false. Basically, the scientific method relies on observations. Suppose that for one million years, or one billion years, we observe every day that there is a sunrise. The scientific method says there is a scientific law, that the sun will rise every day, and we have one billion observations that this is true, so we can predict confidently that tomorrow the sun will also rise. But the problem is that this method of proof, which is called induction, is not valid, because tomorrow the sun could go nova and there could be no sunrise, or a meteorite could come in and destroy planet Earth, and there would be no more sunrise. So even though some event has been going on regularly for one billion years without any change, there is no guarantee that it will happen tomorrow. That is why you can never prove a scientific law to be true, but you can prove it to be false.

For example, people in Europe saw that all swans are white; they didn't see anything else. In the philosophy of science this is given as an example that there are two types of truth. There is mathematical truth, which you can prove, like the Pythagorean theorem: you can prove it by logic, so it is automatically true and doesn't depend on observations. You don't need to draw a triangle and check, because you have proven it logically. And then there is empirical truth. These are called analytic and synthetic: one type is analytic and the other type is synthetic. The synthetic truth is what you observe. So, for example, there is a law that all swans are white, because all swans in Europe are white. But when the
Europeans went to Australia, they found that there is a black swan species, and this is a genuine swan, in the sense that it can interbreed with the white swans of Europe. This is why there is the phrase "black swan": a black swan means something completely unexpected. So again, this is an illustration of Popper's idea that you can disprove things but you cannot prove things. When I put up a theory which is better than all previous theories, then so far it is the best theory, but just like all scientific theories, tomorrow somebody may come and disprove it. You can never make a truth guarantee; you can only say that up to now nobody has found anything better than this. That is the best you can do with a scientific theory.

So the key idea of the Hendry methodology is that when I produce a model it must encompass; "encompass" means covers or surrounds. I must encompass all existing theories: my theory must be better than everything currently on the ground, and I must be able to prove this. One way to prove it is the method I have told you about, which is called nesting models. I find a big model which nests all the other models within it, then I test all the models and find a best model, and this best model is automatically better than all existing models. The other method uses the theory of non-nested model testing. Here, if I have a health model and an intellectual property rights model, I don't need to make a big model which covers both of them. There are a number of different tests which you can use, and a number of different methodologies. All of these are quite complicated, much more complicated than the nested model theory, but you can go to the books, find the tests, and run them. So you don't have to follow the Hendry methodology in terms of nested model testing; the key is that you have to test. So now, what has been the
experience? This Hendry paper was written in the 1970s. Interestingly enough, the DHSY model, that is Davidson, Hendry, Srba and Yeo: when did they write it? I have the dates; it was in the late 70s, '78. Now, they actually completed this research in about 1975, and they found a model which was better than three existing models. There were so many models out there, and in fact this is a useful technique: instead of covering all the models, which would have taken a lot of time, they took classes of models, groups of models. Say there are some people who say these are the relevant variables: these ten authors have studied social variables, for example, and have put in models related to freedom and health. Then there are people who say exports, so they have put in models related to exports, imports, trade, a huge number of trade statistics. I'm just giving you an example of how to do this kind of analysis for yourself, because if you go out in practice you will find too many regression models to compete with. So what you do is group them into categories: you say one group deals with this kind of variable, and so on, and then from each category you pick one. This is what they did. They said: here is one type of model which deals with transfer functions, here is one type which deals with stationary models, here is one type which deals with the interest rate, and others; one dealt with seasonal change. So they picked one model from each category. They took three models, H, B and W, and said: our goal is to build a model which will beat all three. And they succeeded; they built the DHSY model, which was better than all three of these models. But it's not enough to be better than the rest; the model should also be good on its own. So when they did forecast testing of this model, they found
that the model fails: the forecasts are very bad; they are rejected. You see, with a forecast, you have a model, and the standard error of the model is, let's say, 25. That means that when you make a prediction from C = α + βY, the error should be within plus or minus 25, which is one standard error, or within plus or minus 50, which is two standard errors. If the error is 100, that's four standard errors away, and that means your model has failed. And the key thing, very important, is that you have to make out-of-sample forecasts. If you estimate the model on a data set, then all of that data is automatically fitted. If you make a within-sample forecast, say you omit one point and forecast it, that data is already in your estimate, so it's not a serious forecast. A serious forecast is when I estimated the model up to 1975, and I didn't have the data for 1976. Now the 1976 data comes in, with GNP and C, and we look at how well our model fits it. There you have no control; you have not looked at this data. So if your model works well on that, then you have some satisfaction that your model is good.

Now, the model failed, and it did not just fail in a small way; it failed in a major way. So they couldn't publish the paper, and they kept trying to find a way to improve the forecast performance. Ultimately they found that one of the things happening in that period was financial liberalization. What that means is that it becomes easy to borrow money from the banks, and one of the ways the banks made it easy to borrow was this. What was happening in London, and also in other places, was that the value of houses had increased a lot. People had loans on their houses, but the loans were taken out on the original value, which was quite small, so people had a lot of equity in the house: the house is worth a hundred thousand and they have a
loan of twenty thousand remaining on the house, so they have eighty thousand pounds' worth of equity in the house. What the banks introduced at that time was housing equity withdrawal: you can take a loan if you pledge the house as collateral. So a lot of money that was illiquid and unusable became liquid and usable, because people could borrow against the value of their homes. When they put a variable like this into their model, the forecasts became correct, and so then they published the paper, and the paper is published like this.

But the thing is that this last step was actually against the Hendry methodology, because the Hendry methodology says you cannot add an ad hoc variable at the end to fix things. It says that you should start with a big model which includes all models as special cases, and from there you should come to the right model. Now, if we take all the ad hoc variables which are missing from this, there are too many. If we say that financial liberalization is the missing factor, then there are at least 10 or 20 or 30 variables which relate to financial liberalization. So if you truly and sincerely followed the Hendry methodology, you would have to start by putting in all of those variables and finding the one configuration which is better than the rest, and this they did not do. So this is a practical weakness of the Hendry methodology: although it makes a very grand claim, and it is very useful, and among the existing methodologies it is the best available, still it is not the right answer and not the correct methodology. In fact I am giving a faculty course later on in which I will explain, first of all, what the Hendry methodology is, because most people don't even know that, and then what the defects and problems with the Hendry methodology are, how we can fix them, and how we can get to regressions which can be taken seriously, not as a joke
of the kind where, okay, I run newspapers. And actually this is true: if you run a regression of GDP growth on the number of newspapers being published, you get a very strong correlation; it is true worldwide. And so the best way to improve our GDP growth is just to publish more newspapers.
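
The nested-model comparison described in this lecture can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure from the DHSY paper: the data are synthetic, the column names (investment, fdi, health, ipr) and the helper functions `ols_rss` and `nested_f_stat` are hypothetical, and the data are generated so that growth truly depends on investment and FDI only. Each rival model is written as a set of zero restrictions on one big regression, and the F statistic for those restrictions is computed from the restricted and unrestricted residual sums of squares.

```python
import random


def ols_rss(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    Adequate for a small illustration; real work would use a regression package.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(X, y))


def nested_f_stat(X_big, y, keep):
    """F statistic for H0: every column of X_big NOT listed in `keep` has a zero coefficient."""
    n, k = len(X_big), len(X_big[0])
    X_small = [[r[j] for j in keep] for r in X_big]
    rss_big = ols_rss(X_big, y)      # unrestricted (encompassing) model
    rss_small = ols_rss(X_small, y)  # restricted (rival) model
    q = k - len(keep)                # number of zero restrictions
    return ((rss_small - rss_big) / q) / (rss_big / (n - k))


# Synthetic data (an assumption for illustration only).
# Column order: 0 = constant, 1 = investment, 2 = fdi, 3 = health, 4 = ipr
random.seed(0)
n = 200
X_big = [[1.0] + [random.gauss(0, 1) for _ in range(4)] for _ in range(n)]
# Assumed truth: growth depends on investment and FDI only
growth = [1.0 + 2.0 * r[1] + 1.0 * r[2] + random.gauss(0, 1) for r in X_big]

# Each rival model is the list of columns it keeps; the rest are restricted to zero
candidates = {
    "fdi_only": [0, 2],
    "health_only": [0, 3],
    "ipr_only": [0, 4],
    "investment_and_fdi": [0, 1, 2],
}
f_stats = {name: nested_f_stat(X_big, growth, keep)
           for name, keep in candidates.items()}
for name, f in f_stats.items():
    print(f"{name}: F = {f:.2f}")
```

Each F statistic would be compared with the critical value of the F distribution with q and n - k degrees of freedom, taken from tables or a statistics package. In this synthetic setup the three single-variable models should be strongly rejected, while the simplified model with investment and FDI should typically not be; with real data none of this is guaranteed, and a model that survives is only the best so far, not the true model.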