Thank you very much, Tony. It's a great pleasure to be here today, and I want to start by thanking Finn Tarp, who convinced us to have this panel. It is going to be a preview of what's to come, because what my co-panelists and I are going to discuss is a project that we have at the Journal of Economic Inequality to produce a special issue devoted to appraising existing databases on inequality.

This is my co-conspirator, who couldn't be here today because he has to deal with more important things as chief economist for Sub-Saharan Africa at the World Bank: Chico Ferreira. Many of you know him very well. I also want to mention our research assistant, who is a PhD student at Tulane University. The databases that we included in this project are nine. They're not all of them; in fact, we did not include two of the ones that are going to be presented tomorrow, I think. So it's going to be interesting, because hopefully this will be the beginning of a process of doing this kind of work rather than the end. We hope to have it published online by the end of 2014 or early 2015, and today's presentation is a preview. You get the dress rehearsal, so to speak.

Let me start by telling you why I think that this kind of assessment is not only desirable but necessary, and give you some anecdotal evidence of how the idea of having this special issue came about. Over the years I have always been very concerned with the fact that we have inconsistent data, depending on the source, on what's happening to poverty and inequality, and on many other fronts as well. In particular, it has always worried me that you can come up with very different narratives. If you ask the question of what happened to countries during a crisis, for example, you can often come up with different narratives if you use
different sources, and I think that's very dangerous, both from the point of view of knowledge creation and from the policy implications of what you learn from that information.

The anecdote of how the idea started is this. Chico and I were on a panel at Yale University organized by former President Zedillo of Mexico, who is now the director of a center there. It was on Latin America, and we were talking about the same things that you talked about today: the declining inequality in Latin America, how great it was and how happy we were. And somebody raised his hand and said, "But you know what? The IMF says that the decline in inequality in Sub-Saharan Africa is more prominent than the one in Latin America." Chico and I looked at each other and said, "What's that? We never heard of that. What kind of data is that?" The person who raised his hand, by the way, was a former president of the central bank of Chile, a very well-respected economist, José De Gregorio, so we took it seriously.

So I went back to look at the data. This is in the Fiscal Monitor, not in the paper that Andy presented today. The Fiscal Monitor is a report that the IMF produces regularly, and yes, the data that they had there showed that inequality had declined in a number of African countries. Then I said, well, let me go and check in PovcalNet. I went to the World Bank database, because I thought that was what they were using, and I saw that the numbers were very different. So I went back and talked to people at the IMF. I asked where the data came from, and they told me it came from Frederick Solt's database, which is, as Andy described it today, primarily imputed data. It is based on some existing data, but all the data in that database are imputed. So I started asking a little more, but then I said, well, this is imputed, and I know nothing about imputation methods, so who am I to
say whether it's right or wrong? So I said, let's go for it: let's have an assessment of databases, but of most or all of the ones that we generally use, so that we can have a go at what's going on with these databases. We tend to use them a little blindly. All of us do, because we want to use the data, we're not data producers, and we don't want to spend a lot of time on what's behind the data. If there is something that tells us what happened to the Gini coefficient before and after government taxes and transfers, that's heaven for most of us. So we thought this was going to be a good public good, and we invited a bunch of people. At first I thought it was going to be just a forum section of the journal, and then Chico, who at that time was the editor, said, "No, no, let's do the whole issue." So that's how this started.

The first thing that I am showing you here is that we distinguish between three categories of data sets: the microdata-based data sets, the ones that use secondary sources, and the ones that use imputation. In total we have nine. So which are they, and who are their reviewers? In blue are the ones that you're going to hear from today.

We have CEPALSTAT, which is produced by the UN; the reviewer is François. IDD is the Income Distribution Database from the OECD; it has been reviewed by Leonardo Gasparini and Leopoldo Tornarolli. For LIS, Martin Ravallion did the review. PovcalNet and the World Development Indicators, which are from the World Bank, are reviewed by Tim Smeeding and John Latner. Then there is SEDLAC, which is the one that Andrea, as he mentioned today, used for his edited volume, and which I use a lot as well. And there is the World Top Incomes Database, which Andrea Brandolini is writing the review for. So numbers one to five are microdata from household surveys; the last one is the one that uses tax returns, as we all heard today. Then the secondary-source-based ones are All the Ginis, produced by
Branko Milanovic, and WIID, from our hosts. And the imputation-based one is SWIID, which, although it uses the same acronym, as I think we're going to hear today, is not really only WIID. We don't know exactly whether they are cousins or nothing; Andrea says nothing. But Stephen Jenkins is going to discuss SWIID and WIID.

If you want to know where they're based institutionally, you can look at my slides. Some are at the UN, the World Bank, or the OECD, and some are private initiatives that are funded primarily by foundations.

So which databases are not included here and are worth mentioning? There is the UTIP project by Galbraith. I don't know if he's here already; I think I saw it on the program for tomorrow, or maybe it's no longer on the program, I don't remember. He has a very nice data set that they've put together. Then there is the GINI project; by the time we started working on this, it was not available. Then this is my website, which I excluded, and not because it's mine. I'm willing to be subject to the same scrutiny, but we had not really started posting the data until recently, and it has only very few observations. And then there's a new one that I think is also going to be presented here, the Global Consumption and Income Project.

We asked the reviewers to follow a sort of playbook in which they could use these criteria as a guide: accessibility and user-friendliness; quality of documentation; reliability and accuracy of reported indicators; and transparency and replicability, which I think are very important when we are dealing with data sets.

So how many are there? These are the data points, country-years with primary source data, so there are about 1,346. This is the regional breakdown. And because in some cases you have more than one database producing results for the same country in the same year, though not necessarily the same indicator, you have more in the second column, in particular because Latin America and the
Caribbean has essentially two main initiatives that produce data for the region: CEPAL, from the UN, and SEDLAC.

So what do we want to know about the databases? Most of us, when we're working on something that needs information on inequality and poverty, want to know which indicators are available, what the country coverage is, and what the period coverage is. However, more sophisticated users also want to know which welfare indicator is used: income or consumption, per capita or equivalized, total or monetary. That makes a lot of difference, by the way. Whether you use a measure that includes imputed rent for owner-occupied housing and autoconsumption can change the direction of the change in a Gini coefficient, for example before and after taxes and transfers. You may also want to know the statistical significance of what's put out. Then you probably want to know whether the income concepts are homogenized, which is the reason that Andrea gave earlier today for choosing SEDLAC over CEPAL. Are the indicators calculated from unit records or from grouped data? Are regional price differences taken into account? By the way, I discovered that some are doing it without that being obvious on the face of it. What is the definition of a household?

In my case, I always want to know what they did to the data, the data adjustments. Because of Latin America, and I'll show you why, I'm particularly interested in what they do with the underreporting problem, particularly underreporting at the top. Do they correct for it or not? How do they do it? Then you have top coding, and the treatment of extreme values and of zero or even negative incomes. And if you are someone who is working on a particular country, you may also want more information about the survey: the sample design, the questions, the recall period. Are they comparable across countries and over time? And is it possible to have access to the microdata? I think,
from all this group, the only one that gives you that option is actually LIS, formerly known as the Luxembourg Income Study project. Its main objective is not to produce data but to produce homogenized microdata that researchers can use remotely.

So let me give you a few highlights. The indicators tend to be more or less the same. One surprising thing is that not all of the databases that use microdata report statistical significance, which was a surprise because I thought that was standard. I'm going to give you just a preview of the kind of things that you're going to hear.

When you look at whether they use income or consumption, and these are again the microdata-based ones, the surveys, practically all use income. The only one that has some consumption and some income is PovcalNet and the World Development Indicators. All the others have income, because that's what you get in many countries; in Latin America in particular, it's much more common to have income-based surveys.

Then, because I'm working now on tax incidence and transfer incidence, I was interested in whether they include estimates before and after taxes and transfers. It turns out that even though there are two that have the before estimates, there's only one, again among the microdata-based ones, that produces results systematically before and after, which is the OECD.

On differences in prices by region, it looks like the only one that's doing it is SEDLAC, and I didn't know that until I asked. They adjust all the rural incomes by, I think, 10 or 15 percent, by definition, to get the rural poverty and inequality estimates.

Then, in terms of the treatment of the data, do they correct for underreporting? It looks like the one that definitely does it, and almost systematically, is CEPAL. Then there's some of it in the OECD and the World Top Incomes data.

Now, is the documentation sufficient to replicate results? That's
one of the things that probably made Andrea and me stay away sometimes from using CEPAL, because you don't have the documentation to know exactly what was done to each of the data sets used to calculate the inequality indicators. And the only one that gives you access to microdata, as I said, is LIS.

All right, so let me now focus on SWIID, which, like I said, was the inspiration to do this. I am no expert on imputation methods and I probably never will be. But I want to know: is the description of the imputation method sufficient to replicate? Yes. It's so well documented that Dan Teles, our RA, was able to replicate all the information that Solt produces. We know how to do it. Then I asked: has the method been subject to scrutiny by experts in the field of imputation? For me that's not clear enough, because the journal where this got published doesn't seem to be a journal where you have that expertise. In order for me to say I am going to use that data, I want the experts to tell me, yes, the method is acceptable. Is there a systematic validation process in place with experts in countries and regions? Again, to me it's not clear. I hear anecdotes, people such as Facundo Alvaredo from the top incomes database saying that the top incomes make sense, but there's no systematic validation, at least from my perspective. I also don't know exactly how the Gini coefficients for income before taxes and transfers are calculated. I wrote to Solt, but he hasn't responded yet. I'd like to know what method he used.

By the way, these are the sources used by the secondary and imputed data sets, so you can also see what people use to generate the various data sets.

All right, so let me give you some previews of the things that always concern me. We have CEPAL and SEDLAC. This is Latin America and the Caribbean only. There's a large
overlap: lots of countries and years overlap. They both calculate Gini coefficients directly from microdata. The important difference in terms of methodology is that CEPAL corrects for underreporting, in a way that one could think is reasonable, but we only know the general method. I can go into that during the questions and answers if you want.

So how likely is this difference to affect our analysis of levels and trends in inequality in Latin America? The results are similar in terms of trends, and the data points are quite correlated. However, as expected, because CEPAL corrects for underreporting, particularly at the top, the inequality levels tend to be systematically and significantly higher in CEPAL than in SEDLAC, which does not correct for underreporting. And since it's not well documented, like I said, I cannot replicate it and I cannot compare. If you take the difference between the CEPAL Gini and the SEDLAC Gini, that's what you get, and the average difference is the red line. So you'll find most of the Latin American countries to be much more unequal if you use CEPAL. And in some instances, if you look at specific data, you can also get trends in opposite directions.

What about LIS Key Figures versus the OECD? They're also highly correlated. Nevertheless, when you zoom in to a particular country and year, there can be important differences, as you can see here. This is the OECD Gini minus the LIS Gini. So you do have differences which, when you're looking at a particular country for a particular period, can affect your results, and that's what concerns me.

Okay, two minutes. All right, so what about SWIID versus the others? General trends look fairly similar, and that's why I am going to ask Stephen, maybe later, to comment on this: even if the results can be misleading when we look at particular countries and particular years, can you use this for regression analysis? I'm going to leave it at that. But when you zoom in, for example, this is SWIID minus the WIID
Gini, so SWIID tends to give lower estimates, and all the points there mean discrepancies, either above or below. And this is PovcalNet and SWIID; there are also quite a few discrepancies.

But let me go to this table. I have two tables that I want to show, and then I finish. This was the source of inspiration for the special issue. I went back and said, okay, let me take a bunch of the countries in Africa. They were comparing the 1990s and the 2000s; that was the argument, primarily, I think, by the IMF. Let me take the earliest point and the latest point in this report and compare PovcalNet with the IMF Fiscal Monitor, which is based on SWIID. It turns out that we have nine countries here, and in four out of the nine, if you use PovcalNet inequality increased, and in SWIID it decreased. So in four out of nine cases you would tell a completely different story, not only in terms of levels but in terms of trends. That's scary.

These are comparisons for Indonesia, where SWIID is the little black points, and this is Jamaica. These are countries that have a lot of differences depending on the source. When I started asking what's going on, why we have such differences between PovcalNet and SWIID, Fred Solt was kind enough to send me this: the confidence interval for Kenya in SWIID. If you start at one extreme of the confidence interval in one year and end at the other extreme in the other, you would get an increase in inequality; do it the other way around and you get a decline. Therefore you really can't tell, within the percentage points that we're looking at, whether it was an increase or a decline using this data set.

Finally, I also looked at the measurement of the redistributive effect, because that's quite fascinating: you have a number of countries for which, for many years, you can check the redistributive effect. I compared it to our project, Commitment to Equity, in which
we do careful, country-based fiscal incidence analysis, and this is what I found. This is the change in the Gini from before to after taxes and transfers: SWIID, and in red CEQ, the Commitment to Equity project. These are the countries for which we can do the comparison at this point. You can see the difference between the two, and in this case in four out of 14 countries you would get a different result. When I say different, I mean that the difference in percentage points is very large, or very large relative to the redistributive change, or, as in Sri Lanka, we even get the opposite result. Four out of 14 is again a very high ratio if you're trying to use this information, and this is just an example of 14 countries. I don't know how that affects the results of Andy and his co-authors, which, by the way, I love. They're the results we all like. But I think it makes us pause in terms of how to use this information and how confident to be about the results we obtain, both when we focus on particular countries and when we do these cross-section regression analyses.

Thank you very much, and thank you for giving me a couple of minutes more, or maybe more, I don't know.
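
A minimal sketch, in Python, of the two checks described in the talk: finding the countries for which two sources disagree on the direction of change in the Gini between an early and a late year, and testing whether two confidence intervals overlap, so that the sign of the change cannot be determined. The function names, country labels, and all numbers are hypothetical placeholders for illustration; they are not values from PovcalNet, SWIID, or the slides.

```python
from typing import Dict, List, Tuple

# Each source maps a country label to (early Gini, late Gini).
# All values below are made-up placeholders, not real estimates.
Series = Dict[str, Tuple[float, float]]


def direction(early: float, late: float) -> int:
    """Return +1 for an increase, -1 for a decrease, 0 for no change."""
    change = late - early
    if change > 0:
        return 1
    if change < 0:
        return -1
    return 0


def disagreements(a: Series, b: Series) -> List[str]:
    """Countries covered by both sources whose trends have opposite signs."""
    common = a.keys() & b.keys()
    return sorted(c for c in common
                  if direction(*a[c]) * direction(*b[c]) < 0)


def trend_is_ambiguous(early_ci: Tuple[float, float],
                       late_ci: Tuple[float, float]) -> bool:
    """True if the intervals overlap, so both an increase and a
    decrease are compatible with the reported uncertainty."""
    return early_ci[0] < late_ci[1] and late_ci[0] < early_ci[1]


# Hypothetical example data.
source_a: Series = {"A": (40.0, 45.0), "B": (50.0, 48.0), "C": (35.0, 36.0)}
source_b: Series = {"A": (42.0, 39.0), "B": (49.0, 47.0), "D": (30.0, 31.0)}

print(disagreements(source_a, source_b))
print(trend_is_ambiguous((40.0, 50.0), (45.0, 55.0)))
```

Only countries present in both sources are compared, which mirrors the restriction in the talk to the countries "for which we can do the comparison".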