[...] yesterday; I'm not updating my paper. And there is the SWIID, which dates from late last year. So they are put out by UNU-WIDER and Frederick Solt. Basically, SWIID is WIID plus some extras, and Nora told you some things about that.

They are great. People like them because they cover a lot of countries, and you can see it's around 170 countries in both cases. They cover a lot of years: WIID goes back furthest, and SWIID goes forward a bit more because he started later and has had the chance to update some more, but he has a 1960 cut-off. WIID has got four quality ratings, so things rate from one, very good, down to two and three, and then four, don't know, whereas quality ratings are not really used in SWIID at all, except maybe for the 1960 cut-off, in the sense you might argue that the past is something we don't know about so much.

One thing you'll find out immediately using WIID is that there are Ginis that are based on a huge number of different definitions and sources, and I'm going to come back to that issue a lot in what I'm going to say. On the other hand, you pick up the SWIID: fantastic, guys, there's just one Gini, or several, one for net income, and it's standardized on a LIS definition, the standard LIS definitions that Marcus will probably remind us about later. There is the net income Gini, there's a gross income Gini, and the difference between them is redistribution. I'm only going to be talking about net income Ginis today because there's quite enough to say about those.

If you use the WIID, one of the problems you'll immediately find is that there are lots of gaps in terms of countries and years, and that might be a problem for you if you want to address particular issues. So SWIID may seem an amazing bonus to you. Why? The gaps are all filled in, folks. There are no missing country-year observations. Basically what is done is that, as Nora has hinted, there is a multiple
imputation model that is used to fill in the gaps. And as Nora reminded you, every number in the SWIID is made up using an imputation model, and I'm going to come back to that. In fact, with the multiple imputation, in the main file there are 100 multiple imputation values for the Gini, and there is a summary file which contains the mean of the imputations, and I might show you that later. People using the previous versions have pretty much all the time ignored the multiple imputation aspect of the data, the imputation variability, and focused on effectively the mean values.

I've talked about the advantages and disadvantages in the paper. I basically assume that you know all the advantages. I also talk a lot about file content and documentation; I'm not going to talk about that at all today. Basically I'm going to focus on potential disadvantages and problems, illustrate them, and suggest what you might do, make recommendations. And because Tony's going to be pushing me for time, I'm going to give you my conclusions first and work on from there.

So the first thing: we've been in this place before. Atkinson and Brandolini, in a couple of very good papers, have reviewed the WIID predecessor, the Deininger and Squire data set, and all the issues that they raised there about comparability and data quality. Basically the first task I set myself was to ask: have these issues gone away with more modern data sets? And I'm afraid the answer is no; the issues remain very relevant.

I've got two conclusions for WIID users. First, people must report the details of their country-year selection algorithms and justify the choices that they made. There is an amazing number of papers out there where it's very difficult to find out how people have chosen the observations, and I'm going to show you that's actually very important to do. Secondly, people do actually acknowledge most of the time that there are these non-comparability problems, and a lot of the time
they use something called a dummy variable adjustment process, which I'll talk about a bit later on, but basically it's too simple, and people need to wise up and use more sophisticated approaches.

The SWIID: well, I'm afraid I'm going to disappoint the IMF, for I'm basically going to suggest that they should rethink their strategy. My headline conclusion for the SWIID is that it provides, quote, plausible data but not sufficiently credible data, unquote. There are two things you could think about. One is basically about the point estimates, bias in other words, and that's driven essentially by the imputation model, and that's where I've got my main concerns. The other aspect, rather interestingly, is the precision issue: if you ignore the multiple imputation nature of the data when you derive your estimates, that doesn't screw things up very much, which is pretty amazing really, but that's how it seems to me. So overall I recommend the WIID rather than the SWIID, but that support is very conditional, and it refers back to points two and three, which are major points.

We've already heard a lot in the talks about where we can get differences across data sets: non-comparabilities, issues of data quality. I've summarized them on the left-hand side in terms of differences of distributions and on the right-hand side in terms of differences in data sources and the sorts of adjustments that people might make. So for example, in the resource measure, income, blah blah blah, a big difference that will come up is the difference between gross income or market income and net income, but there are all these other headings. Another thing to know about, of course, is that the WIID provides a series of variables which help you identify the differences on these various definitions. What that also means is that you can take different combinations and work out how many potential data series you have, and the answer is a lot, as I shall show you. But first, the thing we have to remind ourselves about in all
this is a real trade-off problem, and that is between quality and coverage. A lot of people coming into this area basically want to have global coverage, but the headline message here, and we've heard it already today, is that the more global you want your coverage, the greater the prevalence of poorer quality data that are included. So I've extracted a table from the paper that has time periods across the top and, in rows, regions of the world. The difference between the top and the bottom panel is that the top contains all the observations that are in WIID and the one at the bottom is just focusing on the good quality observations, quality equals one. The first thing you'd note in the top is that pretty much all cells of the matrix are populated, and boy, there are a lot of numbers. However, if you go to the second panel, you immediately note that there is a substantial drop in the number of observations. That means if you work with the top, you're including a lot of poorer quality observations. Moreover, that selection is very selective: look where the entries are concentrated in the bottom half of the table. Basically you start losing all the observations for Africa, and you lose a lot of developing countries, which might be the ones you're interested in. So that is a trade-off conundrum, if you will, that has to be faced up to, and I don't think any manner of adjustment can get around this issue, but it needs to be thought about a bit more.

So, this multiple data series thing. My headline conclusion, remember, was that we need to know about algorithms. I didn't know about these data at all before I started on this, but I know about UK data, so I looked at the WIID for the UK. That's the 99 observations on the left-hand side: time on the horizontal axis, Gini coefficients on the vertical axis. Each dot is one of the observations; the black diamonds are the quality-equals-one observations and the hollow
ones are the other ones, and basically you see that we get this concentration issue. There are a lot of black dots, but moreover, for each of the different years you can still find repeated observations per cell.

Just to ram that home, let's go to Finland. Why should we go to Finland, other than the fact that we are here, of course? Part of my point is that these are countries where it's typically thought that the income series over time is of relatively high quality, so if we've got problems here, we're pretty much going to have problems in any other country. If we just focus for Finland on the quality-equals-one observations, tidy up some of the variable names in WIID and do things like that, we can still come up with four times three plus one is 13 different series. This is just out of the WIID. So which one do we choose? We need to know, and quite often, folks, we don't know. And so in any given year here, although you tend to see a sort of U shape there, you'll get different views about the precise picture, about levels and about trends.

What about comparing with external benchmarks? This has been a theme already in many of the talks. Have things changed very much? The answer is, well, things have got better since the Atkinson and Brandolini JEL article, but even then there are some quite marked differences that exist. This picture focuses on one year, around 2000, where I'm comparing WIID observations with observations that come from the Eurostat online database and from the LIS Key Figures, and here we're restricting ourselves to observations that are essentially the same as the LIS Key Figures ones. So it's all pretty homogeneous, supposedly, in terms of definitions, and of course that restricts the number of countries if we want to have three observations. But if you look at the heights of the different bars for these European countries, you
can see that there are some quite large differences in terms of percentage points. If we take two percentage points in the Gini, that is quite a large change; most countries don't change by that much every year. You can see we get differences of up to four to five percent in some years. So even then there are differences, and if you were to use the different series, it would change rankings and so on.

What about a developing country? Well, this is China in the WIID, and this is even if you're focusing on the observations where we're not taking account of the differences between rural and urban and so on; this is just focusing on the whole country, everybody. We have this quality-coverage conundrum even within this country. If you want to look at a long time series of inequality that includes China, then, folks, you have to go down to quality equals three rather than quality equals two, that is, the triangles rather than the squares. That gives us the series with the funny drop down here, which makes you wonder what's going on. Moreover, on this point about having multiple observations per year: even when the WIID definitions suggest that you've got exactly the same income definition, you get really big differences. Look at 1995: a huge difference in the Ginis that are going on here. And even if you focus more and more on observations that are even more consistently defined, so these are the ones that are filled in rather than hollow, and there are a few of them around in these pictures, then you still get big differences. In 1995 here you've got some oddities, and so on. So there are big differences in it. If you pick up an article, for example by these guys, and I quite like the article in many ways, they also do some benchmarking against WIID, and their picture looks quite different from mine. I've no idea why.

So that's WIID. What about the SWIID? How is it generated? Remember, the key thing is that all the observations are
imputed, and the idea is to use an imputation model. First of all, the selections and exclusions: as I said, get rid of the pre-1960 WIID observations and so on. Then here's the imputation procedure. It is really complicated, so there are pages in my paper about the details, because the details are very important, but the idea is actually very simple.

Let's just suppose we have two data series for the Gini coefficient for a large number of country-year observations. One is based on gross income, or market income if you will, let's just call it gross income, and the other one is on net income. But the problem is that some of the estimates are missing for the net income Gini. So what do you do? Your identifying assumption is the following: you assume that the ratio of the Gini for net income to the Gini for gross income is constant, common to all countries within a particular group of country-year observations. That provides you with donors, if you will, and that gives you an estimate of that ratio, call it rg. Then, given that ratio, you go to the missing folks in the cell, multiply the value that you do have by the ratio, and out comes your number. That's an imputation.

It's much more complicated than this. It's regression based; there are more than 20 different data types, so many different series; lots of different definitions of groups, which are rather unclear; and various other steps, including smoothing. And as a bonus, Solt also provides the share of the richest 1% in the top incomes database sense.

There's a basic problem, though, with this imputation idea about this constancy within groups of observations, and by the way, it is just a multiplicative version of the dummy variable adjustment method that people use. Basically, I would argue that there are two competing demands that can't both be met. You have to group observations in order to have donors, to provide the values that are going to be imputed to the missing observations.
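The simple ratio idea described in this part of the talk can be sketched in a few lines of Python. This is only an illustration of the simplified two-series version, not the SWIID procedure itself, which the speaker says is regression based and far more complicated. The function name `impute_net_gini`, the dictionary keys, and the example numbers are all hypothetical.

```python
from statistics import mean


def impute_net_gini(observations):
    """Fill in missing net-income Ginis within ONE group of country-years.

    Each observation is a dict with 'gross' (always present) and
    'net' (a float, or None when the net-income Gini is missing).
    Assumption: the net/gross ratio is common to the whole group.
    """
    # Donors are the observations where both Ginis are observed.
    donors = [o for o in observations if o["net"] is not None]
    if not donors:
        # With no donors the group ratio cannot be estimated at all.
        raise ValueError("no donors in this group")

    # Estimate of the common ratio (the 'rg' of the talk).
    ratio = mean(o["net"] / o["gross"] for o in donors)

    completed = []
    for o in observations:
        missing = o["net"] is None
        net = ratio * o["gross"] if missing else o["net"]
        completed.append({**o, "net": net, "imputed": missing})
    return ratio, completed


# Made-up numbers, purely for illustration.
group = [
    {"country": "A", "gross": 50.0, "net": 40.0},
    {"country": "B", "gross": 40.0, "net": 32.0},
    {"country": "C", "gross": 45.0, "net": None},  # to be imputed
]
ratio, completed = impute_net_gini(group)
```

The `ValueError` branch is the limiting case of the grouping problem: the finer the groups, the more often a group ends up with no donors.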
And of course you want the groups to be as big as possible, so that you get a reliable estimate within the cell, and to have some observations after all. But then of course you need as many groups as possible to take account of the acknowledged variation in the ratio of the Gini for gross income to net income. The trouble is that if you have more and more groups, the average group size, the number of countries and country-years, goes down, so in the limit you have no donors. So it's this sort of problem, and basically I don't think you can solve it given the data that are available; I don't think it can actually be met satisfactorily in practice. So the assumption that's basically built into SWIID is likely compromised.

I don't think WIID users should smirk, however. The fact that there is this non-constancy reminds us that WIID users who work in terms of Gini differences with their regression adjustments need to look out, and they need to be much more sophisticated than they have been in the past. So it's a judgment call on both groups, I understand.

There are other issues with the SWIID's imputations that I don't like: the smoothing, the definition of the series. There is a genuine bug in the computer code for the share of the top one percent series, so simply don't use these data, and it's unfortunate that Solt, who knows about the bug, has not advertised it. On the other hand, I do applaud the fact that he provides replication material.

I'm going to show you some evidence from the SWIID to show you the sorts of things that you get out of it. This is for a net income Gini. How can you benchmark the SWIID? Well, you can try and do this benchmarking exercise again and see, for situations where you think you've got good data, what the SWIID shows you. So the sort of thing you would get is: here's the Gini, here's time, and these dots here, the
dark dots with the squares, are the means of the hundred imputations for each year. For people who have not taken account of imputation variability in the past, I think the IMF, those are the numbers that are used in their regressions. And there are probably good reasons why they didn't, because in the previous version of SWIID the multiple imputation bit was not as easy to use as it is in this later version. The gray dots summarize the imputation variability: this is the whole range of the imputations, that's 100 gray dots in each year. I much prefer to show it this way. Solt refers to confidence intervals; that to me is inappropriate language, because they're not confidence intervals based on standard errors in the sampling variability sense. These are different.

So you can see the series here for Finland; it's benchmarked against LIS Key Figures. What you note here is that we're going before the WIID series, so this is where we're drawing on the imputation bit, drawing information from across countries and years elsewhere, and it's, well, reassuring that the imputation variability is a lot larger. If we look at the other end of the series, again we're pushing out, and here we have the SWIID imputations, but here he's not even in line with the LIS Key Figures. What's going on here? (I'm going to finish, Tony.) The difference here is that these two LIS Key Figures were not available to Solt when he generated the data, so he wasn't able to standardize his series. However, his out-of-sample prediction is a bit wide of the mark. The other thing is that in previous years the levels and trends are also different. So if you think about throwing these numbers into a regression model, the numbers and the patterns, the relativities compared to different countries and different time periods, are different, and that's going to affect estimates. If you
do this for the UK, you get a similar message. I've suppressed the imputation variability in these pictures. These numbers are the official income statistics in Britain; the only difference between them is a change in the equivalence scale, so the series march together in parallel, virtually the same. The SWIID gives a rather different picture, and I don't believe the SWIID story.

China and Kenya. You might have been quite impressed by the relatively narrow range of imputation variability in the SWIID that came up in those earlier pictures. This is China in earlier years, and look at Kenya here: look at the range on this picture, by the way, from a Gini of 20 right up to 90. So you can get huge imputation variability in some countries, and the more that countries like this are put into your imputation model, the more it's going to have an effect on your estimates. Moreover, there are differences in levels and trends, to the extent that you can trust the WIID observations in these two, because remember, when you move into the developing countries you're moving down the quality scale. But interestingly, for example for China, this is the WIID, and the SWIID estimates here correspond to the trend that's shown in official statistics. But in fact, if you go back to the Xie and Zhou article that I was referring to before, the PNAS one, a nice part of that article is working with recent household surveys and suggesting that the trend is in fact better represented by where the WIID one is pointing, which is further up.

The final part of the paper is a series of regression illustrations to investigate the sorts of things that are going on. I've taken an illustrative example based on the Blinder and Esaki type of literature about the relationship between inequality and several other things, a macroeconomic approach, and done some benchmarking, substituting in other information, good quality information, when we have it, to see whether it makes a difference. The bottom line is it does
make a difference. Part of that is because you're focusing on a homogeneous set of countries, but once you get into a homogeneous set of countries you might as well not use the other sources. The other thing, with the SWIID in the regressions: Nora wants an answer to the question, can you still use them in the regressions? We cannot tell, because by definition for a lot of the other countries we don't have an external benchmark reference point; the observations are missing. On the other hand, given the benchmarking that we can do against good quality series over time, you've heard what I've said. But also, if we look at precision: if you simply ignore the multiple imputation variability, or, interestingly, if you take proper account of it using, for example, the mi estimation suite in Stata, then you get the same answers. So the issue to me is potential bias, not precision. So I've come to the end, and there are the conclusions again. Thank you very much.
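For readers who want to see what "taking proper account" of multiple imputation variability involves, here is a minimal Python sketch of the standard combining rules (Rubin's rules). The talk itself used Stata's mi suite; this sketch is not that code. It assumes you have already run the same regression once per imputed data set and kept, for one coefficient, the estimate and its squared standard error from each run. The function name `pool_rubin` and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation results for one coefficient.

    estimates: the coefficient estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations, one variance each")

    pooled = mean(estimates)       # pooled point estimate
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variance across imputations (m - 1 denominator)

    # Total variance adds the between-imputation part, inflated by 1 + 1/m.
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


# Made-up numbers for three imputations, purely for illustration.
coef, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Using only the mean of the imputations, as in the summary file, amounts to dropping the between-imputation term from the variance; the point made in the talk is that in these illustrations doing so changed the answers very little.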