Thank you very much for inviting me to this event. A couple of caveats before I start. First of all, this is work in progress, so some of the data here has yet to be verified and checked, but I think it gives an idea of the direction in which we're going. The second caveat, if you will, is that it's got a rather grandiose title of enhancing the quality of income data in surveys for microsimulation models in Africa. But in fact the project is looking specifically at Tanzania and Zambia, with some additional information from South Africa, so it's an overambitious title that I've given it. And due to time constraints I'm actually going to focus mainly on Tanzania, with a little bit of South Africa at the end. I think it's become quite apparent during the course of the other presentations that the quality of income data is really important in tax-benefit microsimulation. Clearly and evidently where taxes are concerned, particularly direct taxes, quality income information is absolutely essential, but also where it comes to means-tested benefits that are tested on income, then clearly the quality of income data is also important. However, income data in many surveys in sub-Saharan Africa, whilst collected, isn't put to the test that, for example, consumption data is put to, because consumption data is regularly used to estimate poverty and inequality in most low- and middle-income countries. So the income data doesn't get tested in the way it does in developed countries or, as I'll say a little bit later, in South Africa, and so there are questions about its quality. And we found, particularly in Tanzania and Zambia, which are the two models that we were working on with country teams there, that there were several issues revealed: issues of missing income data, but also issues of implausibly high and implausibly low or zero income values.
So, the challenge: early versions of TAZMOD apparently simulated far too much direct tax, whilst MicroZAMOD simulated far too little, compared to external administrative data sources. Now, this could have three explanations. Two of them could relate to the validation data, in the sense that tax data is often reported on a cash-flow basis, not on an accrual basis. It's interesting to the revenue service to know how much tax is collected in a given year, not how much is due to be paid in that year, which would be an accrual basis. So what I mean by cash flow is that it could represent self-employed taxation from the previous year, it could represent arrears of direct taxes, and so forth. PAYE is generally on an accrual basis, because it's paid as it's earned, but the model models taxation on an accrual basis, i.e. how much tax is due in that particular year. So that might be the problem. It might also be a problem of compliance. We know, for example, from talking to revenue authorities in various countries that the informal sector, whilst it may be due to pay tax under the legislation, often doesn't, and there are difficulties in collection. So there's a compliance issue. And then finally there is the quality of income data in surveys, and clearly it's on that that I'll be focusing this afternoon. In particular, the income variable of interest is income from employment. We've selected that for two reasons. First of all, in Tanzania, for example, it's the main contributor to the over-simulation of direct taxes. If you look at the Tanzania data and look at the contributions of the various income sources, self-employment income, earned income, income from agriculture or income from other sources, it's employment income that's creating the main problem. And, more practically, what I'll be talking about is income imputation, and most income imputation methods rely on being able to select good covariates that will predict income.
We've had much more difficulty in finding covariates within the dataset for self-employment income and income from agriculture. I'm using some EUROMOD terminology here: yem is the variable of interest, the variable which represents employment income. Prior to the imputation process, we revisited the data preparation stages. We've heard from Javier and Gemma that model building is an iterative process. Well, so is data preparation. You prepare the underpinning dataset for the model, you prepare the model, and you get feedback from the model which causes you to revisit the data. That iterative process can go on for years, and I think it's important to realise this: models and the underpinning data both get better as they grow up. On implausible incomes: identifying missing incomes is relatively straightforward in most countries. A person says they're employed, but their employment income is recorded as missing. However, it can be more difficult than that. For example, in Tanzania the survey question asked what was the cash payment you last received from your job, and then there's a follow-up question: what period did this cover? The options are hourly, daily, weekly, monthly or other. Clearly, if the periodicity is missing, then you need to set the income to missing too. Manual checks in Tanzania, and I'll focus on Tanzania, as I said, showed that most of the implausibly high incomes turned out to be due to coding error. What happened was that, of roughly the 100 highest incomes, something like 70% were reported as being paid hourly. And when you're computing a monthly income from an hourly income, you're multiplying by 40 to get weekly, by 52 to get annual and dividing by 12 to get monthly, so you're actually multiplying up the error if the periodicity is coded that way.
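The periodicity conversion described above can be sketched as follows. This is an illustrative sketch, not the project's actual code: the 40-hour week and 52-week year follow the talk, while the 5-day week used for daily pay is my assumption.

```python
# Sketch: convert reported pay to a monthly amount from its reported
# periodicity, as described in the talk. A miscoded periodicity is
# multiplied up along with the value.

def monthly_pay(amount, period):
    """Convert a reported pay amount to a monthly figure."""
    if period == "hourly":
        return amount * 40 * 52 / 12   # 40 h/week, 52 weeks/year
    if period == "daily":
        return amount * 5 * 52 / 12    # assumed 5-day week (not stated in talk)
    if period == "weekly":
        return amount * 52 / 12
    if period == "monthly":
        return amount
    return None                        # missing/other periodicity: set income missing

# A monthly salary miscoded as "hourly" is inflated by a factor of
# 40 * 52 / 12, i.e. roughly 173-fold:
inflated = monthly_pay(300_000, "hourly")
```

This is why a handful of "hourly" coding errors at the top of the distribution can dominate the simulated tax total.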
So these were the kinds of implausible values identified by manual examination of the data. Using the raw, untransformed primary pay values, outliers were identified as values more than one and a half times the interquartile range from either the upper or lower quartile. We used that rule because that's how outliers are often designated in box plots, but also because, when we checked against our observations, that's where we found the outliers to be. So we were able to set those outliers to missing, in addition to the missings that were otherwise identified. And we did those outlier identifications both by occupational class and by highest educational status attained, both of which were very good predictors of income. In Tanzania, approximately 10% of employment income was either missing or set to missing as implausible. There was also a need to do some further cleaning of covariates. We wanted to get the covariates as clean as we could, and that was fairly straightforward for things like gender, age, et cetera; it wasn't so easy for occupational class. I'm going to move forward. We tested four imputation methods: simple linear prediction, and three multiple imputation methods: predictive mean matching (PMM), and two variants of sequential regression multiple imputation (SRMI), which is sometimes referred to as multiple imputation using chained equations, or MICE. The two variants of SRMI that we used were SRMI regress and SRMI PMM, a predictive mean matching variant of SRMI. As I said, the imputation methods are regression-based. For simple linear prediction and standard PMM, this is an OLS regression model, the main variable of interest, primary pay, being continuous. The two SRMI approaches are predicated on sequential regression models; in our particular case, due to the patterns of missingness that we had, this was a combination of OLS and multinomial logit models.
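The 1.5 × IQR rule applied within occupational class (or, equally, educational status) can be sketched like this, assuming a simple list-of-records layout rather than the project's actual data structures:

```python
# Sketch: flag pay values more than 1.5 * IQR beyond the quartiles within
# each occupational class, and set them to missing (None) so they enter
# the imputation step alongside the originally missing values.

from statistics import quantiles

def flag_outliers_by_group(records, group_key="occupation", pay_key="pay"):
    """Set implausible pay values to None using the 1.5*IQR box-plot rule."""
    groups = {}
    for r in records:
        if r[pay_key] is not None:
            groups.setdefault(r[group_key], []).append(r[pay_key])
    for r in records:
        pay = r[pay_key]
        if pay is None:
            continue
        vals = groups[r[group_key]]
        if len(vals) < 4:
            continue  # too few cases to estimate quartiles reliably
        q1, _, q3 = quantiles(vals, n=4)
        iqr = q3 - q1
        if pay < q1 - 1.5 * iqr or pay > q3 + 1.5 * iqr:
            r[pay_key] = None  # implausible: treat as missing for imputation
    return records
```

Grouping by a strong income predictor matters here: a salary that is an outlier overall may be perfectly plausible within a high-paying occupational class, and vice versa.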
The essence of multiple imputation is that you produce multiple datasets, one for each imputation. For our purposes we used 50 imputations and, for the SRMI approaches, 100 iterations for each of those 50 imputations, which gives us lots of computational challenges. What do we find in Tanzania? Let's look at a kernel density plot. This is the plot of the observed cases; pay is in US dollars here, actually, just for consistency, because we were looking at comparisons between two countries. So that's the observed. This is what you get with simple linear prediction; this is the mean, by the way, of the simple linear prediction, because there's only one imputation there. For the multiple imputations, this is a mean of the 50. We've got the simple linear, then the PMM, then the SRMI regress, and then the SRMI PMM, which you'll see, that's the yellow, goes more or less where the non-SRMI version of PMM goes. That's all very well looking at these kernel density plots, but what does it mean in the real world? If we move on to the results for Tanzania, this is looking just at the simulation of direct taxes. Before any adjustment, we were simulating 493.2% of the direct taxes reported in administrative data for the year. Before we did any multiple imputation, we did some constraining of the various income variables to the 99th percentile, including constraining earned income to the 99th percentile within the different occupational classes. That brought us down to simulating 167.1% of reported direct taxes. Here is what you get as a result of the four methods of imputation. You can see they're all pretty similar. I suppose the standout, if you're going to call it a standout, is SRMI regress, which gives us a slightly higher figure, but I think there's not much to choose between them.
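The "mean of the 50 imputations" used for the kernel density plots can be sketched as follows; the nested-list layout is illustrative, not the project's actual code:

```python
# Sketch: forming the single "mean" dataset from M completed (imputed)
# datasets, as described for the kernel density plots (M = 50 in the talk).
# Each inner list holds one imputed pay value per observation.

def mean_impute_dataset(imputed_datasets):
    """Average the imputed pay value for each observation across the
    M completed datasets."""
    m = len(imputed_datasets)
    n = len(imputed_datasets[0])
    return [sum(ds[i] for ds in imputed_datasets) / m for i in range(n)]

# e.g. three imputations of two observations:
averaged = mean_impute_dataset([[10, 20], [12, 22], [14, 24]])  # [12.0, 22.0]
```

Note that this averaged dataset is a convenience for plotting and for running the model once; as discussed at the end of the talk, proper inference requires running the model on each imputed dataset separately.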
We wanted also to test this approach in South Africa, because there we've got a dataset, the National Income Dynamics Study (NIDS), whose income data has been tested by a lot of different researchers. Indeed, the whole survey is conducted by SALDRU at the University of Cape Town, and it's thought to be of pretty good quality, so it becomes an interesting case to test. We have found it to perform well when looking at direct taxation in South Africa using SAMOD. How did we do it? We created artificial missing categories. We did that by setting a random number against every observation, dividing the observations into 10 deciles based on the random number, and then creating 10 datasets, imputing each of those 10 datasets with a different decile set to missing, and then adding all the imputed datasets together at the end to create a completely synthetic dataset of incomes that we could set alongside the actual incomes. When we did all that, and I've summarised this very quickly, this is what the kernel density plot looks like, and this is what the scores on the doors looked like. We used simple PMM here; it was performing quite well before, and it actually takes five minutes to run, as opposed to five hours, on our machines. This is the percentage change: column A uses the original employment income that was actually recorded in NIDS, column B uses a totally imputed dataset on the NIDS cases, and C is the percentage change. What you can see, and I think this is quite good, is that if you look at direct taxes, 80% of direct taxes are captured by a totally imputed dataset, because the imputation preserves the distributional differences. And if you look at the benefits, the benefits simulated using a completely imputed dataset were very similar to those simulated using the actual dataset, which I think is quite persuasive.
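The artificial-missingness design described above can be sketched as follows; the function names and data layout are illustrative, not the project's actual code:

```python
# Sketch of the NIDS validation design: assign a random number to each
# observation, cut into 10 deciles of that random draw, then build 10
# copies of the income vector, each with a different decile set to
# missing before imputation. Imputing each copy and stacking the imputed
# deciles yields a fully synthetic income vector to compare with the
# observed one.

import random

def decile_labels(n, seed=0):
    """Return a decile label (0-9) per observation, based on a random draw."""
    rng = random.Random(seed)
    order = sorted(range(n), key=lambda i: rng.random())  # random permutation
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * 10 // n
    return labels

def masked_datasets(incomes, labels):
    """Yield 10 copies of incomes, each with one decile set to None."""
    for d in range(10):
        yield [None if labels[i] == d else v for i, v in enumerate(incomes)]
```

Because every observation falls in exactly one decile, stitching the imputed deciles back together covers the whole sample exactly once.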
I'm going to conclude now, because I know that otherwise I'll be in trouble. I think one of the things that we've skipped over is the fact that we've created these imputed data and we've created the mean with which to test the model, but actually each imputation produces a dataset which is equally as good as any other. Strictly, in order to properly compute standard errors and confidence intervals around these estimates, we need to run the model 50 times, and that's part of the next phase. You can do that, you can call it from Stata, and we will be doing that. But one of the problems and dilemmas of this whole approach is that, yes, it seems on the face of it a good thing to be undertaking these income imputations, but how in practice can people use the model? If Remy, for example, in Zambia wants to use the model, he doesn't really want to call it 50 times from Stata every time he wants to test out a new policy, and it's a dilemma, because the literature is silent on what you can do with multiple imputations used in this way; they're not usually used like this. OK, thank you very much.
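Combining an estimate across those 50 model runs would follow Rubin's rules. A minimal sketch, with illustrative numbers and a layout of my own (not the project's code):

```python
# Sketch: Rubin's rules for pooling an estimate (e.g. simulated direct tax
# revenue) across M model runs, one per imputed dataset, to get a point
# estimate and standard error.

from math import sqrt
from statistics import mean, variance

def rubin_combine(estimates, within_variances):
    """Return (point estimate, standard error) under Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                 # pooled point estimate
    w_bar = mean(within_variances)          # average within-imputation variance
    b = variance(estimates)                 # between-imputation variance
    total = w_bar + (1 + 1 / m) * b         # total variance
    return q_bar, sqrt(total)

# Illustrative: three runs with sampling variance 0.5 each.
point, se = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what the single averaged dataset throws away, which is exactly why one run on the mean dataset understates the uncertainty.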