Hello everyone, welcome to this UK Data Service workshop on Exploring Educational Inequality Using Census Microdata. Your presenter today will be Nigel from the UK Data Service, based at the University of Manchester.
Hi everyone, welcome to this workshop. You had some work to do to be able to access the census microdata. Don't worry if you didn't have a chance to complete that, because we have the files and you can access them later on. What I'm going to cover first of all is a bit about the census, as we look forward to the 2021 Census data being available, then talk a bit about census microdata, and then focus on looking at educational inequality.
The census was carried out in England and Wales in 2021. There was real encouragement to complete the forms online, and the aim is to deliver the outputs in 2022-23, so much faster than previous census data. The types of data that will be produced are aggregate tables, geographical boundaries, microdata and flow data.
Just to say a little bit more about those: there'll be a standard set of univariate tables for all the variables at different geographical levels. There's a promise of two- and three-dimensional tables, so different combinations of variables, and the order they come out in is currently being consulted on. There's also a promise from ONS of a flexible table builder so that we can develop those tables ourselves. Those will also be available through the UK Data Service, with its own interface that is being worked on at the moment. Within all of those tables there will be controls to avoid statistical disclosure, so to protect individuals and groups of people.
Looking at the geography of the census, it will cover countries. The Northern Ireland census will come a bit later and Scotland is running its census next year, so the UK-wide data won't be available until those two countries finish theirs.
But there will be separate country data, there'll be regional data and local authority data. There'll also be electoral data, so local authority, parliamentary constituency and ward. I'm not clear about the administrative geography, I haven't looked into that yet, but we will at least have local authority, ward and health district. I expect we will get areas like local enterprise partnerships and mayoral areas over time. And then finally the one that I've tended to use myself when I've used census data: output areas, which are our building blocks. The smallest output area typically has around 100 households, so 100 to 300 people. Those are built up into lower-level super output areas and mid-level super output areas, and they're contained within local authority areas. The challenge of using the smallest areas is the risk of being able to identify individuals, so you would tend to only get univariate statistics for those areas.
Microdata is a set of variables linked to a person. For those of you who've opened the file, you'll see there are around 80 to 90 variables in there covering aspects of individuals and households. The two files that are available for 2011, and will be available for 2021, are at region and grouped local authority level, and each is a 5% random sample of the census data. There's also going to be, for the first time, a household dataset released under the End User Licence. This will be a 1% sample of all households. And, as currently, there will also be a secure access household dataset, which is a 10% sample.
And finally there's flow data, which captures origin and destination data, particularly used for looking at internal migration, commuting, student residence, etc.
So what are the benefits of using census microdata and what are the limitations? First of all the benefits: there are lots of categorical variables covering demographic characteristics.
It's a large dataset, so there's lots of power for multivariate analysis. And it's relatively easy to do comparisons between local authorities and regions: you can select out, for example, one or more local authority areas and look at how those compare to the broader region or to a country. And it's a component in many studies, so you can add it to other types of analysis using census and other data. It can help answer the big questions, which may then complement more local studies or more detailed studies using survey data.
One of the limitations is that there's no continuous or scale variable, so it's difficult to support things like linear regression. It is very much a categorical analysis, so today we're looking at logistic regression. And there is no information on some topics that we might be interested in in our work: there's nothing on crime, and nothing on income. And the geography is relatively limited compared to census tables. The lowest level of geography is grouped local authority. I think in England that equates to around 260 grouped local authorities, where there are 320-something local authorities.
So, beginning to focus on the census microdata that we're working with. The file was previously called the Sample of Anonymised Records and is available for 1991, 2001 and 2011. The file we've got is a 5% sample with 2.8 million records, and the kind of information that's held there is sex, age, race and migration, so when somebody came to the UK or whether they were born in the UK; languages spoken; passports held; identity, so the sense of national identity, which is a fairly recent one; religion; social class; and stuff about property, tenure and household living arrangements. We've not seen the results yet, but this time, for the first time, we asked people about their sexuality and gender identity. Yeah, sure.
I presume it was included because there's a consultation about what data researchers would like to have access to, and that was identified as one of the key ones and agreed. Lawrence has asked about the codebook: it is in the documentation for the study. So if you've gone on and downloaded the file, the codebook should be there in the documentation folder.
This file is available under the End User Licence, so you simply need to be registered with the UK Data Service and you can download it. We're attempting to use this practically with SPSS, Stata and R. You have sets of code that were issued to prepare the data, and there's also a set of code to run the analysis with.
First of all, let's think about the measure. I identified educational inequality, and the measure we're using is a population-wide measure. It's based on the individual responses, which include highest level of qualification achieved. It goes from no qualifications to degree or higher. There's also an "other" category, which may link to overseas qualifications. So the levels are no qualifications, GCSE, A level, apprenticeship (which doesn't necessarily sit at a particular level) and degree or higher. For the purpose of this I operationalised that as people having a degree, so we're looking at the characteristics of people who have a degree. To support that I've excluded anybody under the age of 25. Given the age banding was in five-year bands, that seemed the right place to draw the line.
There is also a household measure of educational deprivation, which is that no one in the household has level 2 qualifications. I'm not sure where those sit now, I'm not quite up to date with the numbers, but basically what was GCSE A* to C or equivalent. The literature, which I'm sure many of you are familiar with, suggests a number of things that might be associated with educational inequality.
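The degree measure described above (keep people aged 25 or over, then flag those whose highest qualification is degree or higher) can be sketched roughly as follows. The workshop scripts themselves are in SPSS, Stata and R; this Python version is only an illustration, and the field names, band labels and category labels are made up rather than taken from the microdata codebook.

```python
# Hypothetical records: field names and category labels are illustrative,
# not the variable names or codes used in the census microdata file.
records = [
    {"age_band": "16-24", "qualification": "Degree or higher"},
    {"age_band": "25-29", "qualification": "Degree or higher"},
    {"age_band": "30-34", "qualification": "No qualifications"},
    {"age_band": "45-49", "qualification": "A level"},
    {"age_band": "60-64", "qualification": "Other"},
]


def band_lower_bound(age_band):
    """Return the lower age of a band such as '25-29' or '85+'."""
    return int(age_band.rstrip("+").split("-")[0])


def add_degree_indicator(rows, min_age=25):
    """Keep people aged min_age or over and add a 0/1 has_degree flag."""
    kept = []
    for row in rows:
        if band_lower_bound(row["age_band"]) < min_age:
            continue  # under-25s are excluded, as described in the workshop
        flagged = dict(row)
        flagged["has_degree"] = int(row["qualification"] == "Degree or higher")
        kept.append(flagged)
    return kept


analysis_rows = add_degree_indicator(records)
share = sum(r["has_degree"] for r in analysis_rows) / len(analysis_rows)
print(len(analysis_rows), share)
```

Because age is only available in bands, the cut-off has to fall on a band boundary, which is why the sketch compares the lower bound of each band with the threshold.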
So we're going to try with Mentimeter again, thinking about what aspects of an individual's characteristics you think would be associated with having a degree. Okay, so this is generating a word cloud as you type in. It's quite interesting. A lot of the literature does talk about parental education. That's one of the things we don't collect in the census data. We collect education at the level of the household, so you could derive it, but you wouldn't know it when people are living in different households. There's some stuff about policy as well, which again we would need to bring in. Neighbourhood effects is interesting. There are some other ways of linking the geography, through things like the Index of Multiple Deprivation at LSOA level, but there's also a classification of output areas into quite detailed groups. It doesn't collect distance to school. We tend to use socioeconomic class as a proxy for income because income's not there, and again it doesn't collect anything on migrant status, but that's quite interesting. You can see that gender, race, religion and parental education figure quite significantly in there. There are some more detailed things which we maybe wouldn't be able to get out of census data. Okay, so thanks very much for that.
So we've got a number of variables. I just want to now see which way you think those will be associated with having a degree. So would, for example, coming from a higher socioeconomic class, having higher socioeconomic status, be associated with being more likely to have a degree? Again, apologies to those of you who are struggling with Mentimeter, but the next one is asking you to look at some selected associations, and there is a scale on here. Just drag the scale across according to who's more likely to have a degree. To the left you're strongly disagreeing, to the right you're strongly agreeing.
Okay, so older people are around the middle, moving around a little bit; females just below the middle; people born in the UK... this is shaping up. At the moment you all seem to agree that coming from a higher social class means you're more likely to have a degree, and that people who own their homes outright are also more likely to have a degree.
I'll leave that screen up for a minute, but thinking about presenting this, one of the areas I wanted to think about a little bit is how these different categories may or may not interact, and how effective a measure they are. Just to pick out one: looking at white Irish people, there's lots of evidence that older white Irish people didn't include their nationality on census returns, maybe linked to stigmatisation of the Irish during the 80s and 90s, maybe not, but that therefore the white Irish who dominate in the census tend to be younger people who've come more recently. Another factor is thinking about age and how age interacts. If older people are more likely to have a degree, then populations who have more older people are also more likely, and younger populations are less likely to have a degree.
But I think we can hold that there and say: higher social classes, we seem to have fairly strong agreement; outright homeowners, quite a few around the middle; older people, females, people born in the UK, Indians and white Irish; and a kind of low expectation that black Africans will have a degree. Okay, Jeanette's made the point, which is very relevant as well, that some people treat ethnicity as the same as nationality, so though their background may be Irish, they're British citizens now.
Okay, so we'll move on. I'm not going to make massive use of Mentimeter throughout this, so it's not going to be every other slide, for those of you who might be struggling to use it effectively.
So I'll move on to just talking a bit about the scripts. The idea of providing the scripts was to enable us to do the workshop, of course, but also to enable you to tailor this type of analysis in the future. In my use of census microdata I was particularly interested in housing and in housing deprivation. On this dataset there are four measures of deprivation, which are yes/no categories: housing deprivation, educational deprivation, employment deprivation and health deprivation. All of those are at household level, so for example the health deprivation measure means that one or more people in the household have poor health, a limiting long-term illness or a disability. You can use those categories, and you can use a lot of the other categories, in the way we've structured this. So if you want to take the way this has been done, you can apply it to other measures within the dataset, and you can obviously include different types of variable. The bits you would need to tailor in the script are to identify your working directory and to get the file from the location where you're using it, so you'd need to amend those in Stata and R, and the lines that open and save files in SPSS.
Okay, so let's move on to a bit more about the code. In terms of the outputs, I've used the standard output window for SPSS and Stata, but in R I've manipulated it because I found it impossible to read, so I'll explain those as they come up. As well as producing tables between having a degree and the different variables selected, I've used the chi-squared test and Cramér's V to see whether there's a significant relationship, a significant difference, and the strength of the association. Those outputs should all come from the scripts that you run as we go through them. All of the regression outputs are referenced to the first category of each variable, so for example, in terms of the base for ethnicity, there's no output in SPSS and R, I think, for the
reference category, whereas Stata does produce it. In R I've commented a sink command, which will just save this to a file: sink in R with a file name in quotes inside the brackets creates a file, and sink with nothing in it closes that file. I've manipulated the regression outputs just to take out the coefficient label, odds ratio, lower and upper confidence limits, and the p-value, and those are written out to a CSV file in your working directory. So those are the kind of scripts that you'll be working with.
Just to summarize what you were doing in the scripts that prepared the data: first of all you access the file, so part of that is signing up for it with the UK Data Service, putting a purpose for it, and then downloading and extracting the files you want. Then there's a section on recoding variables, labeling variables and creating the aggregate file. Richard has asked if the slides will be made available. They will be, as well as the scripts, on our event site. So that's the process for creating the data. As I said, you can slot your own variables into that: you can slot in your own dependent variable and your independent variables.
So now we're at the point where we need to get going with this. I think it's probably worth giving ourselves a few minutes just to get sorted out. This address, github.com etc etc, is available and has the different script files and also the aggregate dataset if you didn't manage to create your own. Okay, so hopefully you're getting the files you need, and I think for this part it's over to you quite a lot. We can use the chat if you have questions; don't flag them up in the Q&A. We'll pick them up as we go along. The first stage is to look at the association. The two variables where you really do need to pick the version you're going to use are age, because it's there in 5, 10 and 20 year bands, so you're going to have to use one of those (in the script I used a five-year band), and social class, which is an
8-, 5- and 3-category measure of occupational social class, or NS-SEC; again you'll need to pick one of those. So you should have the data, you should have the script, and you should be starting to run it and having a look at the outputs to see what you're getting. In the script I've staged the regression, so I've got one model with a few variables to have a look at first, and then a second one. As you begin to move towards running that first model, maybe you could think about using this to talk about the associations you find. So again it's Mentimeter, just to have a look at what different people are finding and how you're interpreting the results.
So we've got a significant association with age. I'm not sure I get the second bit of that: are older people more likely to have a degree or less likely to have a degree? That would mean that populations that were older would be less likely to have a degree, so maybe migrant populations, who tend to be younger in some groups, would be more likely to have a degree. I suppose what that's pointing out is that there will be significant variation within groups as well.
I suspect some of you will have moved on to the second model, and I think what I'll do is open that screen to see what kind of associations you're getting there. If you are still running the first model, then put them in here as well so we can see if there are other things coming. But if we look here, we had an association with disability, an association with tenure, with age, not with place of birth and people born in the UK. So I think it's country of birth, isn't it?
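For anyone who wants to see what sits behind the association outputs discussed in this session, here is a rough pure-Python sketch of the chi-squared statistic, Cramér's V, and an odds ratio with a Wald 95% confidence interval for a two-by-two table. It is only an illustration, not the workshop's SPSS, Stata or R code; the counts are made up, and the function names are hypothetical. For a single yes/no predictor, the odds ratio from the table is the same quantity a logistic regression would report against the first (reference) category.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def cramers_v(table):
    """Cramér's V: strength of association, from 0 (none) to 1 (perfect)."""
    stat, _ = chi_squared(table)
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(stat / (n * (k - 1)))


def odds_ratio_ci(table, z=1.96):
    """Odds ratio of the second row versus the first row, with a Wald interval."""
    (a, b), (c, d) = table
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Made-up counts. Rows: owns home outright (no, yes); columns: has degree (no, yes).
table = [[300, 100], [200, 200]]

stat, dof = chi_squared(table)
v = cramers_v(table)
odds_ratio, lower, upper = odds_ratio_ci(table)
print(round(stat, 2), dof, round(v, 3), odds_ratio, round(lower, 2), round(upper, 2))
```

This sketch applies no continuity correction and does not compute a p-value; in practice a statistics package would supply both, for example `scipy.stats.chi2_contingency` in Python.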