Okay, hello everyone. Welcome to this UK Data Service webinar giving an introduction to 2021 Census geography data sets. The presenter today will be James Crone of the UK Data Service, based at EDINA at the University of Edinburgh. Thanks Jill, and welcome to this webinar, everyone. In this webinar I'm going to give an introduction to 2021 Census geography data sets. In the first couple of slides we're going to look at the Census itself, just to give some background. The UK Census is an opportunity to ask the entire country questions about the population. In the UK it runs every 10 years. It was last held in 2021 in England and Wales; because of Covid, in Scotland it was actually held this year, in 2022. The questions asked are directed at both households and individuals. The household questions may be about the actual property that people are living in. Individual questions will be more about the sex of the individuals, their ages, where they work and the sort of job they do. The Census in the UK is carried out differently in the different nations because there are different UK national statistics agencies. In England and Wales it's run by the Office for National Statistics (ONS). In Scotland it's run by GROS, sorry, National Records of Scotland, and in Northern Ireland it's run by NISRA. In the old days the Census form was always filled in on paper: you would get it, on Census Day you would fill it in, and you would post it away. In recent years they have moved to doing it online, so you can now make your submission online. As I say, in England and Wales the Census was done in 2021, so over the last year or so ONS have been processing all the Census data submitted from the forms and building output data. What they produce from this data is univariate and multivariate tables of Census statistics.
So univariate data would be a single item of data about the people, and multivariate data allows you to explore the relationship between the different types of Census data. Once all that data has been processed, the Census statistics are output as tables with counts of people or counts of households, and the data is output at different levels of small-area output geography, the smallest of which is the Census output area. What you can see here is a map showing part of Edinburgh with a few Census output areas, and a table showing univariate statistics. The idea is that each of these output areas is a small area of geography with which you can associate different Census statistics. Now, output areas are a synthetic geography created purely for the publication of Census statistics, and we can give a bit of information about them. Because they are synthetic, output areas are built using a zone-building system. In terms of how many people or households are within each output area, you have a minimum of 40 households and 100 residents, up to a maximum of 250 households and 625 residents. When they're building the output areas there is a design goal to make sure that the population within the output area is homogeneous, so that across the output area you're comparing like for like in terms of the population. A secondary goal is to minimize the amount of change between the 2021 output areas and the 2011 output areas, so that it's easier to make comparisons and you're not comparing boundaries that are massively different between 2021 and 2011. As you can see, the output areas cover a relatively small area, so in terms of buildings they typically form a street or a few streets together. Between the 2011 Census and the 2021 Census there have also been changes to the population: there's been a gradual increase in population size across the country as a whole and
changes in population density and the distribution of population as, for example, tower blocks have been demolished or new houses have been built. Within the output areas themselves there could also be changes to the homogeneity of the population, as there have been local changes. To account for this, and to make sure the output areas maintain their design criteria in terms of the numbers of households and residents and the homogeneity, they have to make minor adjustments to the output areas between 2011 and 2021. Across England and Wales as a whole, it's believed that 95 percent of 2011 output areas have remained consistent into 2021, and within the other 5 percent there have been different sorts of change: some of the 2011 output areas have merged into 2021 output areas, and some of the 2011 output areas have split into two or more output areas. This is just showing Leeds, for example. In blue you can see the 2011 output areas that have been merged together to create a single 2021 output area, and the orange shows where a 2011 output area has split into two or more parts to form separate 2021 output areas. The splitting might have happened where the population density has massively increased, so the output areas are now too large in terms of population and have to be divided into smaller, separate output areas. With merging, there could have been a population decrease, and in order to maintain the minimum level of population they had to be joined together. You can see a more detailed example of this if you look into one of those output areas in Leeds. This shows, for the same locale of Leeds, the output area boundaries in 2011 and 2021: at left you have the 2011 boundaries and at right you have the 2021 boundaries. You can see that in the 2011 boundaries
you have a single output area which covered the entire locale, whereas in 2021 that single output area is split into three 2021 output areas. The reason that's happened is that between 2011 and 2021 a housing estate was built on a brownfield site, bringing new population, so the population within that area has massively increased and they've had to split the output area into two parts. Also, at the top there's a tower block, which has quite a dense population and different population characteristics to, I guess, the terraced housing, so in 2021 they decided to create an output area just for the tower block itself. So you go from one to three, and you can imagine this happening right across the city wherever there have been changes to the population, and they've adjusted the output areas accordingly. But they don't purely use output areas, because of a thing called statistical disclosure control, which is there to ensure that you can't identify individuals from census aggregate data. There's a risk you could do that from output areas because they cover quite small populations: you could have only one person with a particular type of employment within the output area, so they could be identified as an individual. Because of that, certain census statistics need to be output for much larger areas. To do this, as well as the output areas, they have a thing called super output areas, which are much larger amalgamations of output areas. There are two layers of those: a lower layer super output area (LSOA) layer and a middle layer super output area (MSOA) layer. We've got some numbers here, so you can see they cover a much larger number of people and a much larger number of households: the lower layer is up to 3,000 people, the middle layer up to 15,000 people, and again 1,200 households for the lower layer
and 6,000 households for the middle layer. These are created by merging output areas together to create lower super output areas, and then merging lower super output areas to create middle super output areas, which you can see on the right here. We have a single output area; this would merge with five other output areas to create a single LSOA, and in turn those five LSOAs could form an MSOA, as here. The important thing is that OAs, LSOAs and MSOAs all nest within local authorities and they all nest within each other, so there's a perfect alignment. There are some minor terminology differences between England and Wales, Scotland and Northern Ireland. In England and Wales you have output areas, LSOAs and MSOAs. In Scotland you have output areas, and in 2011 you had things called data zones and intermediate geographies; the data zones are equivalent to LSOAs and the intermediate geographies are equivalent to MSOAs. In Northern Ireland the lowest geographies are called small areas, which are the output areas, and in 2021 they're going to produce super output areas for Northern Ireland, but just a single layer, I guess the equivalent of the LSOAs. Again, the Scotland census only happened this year, so the geographies for Scotland probably won't appear until next year, when they start to produce the actual census stats for Scotland. As well as being produced for the statistical areas, which they call statistical building blocks (the output areas, the LSOAs and the MSOAs), census stats can also be output for different non-census geographies, to enable the census data to be compared with other data which may only be published at those geographies. So you can produce census stats by local authority, by electoral ward or by parliamentary constituency if you want to look at electoral results within the context of census data, or health outcomes, depending on particular policy types. This webinar is really an introduction to geospatial basics; sorry, the topic
is census geography data, which is in the realm of geospatial data. There are four fundamental principles of geospatial data. There's vector and raster data; vector data consists of points, lines and polygons. What makes geospatial data geospatial is its reference to a spatial reference system, in our case the British National Grid, which locates those polygons or lines in geographic space. If everything's in the same spatial reference system then you can tie different data sets together. Lastly, each census output geography instance has an alphanumeric code called a geographic identifier, which uniquely identifies that particular instance. So you can see that this particular output area is here, and this one, in a different part of the UK, is different to that one, and the identifier allows you to tie data to that output area. A geographic lookup table is then a means of relating one type of geography to another. In terms of these geographic identifiers, I said they are alphanumeric codes, and you can see one here. UK census data uses the GSS coding system, which is the Government Statistical Service system of codes. If you're familiar with earlier census data, from 2001 and before, these tended to use hierarchical codes. You might have quite a long code of, say, eight digits, and embedded within it were the codes of the county, the district and the ward which the output area nested within, so you could use the code purely on its own to work out the hierarchy of that output area. This is not the case for GSS codes: they purely contain the type of geography and the unique ID of that instance, which you can see here. All the codes are nine characters in length; the first three characters are the entity, which is the type of geography, and the remaining characters are the unique ID of the instance within that geography. So, for example, you can see we have an E00126679 code: the entity code is the first three characters, which is E00, and the remaining digits are the unique ID of that instance. ONS produces a thing called the Register of Geographic Codes, which you can look up, and it will tell you what each of those three-character entities actually means in terms of the geography it relates to. So if we take our E00126679, with E00 as the entity, and look it up in the register, we can see that it refers to output areas, whereas if we take E07000118, the register tells us that E07 refers to non-metropolitan districts. So the GSS codes, the geographic identifiers, allow us to uniquely identify instances of geography.
In terms of census boundary data sets, as I said, they are a type of geospatial data. They describe the spatial footprint of an instance of a given geography and consist of one or more polygons, and each of those polygons is made up from points; nothing too hard about that. Because they are geospatial data, they will be provided in GIS data formats like shapefiles, which you might read in desktop GIS software like ArcGIS or QGIS, or, these days, when people use R and Python, you might ingest them into a data frame or something. But shapefiles are still fairly ubiquitous; they're like the CSV of the geospatial world. You also get different flavors of the boundaries. The ONS in particular produce their census output geography boundaries in an extent of the realm variant and a variant clipped to the mean high water mark. What we have here is a map showing some boundaries at the extent of the realm. The realm is the UK and the extent is how far it extends into the sea, so it's almost like the territorial waters. You can see what that looks like in terms of boundaries: if you're familiar with the Bristol geography, there are these two small islands here, and then the extent of the realm
flavor does this: you can see it extends away from the land into the sea and creates this bizarre feature, which is not very aesthetically pleasing but is correct in terms of the actual geography, whereas the clipped variant, clipped to mean high water, provides a much more familiar geography. In the mean high water variant you get to see all the river extents, and you get a much more aesthetically pleasing and familiar geography in terms of the coastline. So the general recommendation is that you use the mean high water version if you're producing visualizations or maps, because the viewers of your maps will be getting a much more familiar geography. But because it shows the rivers in quite great detail, it can cause problems if you are doing some sort of analysis of data across land and sea. You might have point data which you are trying to relate to the polygons, but because the clipped version has the river areas cut out, you may find your point falls inside the river rather than inside an actual polygon. So the advice is to use the extent of the realm versions if you're doing any sort of spatial analysis where you're comparing one geography to another. That's the difference between the extent of the realm and the clipped variants; they're normally just called clipped rather than clipped to mean high water. Beyond extent of the realm and mean high water, there's also a thing called generalization. Generalization is the process of taking what could be a very complicated polygon with thousands of points and stripping out some of those points, so that the feature is made up of fewer points but still retains the look of the polygon. That's simply to cut down the file size, which makes exchanging the data quicker, but it also means that if you're using the boundaries in something like a GIS, it's quicker for the system to draw the polygons, so it makes the data a lot more
responsive if you're creating a map from it or using a web application or something. ONS, for example, produce two or three types of generalization, including super generalized boundaries, which have very small file sizes but whose geometry could be quite distorted. Again, the advice is to use generalized boundaries for visualization purposes, where you want to cut down the file size and still get quite responsive maps, but to use the ungeneralized data when you're doing any sort of spatial analysis, because you want to make sure you're not creating problems when you try to compare other data sets; you want the data to be as accurate as possible for analysis. So those are polygons, and then there's another thing called centroids. A centroid is literally just a point, so it's a simplification of the census geography from a polygon to a point. There are two types of centroid: geometrically weighted and population weighted. The geometrically weighted one is simply a spatial average of all the points that make up the polygon, so it could be a center of mass, whereas the population weighted point is based on the underlying population, and these two could be in different positions. Because census data is based on populations, population weighted centroids are the norm. You could use centroids for simple analysis: say you want to find the distance from a road to some output areas, it would be quicker just to use the output area centroid points, because then you could relatively quickly compute the distance between the road and those points rather than having to consider the entire set of coordinates that make up the polygon. But mostly centroids are used for georeferencing purposes, where you might want to relate boundary one to boundary two and you find that boundary one does not nest perfectly within boundary two. So rather than considering the full geometry of the
polygons in one versus two, you could just use the centroids of geometry one and then do an overlay against the second one, which would not be exact but would allow you to make some sort of lookup in order to map one geography to another. And then census lookup tables: in a lookup table you have some input value, which you look up, and it outputs an output value. Here are some examples. We could have a 2021 output area and a lookup table that maps it to the 2021 LSOA, the 2021 MSOA and the 2021 LAD, so we look up from output area to LSOA to MSOA to LAD, which are our standard census output geographies. Or we could have a 2021 output area and a lookup to care boards or NHS regions. Or, more usefully, we could have a lookup from 2011 output areas to 2021 output areas, which will at least allow us to do some sort of comparison of 2011 data to 2021 data. Then there are postcode directories, which are produced by ONS and are also available through the UK Data Service. These are a special type of lookup table for postcodes. Because postcodes are quite regularly used in other types of survey, and a lot of data is georeferenced by postcode, it's useful to find out which census geographies they relate to, so you can then add context to your postcode data. So you could have a postcode directory, which could be produced for this year or this month or whatever, with a mapping to output areas, LSOAs and countries. This is just an example of a geographic lookup table, and again you can see it's got these nine-character geographic identifiers: identifiers for output areas, which have a lookup to identifiers for LSOAs, which map to identifiers for MSOAs, which in turn map to identifiers for local authority districts. Lookup tables are just a way of mapping between different types of geography. So, in terms of plans for the release of 2021
census geography data sets (that's boundaries, centroids and lookup tables) through the UK Data Service: currently the UK Data Service provides access to our database of boundaries, which goes back to 1971, so we have 1971, 81, 91, 2011, and we are in the process of adding 2021 boundaries. The Boundary Data Selector allows you to make sub-selections from our database and then choose the output format you want the data in. For example, you could select all the output areas for a suburb of London and extract them as a shapefile or as a MapInfo file. Easy Download just provides pre-canned data. It doesn't allow you to make sub-selections; you can simply download all the boundaries, all the output areas, say, for England, for Wales or for Scotland, as a shapefile or a TAB file or whatever. Through Easy Download we also provide access to the centroids and the lookup tables. The Postcode Data Selector is a bit like the Boundary Data Selector but for postcodes. It allows you to make sub-selections from postcode directories, so you can select all the postcodes within the Nottingham NG9 postcode district, and it allows you to pick which types of geography to look up to. As I said, my team, so myself and my colleagues at EDINA, are in the process of adding new census geography data sets to the UK Data Service applications. We take the data from ONS; ONS have been releasing them since about August and they're still releasing the boundaries as they go, but we are also generating some of the higher geographies from the output areas ourselves. We'll add these to the applications in the next few months or so. We have to do some minor updates to the applications and also to the interfaces to support the new 2021 data. Also, when you use our UK Data Service applications there's a pop-up questionnaire which asks you to provide feedback on the applications, and over the past year or so we've had quite a lot of feedback that people can't
tell what sort of data set they're downloading before they download it. So we're going to try to add some sort of preview functionality to the Boundary Data Selector, which will allow you to see a visualization of the boundaries before you download them, along with the sort of attributes they have, so you can check whether the boundaries are compatible with the data that you want to join to them. We're also going to try to add some additional geospatial data formats like GeoPackage, which is becoming more popular as an alternative to shapefiles. This should be happening in the next few months or so. The latter part of the webinar, having covered the data itself, is just to provide some examples of how you could use the boundaries for visualizing census data. People are probably quite familiar with using boundaries for visualizing census data, but the idea is that you have some sort of aggregate statistics, which here are by Scottish council area. We have the total number of males per Scottish council area and the number of males who are employed in manufacturing, and we've calculated a proportion in the third column, which tells us the proportion of males within each council area who are employed in manufacturing. Then we have a set of boundaries for the same geography as the stats, so these are Scottish council areas. The idea is that you join your attributes to your boundaries using the common geographic identifier, and when you do that each boundary then has that stat. So we can tell that in Edinburgh about 5.27% of males are employed in manufacturing, whereas out in Fife it's 15.55% and in East Lothian it's 8.47%. But this requires you to actually read the values, which is why you have the type of map called a choropleth map. All you do in a choropleth map is shade the polygons according to the actual
statistic that is attached to that polygon. We can see a choropleth map here, showing all of the UK. Darker areas are basically higher percentages of this particular census stat, which in this case is the percentage of people employed in agriculture, forestry and fishing. Perhaps not surprisingly, the higher proportions are in rural areas, particularly the Scottish Borders and the middle of Wales, and there are clearly fewer people employed in agriculture, forestry and fishing within the central belt of Scotland and southern England. Choropleth maps, which are based on boundaries, provide an easy way to visualize how a measure varies across a geographic area, or to show variability within a region. Because, as I said, the census data is output at different geographies, for which you have different types of boundaries, you may be able to map the same census variable at different levels of geography, as here. On the left you have the census statistic mapped at local authority level, whereas on the right we have that same variable mapped at output area level. At the higher level the picture is quite smooth and you get general sorts of trends, whereas at the more detailed level you can obviously see more of the local nuances in the data. But because the polygons are smaller you tend to get some noise, so it may be less interpretable at this level if you're sticking it in a report or something. Also, because of the disclosure control issue we mentioned earlier, you may find that not every variable is available to be mapped at both levels; it just depends on the variable in question. Now, there are some problems with choropleths that you should be aware of. One is that a choropleth map tends to imply that the population is distributed uniformly across the polygon. Again, if we go back to our area of Leeds, we've
got the boundaries shown in red, but you can see, if you look at the aerial photograph underneath, that there are large areas of this polygon where there are no people actually living. For example, there's an area of green land here and a churchyard or graveyard here, so you couldn't say that the population is uniformly distributed across that polygon. There's also another problem, which we don't have time to go into in detail here, called the modifiable areal unit problem. This affects census boundaries, especially if you're looking at change over time, because you may not be comparing like for like if your polygons in 2021 are different to your polygons in 2011. If you've got two maps side by side showing boundaries for 2011 and 2021, then because of the change in the actual polygon geometry you're not comparing like for like, so that's something to be aware of. To get around the first issue, what we've seen in the last 10 or so years, especially on a site called DataShine, is to take those choropleth maps and mask them using another polygonal data set, buildings for example. These are masked choropleth maps. In some ways that does away with the tendency for choropleths to depict the population as being consistent across the entire zone, because here the data is only shown where there are buildings, so it provides a more realistic depiction of the population structure. If you've not already seen it, the DataShine website is really great; there's DataShine Scotland and there's one for England and Wales. These were both produced for 2011 data, so the expectation is that they will presumably be produced for 2021 data for England and Wales and for 2022 data for Scotland. Another form of visualization that you can produce from boundaries is the cartogram. In a cartogram, instead of drawing the polygonal areas based purely on the land
area, you draw the polygon based on the actual variable you're mapping, so you distort the geography. If the variable is, say, the percentage of people employed in manufacturing, a polygon that has a large percentage of manufacturing will be distorted to be relatively larger than a polygon with a small percentage of manufacturing, producing a different type of visualization. There are various types of cartogram: non-contiguous, contiguous or Dorling cartograms. The big advantage of cartograms is that they help when you have large populations mapped to very small areas, as here, for example. This is a map showing people who are in a mortgaged property, so they own their home essentially, and the data is across Scotland. Within the central belt, because the areas of Glasgow and Edinburgh are so small, they're quite hard to see in a choropleth map versus the large rural areas. If you map that same variable using a cartogram, those small urban areas are a lot more prominent, and it's in some ways a lot easier to see the pattern and the variability within the data. Cartograms have been used for census data quite widely. There's an excellent book that was published for the 2011 data called People and Places, and it's full of these cartograms. It takes all the various census topics for 2011 and basically uses cartograms to show them. If you wanted to, you could also create these sorts of cartograms using a desktop GIS like QGIS, which has plugins to do this sort of thing. We're coming to the end of the webinar in terms of the formal talking. There are three resources you might want to look at to get more information. Professor David Martin and his team at Southampton were responsible for building the AZTool, which is the piece of software that was used
to create the output areas themselves, and David Martin and his team have produced some great resources describing that tool. David has also written about the history of the census, so David's blog post, which is on the UK Data Service Data Impact website, is a really great place to look for the history of the census, sorry, the history of the geography of the census in the UK. The AZTool page from the Southampton folks is great if you want to get more into the details of how output areas are built and the whole topic of automated zone design. And then ONS have made this excellent set of resources called the ONS geography statistics training course, which provides some great tutorials about how to analyze data and use geography, both in a desktop GIS and programmatically using R and Python, so that's something you might want to look at if you want to go down that route.
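As a small illustration of that programmatic route, here is a minimal Python sketch, using only the standard library, of two ideas from the talk: splitting a nine-character GSS code into its three-character entity prefix and its instance ID, and using a lookup table to add output area counts up to LSOA level. The function names, the output area to LSOA pairs and the resident counts are all invented for illustration. A real lookup would come from the ONS or UK Data Service downloads, and entity prefixes should be checked against the Register of Geographic Codes.

```python
from collections import defaultdict

# Entity prefixes mentioned in the talk. The authoritative list is the
# ONS Register of Geographic Codes; this dictionary is only a sample.
ENTITY_NAMES = {
    "E00": "Output Area (England)",
    "E07": "Non-metropolitan district (England)",
}


def split_gss_code(code):
    """Split a nine-character GSS code into (entity prefix, instance ID)."""
    if len(code) != 9:
        raise ValueError(f"GSS codes are nine characters long, got {code!r}")
    return code[:3], code[3:]


def aggregate_counts(counts_by_oa, oa_to_lsoa):
    """Sum output area counts up to LSOA level using a lookup table."""
    totals = defaultdict(int)
    for oa_code, count in counts_by_oa.items():
        # The lookup maps each output area code to its parent LSOA code.
        totals[oa_to_lsoa[oa_code]] += count
    return dict(totals)


# Hypothetical lookup table and resident counts, made up for this sketch.
OA_TO_LSOA = {
    "E00000001": "E01000001",
    "E00000002": "E01000001",
    "E00000003": "E01000002",
}
RESIDENTS = {"E00000001": 120, "E00000002": 310, "E00000003": 250}

if __name__ == "__main__":
    entity, instance = split_gss_code("E00126679")
    print(entity, ENTITY_NAMES[entity], instance)
    print(aggregate_counts(RESIDENTS, OA_TO_LSOA))
```

Because GSS codes are not hierarchical, the parent LSOA cannot be read off the output area code itself, which is why the aggregation step needs the lookup table. For real boundary files the same join on the geographic identifier would normally be done with a geospatial library rather than plain dictionaries.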