Welcome to this webinar on statistical disclosure control in the 2021 Census. We're joined by colleagues from ONS today. Sam Trace will be presenting. She's been at ONS for three years, working in statistical disclosure control and census outputs. Prior to that she worked in NHS research ethics, and she's also vice chair for research ethics on the MRes at the University of Portsmouth. She has a request: if you're a researcher who'd like to present on the research ethics course she's involved with, please contact her in the chat or via the contact details at the end. She's joined by Keith Spicer, who is the head of ONS statistical disclosure control. He has a very long CV and has been involved in a great deal of work in this area; I'll put a link in the chat later on to his book on research ethics in surveys. He was part of the original team that proposed the flexible table builder, which has become what is now called Create a Custom Dataset, I think that's the right way of describing it now. He will be in the background picking up any technical queries you have. So we'll go over to Sam now to present.

Thank you, Nigel. Good morning, folks. Actually, any kind of research you want to present to people from the ONS would be really welcome, so do get in contact; that would be lovely. Today we are on disclosure control for Census 2021: making the census safe. As Nigel said, please post questions in the Q&A rather than the chat as we go; Keith may be answering some of them as we go through. I'm not sure how much time we'll have at the end, but we'll see.

I've said "making the census safe", but what does that actually mean? It means we can't identify people in the data or learn things about them that we didn't already know. There are regulatory definitions of which bars we have to clear depending on which piece of data we're talking about, but that, for general purposes, is our definition for today. "SDC methods for Create a Custom Dataset" is a clickable link if you want to click through and take a look; hopefully you have already had a look at the Create a Custom Dataset tool. I'm slightly invested, so I think it's amazing, but how we got to being able to do automated disclosure control checks, and what we put in there, is the subject of this morning's talk.

So, the SDC methods. First, targeted record swapping: this affected the data that went into the system, identifying people and households at risk and swapping them with a similar record in the local area. (We are recording the session, Gargi; I don't know how we'll be distributing it afterwards, but it is currently being recorded.) Second, cell key perturbation: adding slight noise to the figures, making slight changes to cell counts. And third, we built some rules into the table builder which stop tables with many zeros, or low counts in general terms, from being released, so when you're using the system you will find out very quickly that some areas will be refused if they don't meet certain criteria.
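To make the swapping idea concrete, here is a minimal sketch of how targeted record swapping can work in general. This is not the ONS implementation: the risk score, the swap rate, and the variables used to match a "similar" household are all invented for illustration.

```python
import random
from collections import Counter

# Illustrative sketch of targeted record swapping (not the ONS method).
# Households whose key-variable combination is rare in their area are treated
# as higher risk; the riskiest are paired with a similar household elsewhere
# in the same local authority and the two geographies are exchanged.

def risk_score(record, area_counts):
    """Rarer combinations of area and key variables give higher risk."""
    key = (record["oa"], record["hh_size"], record["tenure"])
    return 1.0 / area_counts[key]

def targeted_swap(records, swap_rate=0.05, seed=0):
    rng = random.Random(seed)
    area_counts = Counter((r["oa"], r["hh_size"], r["tenure"]) for r in records)
    ranked = sorted(records, key=lambda r: risk_score(r, area_counts), reverse=True)
    for rec in ranked[: int(len(records) * swap_rate)]:
        # A "similar" partner: same household size, same local authority,
        # different output area, so the record only moves a short distance.
        partners = [p for p in records
                    if p is not rec and p["la"] == rec["la"]
                    and p["oa"] != rec["oa"] and p["hh_size"] == rec["hh_size"]]
        if partners:
            partner = rng.choice(partners)
            rec["oa"], partner["oa"] = partner["oa"], rec["oa"]  # swap geography only
    return records

households = [
    {"oa": "OA1", "la": "LA1", "hh_size": 2, "tenure": "rent"},
    {"oa": "OA2", "la": "LA1", "hh_size": 2, "tenure": "own"},
    {"oa": "OA3", "la": "LA1", "hh_size": 5, "tenure": "rent"},
]
targeted_swap(households, swap_rate=0.34)
```

Note that every household is still present after the swap; only its geography may have moved, which matches the general principle described in the talk: everyone who answers is still counted, just not necessarily in the same place.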
So, in detail: targeted swapping. Any record could be swapped, but rare ones are more likely to be. Swapping stays in the vague local area: we swap between OAs, or between MSOAs, generally within the same local authority, although there are a few cross-local-authority swaps. The point is simply to increase the uncertainty over any identification. The general principle is that everyone who answers the census is still present, just not necessarily in the same place. So it is always worth filling out that form (well, everyone has to fill out the census), and it will be counted somewhere, probably close to where it was, even if swapped.

Cell key perturbation: this is where we make slight changes to the figures. The method, to go into slightly more detail than is on the slide, is that we attach a number to every single record, and then from the numbers of the records falling in a cell we work out a perturbation. So if you get the same records in a cell, you will always get the same perturbation. The purpose behind it is to add uncertainty about any identification from a small count. We do also perturb zeros; we have to use a different coding method to perturb zeros, which is quite tricky, but it does happen. It makes it harder to add up totals to work out missing data, and this is all just to prevent disclosure without making the data less useful.

Then the SDC rules. If a table has too few records, you won't be able to get it, and there are specific rules for communal establishments and for households and individuals: you won't be able to get communal establishment datasets below MSOA within Create a Custom Dataset. I think there is one variable, residence type, which is effectively "does this person live in a household or a communal establishment", yes or no, which is very simple, and those are the rules built into Create a Custom Dataset. The rules are designed to prevent attribute disclosure, that is, finding things out about a person you didn't already know: too many low cell counts; dominance, where if nearly everyone falls in one category we couldn't allow you to guess that a particular person fits that category simply because of the huge amount of dominance in the table, so that would also be a potential fail; and building up a picture from repeat queries. All of those are built in, and this does result in some slight oddities. We've had plenty of user feedback on this already.

In brief: totals not adding up. In this wonderful town of Hobbiton, if you do a table by age you get one total; for the same town and the same residents counted by sex, but in a different table, you get a different total. These differences should be really slight and not cause any issue in terms of the validity of the research, bearing in mind also that the census produces estimates, not counts, which I think is an important point. Everyone likes a nice little sum that adds up exactly, but these are estimates, and it should be borne in mind that there is a slight margin of error, especially at low-level geography: if you move things up or down by one or a very small amount, it might add to that margin of error slightly, but it shouldn't be significant for research. It is just that we like things to add up neatly and we take that as a sign of having done things correctly. It is slightly annoying, but on the other hand it does protect the data, which is the good thing.
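Here is a minimal sketch of the cell key idea just described, assuming a simple record-key scheme: the key values, the modulus and the lookup table are all made up for illustration and are not the ONS parameters. It also shows why the same set of records always gets the same perturbation, while two tables built from slightly different sets of records can end up with slightly different totals.

```python
import random

# Sketch of a cell key perturbation scheme (toy values, not the ONS ptable).
KEY_MOD = 256

def assign_record_keys(record_ids, seed=42):
    """Each record gets a fixed random key, attached once at source."""
    rng = random.Random(seed)
    return {rid: rng.randrange(KEY_MOD) for rid in record_ids}

def perturb_cell(cell_record_ids, record_keys, ptable):
    """Perturbed count for one cell of a frequency table.

    The cell key is the sum of the contributing record keys (mod KEY_MOD),
    so the same records always give the same cell key, and therefore the
    same perturbation, whichever table the cell appears in.
    """
    count = len(cell_record_ids)
    cell_key = sum(record_keys[rid] for rid in cell_record_ids) % KEY_MOD
    noise = ptable.get((count, cell_key % 10), 0)  # toy lookup by count and reduced key
    return max(count + noise, 0)

# Toy perturbation table: (original count, reduced cell key) -> small +/- adjustment.
toy_ptable = {(1, 3): 1, (2, 7): -1, (3, 0): 1}

keys = assign_record_keys(["r1", "r2", "r3", "r4"])
print(perturb_cell(["r1", "r2"], keys, toy_ptable))        # always the same answer
print(perturb_cell(["r1", "r2", "r3"], keys, toy_ptable))  # different records, possibly different noise
```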
Now, the things the rules protect against. So, for instance: how old is Ermintrude? We can guess that the lovely Ermintrude is engaged in milk production as her main job, and so, from this table, she must be over 45. And how old is Danny? Danny could be any of these, but let's stereotype him as a swim teacher (why we would think a dolphin would do that, I'm not sure), so yes, he's got to be over 45 again. Gertie, on the other hand, is 26, so from this table she must work in a baker's shop. Now of course we can't be certain about any of those conclusions, because (a) they might have been swapped, (b) these figures may be perturbed, and even the zeros may not be actual zeros, and (c) there is error in the system as well: some of these records could be imputed, and we don't know for certain that everyone filled out their form correctly. All of that adds doubt to these conclusions. You can see we do get low counts in the table; I did have a one in there at one point, but now there isn't. We did have a question about why there are ones even in our low-level area tables, and that is simply because, as long as there is doubt about whether they are genuine ones, that is okay. Another thing to notice: there are structural zeros left in the table. You cannot be 0 to 15 and have an industry, because the census form doesn't ask you the industry question if your calculated age is under 16. Also, in drawing those conclusions I made some assumptions; I can't actually assume Danny is a swim teacher, possibly Danny is a farmer, who knows. So there is some element of private knowledge required to make those assumptions.

So yes, the table builder would probably suppress this table: it's very sparse, and there are attribute disclosures and all kinds of identification risk in it. But one of the oddities of these rules is that if you asked for the table with slightly more detail, you might actually get it; probably not in all areas, you'll probably get fewer areas in all likelihood, but you might get a table like this. Because in this one, if you ask how old Ermintrude is, she could be 45 to 64 or she could be 65 plus; we don't know, because the numbers have been split between the two categories. If you ask how old Danny is, again the numbers are split between two categories. And if you ask about Gertie at 26, she could be a bread baker or she could be in bread sales, rather than a bread shop worker generally; because the category has been split into two finer categories, we don't know where she fits. So sometimes, puzzlingly, you will get a table with more detail, and that kind of gives you a suggestion as to why. And again there is still a lot of doubt about these conclusions, because any of these figures could be perturbed. If you notice, this total is different from the table above; that is because you may have different records contributing. In the table above you only had two in these columns, and in this one you've got a total of three in the same columns. But do we know which table was perturbed? We don't, so we can't assume one is right and the other is wrong, or vice versa; we simply don't know.
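As a rough illustration of the kind of table-level checks described above (too few records overall, too many small counts, one dominant category), here is a minimal sketch. The thresholds are invented for illustration; they are not the rules actually built into Create a Custom Dataset.

```python
# Toy version of the release rules described above (invented thresholds).

def table_passes(cells, min_total=10, max_low_share=0.4, dominance_share=0.9):
    """cells: the (already perturbed) counts of a candidate table."""
    total = sum(cells)
    if total < min_total:
        return False                                   # too few records overall
    low_cells = sum(1 for c in cells if 0 < c < 3)
    if low_cells / len(cells) > max_low_share:
        return False                                   # too many small counts
    if max(cells) / total > dominance_share:
        return False                                   # one category dominates the table
    return True

print(table_passes([0, 1, 0, 2, 1, 0]))  # sparse table like the one above: refused
print(table_passes([12, 9, 14, 11]))     # healthier table: released
```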
So after doing all of this, we did wonder: we've put a lot of effort into making this safe, but is it actually? So we did an intruder test. We assembled a team of intruders to check the data and see if they could identify people. We actually had 50 people sign up; 26 took part; most were unsuccessful. We had a number who did not put in any claims at all: they tried and tried, they would go through the funnel of the system trying to identify someone, and the system would just say no at the right point. It did confirm, however, that there were risks with detailed classifications at low geography. I think I'm fine to say that, because it is what you would guess at initially. We changed a very small number of outputs in response to that conclusion, because obviously, given the experience of my team, we'd already made judgments based on that assumption, but it was nice to see it borne out by evidence and an empirical study.

To go into a little more detail on what we did in the intruder test: we allowed people to use the internet, so they could use publicly available information. That fulfils the terms of the Statistics and Registration Service Act 2007, under which we had to consider publicly available information rather than just privately obtained, and I think it also fulfils the EU Data Protection Directive, under which "account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person". So anything reasonable, just ordinary internet usage, was fine for this test. As I said, I hope to get a paper out on it, later next year maybe; it's already written, and if you're interested, by the time it's got that far we might circulate it.

I've just covered the table builder, but we also did a lot of tabular outputs. These were, well, kind of the second thing to come out after the initial results: table outputs, which are based on judgments. So, considerations: we worked out frequencies for the tables, that is, how many ones and twos, how many small counts, how much of any table would be tens or have more than 10 people, at various levels of geography. We put a lot of thought in, depending on what the output was, looked at the counts and at how much the other protections (imputation, swapping and perturbation) protected them, and then made judgments, speaking with the other stakeholders, the topic leads and topic experts in each category. We discussed the level of geography they wanted versus the level of detail they wanted in the classifications, and that's how we arrived at the topic summaries that have already been produced, which have been coming out since January this year and getting lots of press attention.
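As a sketch of the kind of frequency summary just described (how many ones and twos, how much of a table is ten or more), something like the following could be run over a candidate table design at each level of geography. The banding is invented for illustration.

```python
from collections import Counter

def cell_frequency_profile(cells):
    """Share of cells falling into each count band for a candidate table."""
    bands = Counter()
    for c in cells:
        if c == 0:
            bands["zero"] += 1
        elif c < 3:
            bands["1-2"] += 1
        elif c < 10:
            bands["3-9"] += 1
        else:
            bands["10+"] += 1
    n = len(cells)
    return {band: count / n for band, count in bands.items()}

# A detailed classification at a low geography versus a coarser alternative:
print(cell_frequency_profile([0, 1, 0, 4, 12, 2, 0, 1]))
print(cell_frequency_profile([25, 40, 18, 33]))
```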
Then we come to microdata. As UKDS users you may be very familiar with these: they are the record-level data releases, and there's actually quite a range. I did put in a link to the policy for social survey microdata; I can pop that link in the chat, though it will take me off the screen for a moment. We have a range of products which will be coming out.

First will be the public file. This is a one percent sample, with around 20 variables, and the geography is regional level. The level of detail in the classifications is low, so the risk level is low: I've got a little limbo man there, because this file really should not tell you anything about anyone. Equally, it's not very useful for research: it really is a teaching file, just for getting used to the variables, maybe using them in some software, and teaching people how to analyse data. That is the purpose of this file.

The safeguarded household one percent is new for 2021: region-level geography, around 60 variables, and I think this will be available through the UKDS. The risk level on these is medium; I don't know if you can tell that from the picture, but I was looking for good pictures for medium-level risk. Where we class things as medium, and Keith will probably have a more accurate definition than this, it is that with private knowledge you might be able to deduce something, but with public knowledge you shouldn't be able to. Yes, with safeguarded data you should not be able to identify somebody using the data plus any information already in the public domain, but you might be able to identify somebody using private knowledge: for example, if you knew somebody was in the data (which you won't in this case), or if you had some private knowledge of some of the variables, some of the information about that person. We don't have to take that into account to make it safeguarded, only to make it public. Thank you. The mere fact that it's a sample also really helps, because you can't be certain that any individual you might think of is in that sample.

We're actually doing the work on this one at the moment: the five percent regional file is safeguarded, individual level, with detailed variables. There is a lot of detail in the classifications, so you might get, say, much more detail in terms of age ranges, around 90 variables, and there's a link to the 2011 codebook. The risk level is medium; I do love the billy goats gruff, and of course the middle-sized billy goat is where this one sits in terms of risk. Again, you shouldn't be able to deduce anything from this using public knowledge, and of course everyone has signed up to not trying as well, which is one of the major protections on these datasets.

Then we've got the individual five percent grouped local authority file. Grouped local authority is a special geography created especially for this dataset: it groups up every local authority under 120k head of population. We did this for 2011, and the 2011 codebook is there as a guide. It is slightly different this time around, unavoidably, partly due to population increases and partly due to boundary changes; well, I don't think there are many boundary changes so much as a few local authorities we used in 2011 that no longer exist in that form or have changed name. But where possible we tried to make it the same as 2011 for comparative purposes. We're also hoping to get the geography available on the geoportal to make it easier to use; I can't confirm that at present, but we're hoping to do it. Risk level medium, so you've got the middle-sized elephant there.

Now, the secure individual 10 percent. Obviously, with 10 percent it's a large sample size, so there's a higher likelihood someone might be in there. It will be available through the Integrated Data Service, which is gradually being built as we speak, and it will be possible to cross-reference it with loads of variables; I find what they're doing very exciting. It will be linked to certain NHS outcomes, as well as, I think, births, marriages and deaths, straight away, and additional linkages will be possible as and when, as per the applications that come in. There are 120-plus variables (actually it might be more than that), and the risk level obviously is high, which is why it is very protected, and there will be processes to go through to judge applications.
Now, the secure household 10 percent: again, available through the Integrated Data Service, and we're hoping the first applications to get this data out will be maybe next year. Nearly 200 variables; there will be documentation once the dataset is released, and again there's a 2011 link as a guide, which you should be able to click through on your end. Again, it will be possible to cross-reference with a really wide range of other variables. This is very exciting data, although, yes, again the risk level, like this giraffe, is high.

Then we go to the secure 100 percent. Users must justify the need for this data, with proof of major policy impact. Again, this will be both household and individual, and it's not a sample: it's everything, or nearly everything I should say. This is the highest-risk data that we would allow out, but the protection therefore comes not from anything we've done to the data particularly, but from the situation in which it is held.

So, in terms of considerations: the user need, what we put into each dataset. We have just done the stakeholder engagement (I think Nigel has also put in something on the stakeholder engagement that we did); the right detail for the right dataset, because there's quite a range of datasets there, so what's appropriate for each, what constitutes high-level classifications, what is less detail. We are also going to intruder test it. This will be a smaller-scale exercise: we will get some people to try and find things out from the data once the dataset is generated, and there will be some additional protections in place. But we try as much as possible to do the bare minimum to this data: it's all about choosing the classifications, choosing the variables, and getting the geography in place, and then allowing the setting to maintain the safety of the data.

OK, that concludes the talk.