Hi, I'm Mark Elliott from Manchester University and the National Centre for Research Methods, and I'm also lead of the UK Anonymisation Network. I'm going to talk to you in this video about the basic concepts you need in order to understand statistical disclosure control. Briefly, I will talk about what a statistical disclosure is, and then we'll spend most of the time thinking about how a statistical disclosure might happen.

I'm going to talk to you about the relationship between three related concepts: privacy, confidentiality and disclosure control. Privacy is primarily about people. Confidentiality is primarily about data. Now, confidentiality could concern data about people, and often when we're thinking about it that's the case, but not necessarily. Privacy concerns a much wider range of things, for example who has access to my home; confidentiality is just about information and data. Disclosure control is a particular technical process which helps maintain the confidentiality of data, usually data about people.

So what is disclosure control, often called SDC? It's the practice of reducing the risk of finding people or other entities in data, which is called re-identification, and/or of associating data with a particular person or entity, which is known as association. Now, these two things sound quite similar, but actually they can occur independently. Often they occur together, and often re-identification leads to association, but it's actually the association which causes the disclosure, not the re-identification, as we'll see shortly.

Now, there's an important need to strike a balance between maximising the utility of the data, including meeting customer requirements, and managing the confidentiality risk. Think about these two things together as a constraint satisfaction situation with two competing constraints: one pulls you to maximise the utility of the data, the other to minimise the disclosure risk. Thinking about how to optimise between those is an important part of disclosure control.

Disclosure control is itself an active research area: it has its own conferences, journals and so on. Its subfields include disclosure risk assessment, which is going to be our focus for the rest of this video; disclosure control methods, that is, how you control the risk once you've identified it; the measurement of analytical validity, thinking about how we produce useful data despite the confidentiality constraints; and data environment analysis, thinking about the context in which data sit and the relationship between the data you're interested in and their environment. In principle SDC could be applied to any type of data, but typically we're talking about quantitative microdata and aggregate data. In recent years we've expanded out to think about business data as well as personal data, and about intentional data such as surveys as well as consequential data such as administrative data sets.

Okay, so now I'm going to set you a little exercise, and in a moment I'll ask you to turn the video off and think about, and maybe jot down some notes on, these questions. You're going to imagine that you are what's called a data intruder. What would you need to do in order to identify individuals within a data set that has been through an anonymisation process? And why might you do it? Those two questions are actually related, and there's a third question as well: in what ways might a disclosure happen without malicious intrusion?
So in the first situation we're imagining that there's a malicious intruder, and that's who you're mimicking. In the second situation we're thinking about how a disclosure might happen without any malicious intent. Okay, so stop the video now and make a few notes.

Okay. So you might have thought about the types of things that a data intruder might do in terms of technical processes, such as linkage; linkage, as we'll see, is quite important. You might also have come up with some ideas about the sorts of resources that an intruder might need to draw on in order to do the re-identification, such as data that are in the public domain, and again this is quite important. Those are the types of things you need to think about when you're doing scenario analysis, which is what this exercise is about.

What might your motivations be? Well, you might have focused on the informational content of the data: the data were interesting in some way. Or you might have focused on secondary motivations. It may be that your intruder is actually not interested in the data themselves but in the consequences of the breach; they may want to cause a problem for the government of the day or for the organisation releasing the data. Now, for many situations it's the second sort of attacker that is the problem, not the first. The reason is that the first sort of attacker tends to be interested in information about a particular person, and that's like finding a needle in a haystack. The second sort of attacker doesn't care what sort of person they find out about; they just need to find some information about somebody, reliably, and that can be anybody. That gives them a much bigger pond to fish in, so the risk associated with that type of attack tends to be more serious. This is useful in a way, because if you can cover that attack then you've probably covered the first one as well.

How might disclosure happen without malicious intention? Well, through spontaneous recognition. It might just be that you're a user of the data, you're looking at them, and you recognise somebody because they happen to have an unusual set of attributes and they're a person who's known to you. That's the notion of spontaneous recognition.

Now, I mentioned linkage earlier on, and statistical linkage of various different sorts is essentially the notion of re-identification playing out as statistical disclosure. So in this graphic we have an identification file, which consists of name and address and some other attributes. Then we have an anonymised file, the second, bottom bar in the graphic. You can see that it shares some attributes with our identification file, but it also has some other interview attributes, which we're calling the target variables. If this linkage is correct, then the target variables are revealed about the entity whose name and address we've identified. So that's straightforward linkage. Now, of course, all sorts of errors are possible in this process: you could have the wrong person, or the person you think it is might not be in the data at all. And indeed there are error processes in all data, so, like any analytical process, linkage has an error process associated with it. We have ways within the SDC community of taking account of those types of issues and of measuring risk in a variety of different ways, some of which I'll come to shortly.
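To make the linkage idea concrete, here is a minimal sketch in Python. The files and column names are invented for illustration, not taken from the graphic: an identification file containing names is joined to an "anonymised" file on the shared key variables, which attaches the target variable to a name.

```python
# A minimal sketch of a linkage attack, using invented example data.
import pandas as pd

# Identification file: direct identifiers plus shared key variables.
ident = pd.DataFrame({
    "name":     ["A. Smith", "B. Jones", "C. Patel"],
    "age":      [34, 51, 34],
    "sex":      ["F", "M", "F"],
    "postcode": ["M13", "M14", "M20"],
})

# Anonymised file: the same key variables plus a sensitive target variable.
anon = pd.DataFrame({
    "age":      [34, 51, 34],
    "sex":      ["F", "M", "F"],
    "postcode": ["M13", "M14", "M20"],
    "income":   [28000, 61000, 45000],  # target variable
})

# Joining on the shared keys attaches the target variable to a name.
linked = ident.merge(anon, on=["age", "sex", "postcode"], how="inner")
print(linked)
```

An exact join like this is the simplest possible case; a real attacker would have to contend with the error processes just described, for instance by using probabilistic rather than exact matching, and the match found may still be wrong.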
Now here's the second sort of problem, and it's not immediately obvious that there's a problem here at all. Here we have a simple aggregate table of counts. We're imagining a really interesting cocktail party where there are professors and pop stars, and let's imagine that I happen to know this table about the people at the party. What do you think the problem is with this table as a piece of data? Just stop the video now and think about that for a few minutes.

Okay, so it might be that you focused on the low paid pop stars: there are only five of them, and they look a bit lonely down in that corner, so perhaps you thought that was a dangerously low number. You might indeed be able to identify one, by saying "oh look, I know a lowly paid pop star, there they are". But you haven't learned anything, because in order to place them in that cell of the table you had to already know they were a low paid pop star. So you've disclosed no new information. The problem is actually the zero: the highly paid professors. Why is that a problem? Suppose I go around the cocktail party and hear somebody talking about the lecture they gave last week; then I've immediately learned something about them, namely that they are not highly paid. Similarly, if I hear somebody boasting about their income, "I earned 10 million pounds last year", then I've immediately learned that they are a pop star. So the zero allows us to make attributions given partial information, and that is a technical definition of disclosure. Now, you may question whether this matters, but that's a separate issue. The issue here is the minimum definition of a disclosure, which is being able to reliably attach a particular piece of data to a particular person; that is exactly what the zero allows with these data, in this case "not highly paid" or "is a pop star".

Okay, so now we have a situation where one highly paid professor has turned up late to the party. Do we still have a problem? Well, now we can't make the inferences I was talking about on the previous slide. We have a single highly paid professor, so if I hear that conversation about the lecture, I can't be certain that the professor concerned is not highly paid. But suppose it's me. If it's me, then I know who the highly paid professor is: it's that one in the corner there, that's me. And critically, I can take myself out of the table, and then what we're left with is a table with a zero in it again. This is called a subtraction attack. The known units needn't just be me: it could be that I know somebody else who's a highly paid professor and can take them out of the table, or that I know some low paid pop stars and can take them out. That's how a subtraction attack works: it's the removal of known units from a set of aggregates in order to produce a set of data which is disclosive about those who remain in the table.
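Here is a minimal sketch of that subtraction attack in Python. Only the five low paid pop stars come from the example above; the other counts are invented for illustration.

```python
# A minimal sketch of a subtraction attack on a table of counts.
import pandas as pd

# The published counts after the highly paid professor arrives late.
# (Only the 5 low paid pop stars are from the video; the rest is invented.)
table = pd.DataFrame(
    {"not highly paid": [20, 5], "highly paid": [1, 15]},
    index=["professor", "pop star"],
)
print(table)  # no zero cells, so the table looks safe on its own

# I know my own cell (I am the one highly paid professor), so I can
# subtract myself as a known unit...
table.loc["professor", "highly paid"] -= 1
print(table)

# ...which restores the zero: for everyone remaining, "professor"
# implies "not highly paid" and "highly paid" implies "pop star".
```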
Okay, here's another type of problem. When we're thinking about outputs, certainly aggregate outputs, as a data product or as a source of open data, we tend to consider each output on its own. The issue that can come up is that outputs can interact. So here's a very simple example, where we have two-by-two tables of counts again, of three variables: tenure, age, and whether somebody has HIV or not. Now, we have no zeros in these tables, so there's no immediate possibility of the kind of inference we've just seen. Obviously, there is the possibility that any one of them could be subject to a subtraction attack, but we're considering another problem here, beyond that particular one.

What we have is three variables, and what effectively we've done is suppress the three-way table: these are the two-way margins, and we've suppressed the interior cells. The thing about this particular combination is that there is only one possible three-way table that corresponds to those two-way margins, and it's possible to work out what that table is using a technique known as integer linear programming. The table, in fact, looks like this. The key thing here is that there are zeros in the table, so it meets the technical definition of disclosive that we introduced earlier. And in this particular case the information is also sensitive: if you look at the right-hand column, if I know a young owner-occupier who is in the table, I've also learned about them that they are indeed HIV positive, clearly something which we would reasonably expect not to have disclosed about ourselves, and nor would any data subject. So that is the problem we're dealing with here: you can't assume that because a particular piece of output or published aggregate looks safe on its own, it will be safe in the context of all the other pieces of output you may want to release. (A small worked example of this kind of reconstruction appears at the end of this transcript.)

Now, the disclosure risk problem covers other types of data too, and we won't have time in this video to go into all of these; the field is expanding, as indeed the data environment itself is expanding. We're looking at network data; qualitative data such as text, which throws up a complex set of problems of its own; genomics data; stream data, data that arrive in streams rather than as data sets; and mixed data, data formed out of various different types of data linked together.

So, to summarise: disclosure is a complex topic, and there's much more to it than we could possibly cover in a short video; it's still an active research field. Whenever you're looking at a data situation, either as a data controller or a data user, you need to think about the processes of attribution and re-identification and how they might happen with those data and those outputs. Thinking through those processes will enable you to decide whether your risk is sufficiently low. Thank you.
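As promised above, here is a minimal sketch of reconstructing a suppressed three-way table from its two-way margins. The counts are invented for illustration (they are not the video's actual figures), and exhaustive search stands in for the integer linear programming mentioned in the video. The margins below contain no zeros, yet exactly one three-way table fits all three of them, and that table does contain zeros.

```python
# Reconstructing a suppressed 2x2x2 table from its three two-way margins
# by exhaustive search (a stand-in for integer linear programming).
# All counts are invented for illustration.
from itertools import product

TENURE, AGE, HIV = ["owner", "renter"], ["young", "old"], ["pos", "neg"]

# Published two-way margins: no zeros, so each table looks safe alone.
tenure_age = {("owner", "young"): 2, ("owner", "old"): 3,
              ("renter", "young"): 3, ("renter", "old"): 2}
tenure_hiv = {("owner", "pos"): 3, ("owner", "neg"): 2,
              ("renter", "pos"): 1, ("renter", "neg"): 4}
age_hiv = {("young", "pos"): 3, ("young", "neg"): 2,
           ("old", "pos"): 1, ("old", "neg"): 4}

# For each (tenure, age) cell, try every split into HIV pos/neg and keep
# the splits whose implied tenure-by-HIV and age-by-HIV margins match.
pairs = list(tenure_age)
solutions = []
for splits in product(*(range(tenure_age[p] + 1) for p in pairs)):
    cell = {}
    for (t, a), pos in zip(pairs, splits):
        cell[(t, a, "pos")] = pos
        cell[(t, a, "neg")] = tenure_age[(t, a)] - pos
    if all(sum(cell[(t, a, h)] for a in AGE) == tenure_hiv[(t, h)]
           for t in TENURE for h in HIV) and \
       all(sum(cell[(t, a, h)] for t in TENURE) == age_hiv[(a, h)]
           for a in AGE for h in HIV):
        solutions.append(cell)

print(len(solutions), "table(s) are consistent with the margins")
for key in sorted(solutions[0]):
    print(key, solutions[0][key])
```

With these invented margins the search finds a single consistent table in which the (owner, young, neg) cell is zero, so every young owner-occupier in the data must be HIV positive: the same shape of inference as in the video's example.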