Hi, my name is Maureen Haaker and I'm here with Anka, and we're going to talk about anonymizing qualitative and quantitative data. I've worked with the UK Data Service on a range of projects for over a decade now, everything from ingesting qualitative data to developing teaching resources, answering queries and working on data management training. I also teach at the University of Suffolk in the Department of Childhood and Education. Anka, did you want to introduce yourself?

I'm a data officer at the UK Data Service as well, though only partly. My main role is at Cancer Research UK, where I work in the Cancer Intelligence team and help manage their TRE, or trusted research environment; "secure data environment" is maybe the better-known term. But yes, nice to be here today.

Okay. So what we're going to do today is give you a bit of background on anonymization: why it's important, what the theoretical underpinning is, and what the legal responsibilities are. We'll then go into a more practical overview of how to anonymize qualitative and quantitative data respectively, with a short Mentimeter exercise thrown in. We'll end with some discussion of de-identifying information, some signposting to further resources, and answers to any questions you might have.

But first, on the next slide, we're going to start by covering the theory that underpins anonymization and clarifying a few key concepts. When we talk about anonymization, we're talking about the process by which data is rendered non-personal, so that you can't attribute a characteristic to a specific person. The UK Anonymisation Network has published the Anonymisation Decision-Making Framework, which is a model to help you assess the risk of disclosure, that is, of information becoming identifiable. The National Centre for Research Methods has posted extensive tutorials on this, so this is really just a quick overview. The framework outlines three key stages in deciding how to anonymize. The first is the data situation audit: where do you want to present the data? What is your role and responsibility with respect to that data? And what are the specifics of the data itself, so what variables have been collected, and where are they stored? The next stage is the actual risk analysis: what are the realistic chances that there will be a disclosure? And the final stage is impact management: if there is a disclosure, what are the plans for what happens then?

Just a couple of key points to make here. The framework points out that analyzing the risk of disclosure is iterative. There isn't a single point where it should be assessed; rather, think of all the places the data is stored or presented. The risk of publishing data extracts, for example, should be considered alongside sharing data with colleagues across institutions or storing data on a computer. The other point I want to reiterate is that this model argues that it is not possible to fully anonymize data. "Fully" anonymizing would mean that even a participant looking at the data would not be able to identify their own answers, and stripping data down to that point basically destroys its value for analysis.
So if you want to avoid depleting the value of the data, even comprehensive anonymization will still leave at least some, albeit theoretical, space for re-identification. The idea is to balance the risk of disclosure, the probability of re-identification, down to a point where disclosure can be mitigated and dealt with. Fully anonymizing, or guaranteeing confidentiality, isn't something that's possible, and it's potentially not even desirable, as it implies the data has been stripped down to something rather worthless. Anka is now going to go into a bit more detail about some of the key concepts and explain the differences between some of these terms.

Thank you, Maureen. Okay, so we're going to look at what exactly disclosure is and why we need to consider it when anonymizing our data. First of all, by disclosure we mean identification. That happens when someone is able to identify a data subject from data or information they have access to, whether from one source or multiple sources: from one individual data set, or from combining several data sets or data collections. There are different types of disclosure, and unfortunately we don't have the time today to go into detail about them, but we've listed some resources in the final slides if you want to read more.

Moving on to anonymization. Anonymization is a process that attempts to prevent disclosure, to prevent anyone being able to identify a data subject from a specific data set. Anonymization and pseudonymization, and we'll see definitions and the differences between them on the next slide, are both part of what we call SDC, or statistical disclosure control. The aim of SDC is to minimize or mitigate the risk of identification to an acceptable level that still allows researchers to maximize data use. As Maureen just explained, in order to maintain some data utility, we wouldn't want to fully anonymize the data in the first place, because its use for research would become non-existent if we stripped it of all detail. So we want to use the data to its full potential, or as close to it as possible. And at the bottom of the screen here we see that as disclosure risk goes down, information loss goes up. The more we reduce that disclosure risk, anonymizing more and more to the point where, if I had contributed to a data set and looked at it, I couldn't tell which of 200 observations is my row, the more the information loss goes up. By information loss we mean those valuable details that we would need for our research; that's what we also refer to as data utility in later slides.

Okay, so I mentioned anonymization, and I also mentioned pseudonymization. On the screen here we have a few definitions. Anonymized data is information that does not relate to an identified or identifiable natural person, or personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
So no one, not even the data owner, can re-identify the data subjects. That is the official definition from the GDPR, and we're going to look at different examples of anonymized and pseudonymized data today to see this in practice.

And then we have pseudonymized data. This is still identifiable data: it still contains some of that detail we spoke about earlier, so we haven't stripped out enough detail to call the data anonymous, let alone fully anonymous. Pseudonymized data will still contain information that allows us to use it in our research. Formally, it is the processing of personal data in such a manner that it can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures. So pseudonymized data will still contain what we're going to call indirect identifiers in today's session, but it will not contain any directly identifying personal information. It will still contain valuable information for research purposes, though.

Okay, some more on anonymization and pseudonymization. According to the ICO, re-identification describes the process of turning anonymized data back into personal data through the use of data matching or similar techniques. It's important to mention that the DPA does not prohibit the disclosure of personal data, but any such disclosure has to be fair, lawful, and in compliance with the data protection principles. This is important because we're going to touch on consent for data sharing, and consent can go as far as "yes, I'm happy to have my personal data shared". Of course, these would be quite specific cases, but that would be an example of a lawful way of disclosing personal data.

So what are some aspects to think about when we talk about anonymization and pseudonymization? The age of the information is important: data generally becomes less sensitive over time, but we still need to consider the ethical implications of processing that data in any way. The level of detail: this is going to be mentioned quite a lot today, because the level of detail in a data set is what allows us to make the distinction between an anonymized, a pseudonymized, or a de-identified data set. And the level of detail matters very much in context, which is very important when thinking about a data set in general. What kind of detail do we have? Is it about private life, or about more public matters such as working life or life satisfaction? Context is especially important in SDC when we talk about secure data, data considered confidential that is available in secure labs, and when we apply SDC to the outputs coming out of secure labs: context determines whether something is disclosive or not. A rule of thumb: try to assess the effect, if any, that a disclosure would have on any individual concerned.
It might be, as I said earlier, that some people are happy to have their identities shared in a data set, or are informed on a consent form about the degree to which the data they provide will be pseudonymized or anonymized, and are entirely comfortable with that level of information being shared. Or it might be that the data collected is about, for example, their working life, information they're happy to share. Again, think about context: what type of information is contained in the data set, and is it in any way sensitive or problematic for that individual?

Another aspect to consider is the data environment. Perhaps you've heard the term "functional anonymization"; it has been around for a few years now. It basically refers to how important the environment is: the environment the data is kept in, the data governance policies around that data, and the people who have access to it. Does that environment effectively make the data anonymous? If you don't have the key, if there are no external sources for linking the data, and if the people with access are trained and vetted, what we call safe people in the secure lab context, the data may well be functionally anonymous in that environment. We're not going to focus too much on this today, but we wanted to mention it because it has been around for a while and it's certainly useful to consider in certain scenarios.

Okay, so classifying information. We have different types of information in a data set; I've put "variables" in brackets here, but we can just refer to it as information. Because we spoke about SDC, and SDC usually uses the word "identifiers", when I refer to identifiers I just mean a specific piece of information. According to SDC, we have identifying variables, of two different types, direct and indirect identifiers, and then we have sensitive variables.

For identifying variables, firstly there are direct identifiers. This is information that directly identifies data subjects. A few examples are on the screen: names, address, national insurance number, social insurance number, but really anything that points to a specific individual. We're going to look at an example in the Mentimeter; I don't want to give it away just yet, but we can have a discussion there about whether you think it is a direct identifier or not.

Then we have indirect or key identifiers, as they're referred to in SDC. This is information that in combination may be able to identify someone, and it can also potentially be linked to other sources of data, for example the Electoral Register. Examples of indirect identifiers are gender, age, region (or any type of geography), occupation, and income. Notice how income appears both as an indirect identifier and as a sensitive variable.

Moving on to sensitive variables: this is information that is often subject to legal and ethical concerns.
Examples here would be criminal history, sexual preferences and behavior, political affiliation, medical information, and income. These can lead to secondary or attribute disclosure even if identity disclosure is prevented. What this means is that we might not be able to identify one individual in particular, but we would still be able to learn something new about a segment of the population; this is also called class disclosure in SDC. Again, we won't go into much detail about that in today's introductory session, but if you're interested in learning more, we've linked some resources for you. So this is just the classification; we're going to see examples and go into a bit more detail, but this was just to present it all on one screen.

And here's a slide that puts in context a few terms we've already touched upon. We talked about data utility and information loss; on the screen we also have access controls, which are very important to mention when it comes to different types of data and the safeguards that different types of data need in order to be used safely. On the top arrow, running from left to right, we start with raw identifiable data. This is data that, if I collected it, I am the person who holds it, and it would possibly contain personal information. Also worth mentioning here is the principle of data minimization: only collect data that you absolutely need. Sometimes personal data is collected mostly for admin reasons and isn't used in the research per se; with that in mind, make sure to collect only the data that you absolutely need.

Moving to the right, we have de-identified data. Think of it as the same raw data set, let's follow one data set along this whole arrow, but de-identified, which means we've removed all the personal identifying information. Thinking back to the previous slides, we've removed the direct identifiers: any names, any emails, any IP addresses, and so on.

Moving further right, we have pseudonymized data. And notice that we don't necessarily have any clear way to demarcate where pseudonymized stops and anonymized begins. We talked about fully anonymizing; really, it's a continuum. We can have pseudonymized data as close to anonymized as possible; if we think of the far right of that arrow as fully anonymized, then it's pseudonymized all the way along, at different levels. What pseudonymized means, as we saw on a previous slide, is data that still contains those indirect identifiers, with the direct identifiers, the personal identifying information, already taken out.
But we are still left with information that, used together, could potentially identify someone or reveal something new about someone: the demographic information I mentioned on a previous slide, so gender, age, geography, occupation, income, ethnicity, et cetera. And then at the far end we have anonymized data, where the data has been completely stripped of identifying information, indirect identifiers included; everything has been taken out.

Then we have the other arrows on the screen. As we move from left to right on that first arrow, from raw identifiable data to anonymized data, the data utility arrow runs exactly the other way: as we move back from anonymized data towards raw identifiable data, data utility goes up. The more detail we have in the data, the better for research and the more valuable that data will be. The same goes for access controls: the more detail in the data, the more access controls need to be applied. Anonymized data can be open access; we don't need to apply any sort of restriction to that. But the more information the data contains, the more safeguards we need to place around who can access it and under what conditions. And on the bottom arrow we have information loss, which we've seen already: the more we move from raw identifiable data to anonymized data, the more detail we strip out and the more information we lose. I just thought this would be a good way to see all these terms in perspective and how they work together. Okay. Maureen, can I pass back to you?

So, following on from Anka's discussion, it's probably also worth mentioning that researchers still fall under some legal obligations to disclose. In the UK, researchers have a duty of confidentiality, but that duty is not absolute, and it is not protected like legal privilege in the way some other professions' duties might be. Specifically, researchers are still obliged to inform the appropriate authorities where there is abuse of children or vulnerable adults, or where there are crimes covered under terrorism prevention legislation; some of you will know this as Prevent, and it does include things like money laundering. In my own experience, for my PhD research I went through the Health Research Authority for ethical approval, and they specifically scrutinized the consent forms for clear statements clarifying the limits of confidentiality. This included thinking about where raw data might be presented, such as in publications, and also informing participants that I could actually break confidentiality. Including all of this information of course meant that my consent forms and information sheets were quite comprehensive in terms of what I had to discuss with my participants, and you can see a sample here of some of that text. So I think the point here is that anonymization is a bit of an art and it's not absolute, and I hope that's one of the points you take away from this.
Now, hopefully you're recognizing that anonymization is certainly one strategy for protecting participants, but it is not the only one. Anka mentioned access controls. When you're at the end of your project and you're sharing your data, either because it's a requirement of the grant you received (for example, the ESRC requires you to deposit data at the end of your project) or because you want to get the most value out of your data by archiving it and sharing it with the research community, you should also consider access options for that data.

At the UK Data Service, there are three access options, which can limit who can access data, and where and when. You can have, for example, open data, where anyone can download the data directly from the website. Usually permission for this has to be explicitly given by participants during the consent process, and the data should not hold any kind of risk to participants. Most data, however, is held under our safeguarded licenses, which means that anyone wishing to use the data has to register with us, sign our End User License, and tell us what they're using the data for. The End User License stipulates that users cannot share the data onward, that they have to protect it, and that in the unlikely event a participant's identity is uncovered, they will not share that identity. There are some other terms and conditions, but essentially it holds users legally responsible for the protection of participants. Finally, for very sensitive data, there is controlled data, which restricts access: you need to undergo special training to access it, and it must be accessed from our servers; we won't simply transfer the data over to you. These options represent access to data that's been shared, but you can apply any of these principles at any point in your own project, thinking about who is allowed to access the data, and when and where it can be accessed. There's a lot more information about data management in the Managing and Sharing Research Data book, which also gives lots of case studies and examples that talk through some of the principles we've covered so far, so do check that out. We also have data management pages on our website.

So now we're thinking about specific types of data. While the basic principles of anonymization apply to all types of data, how they are applied varies a bit. I'm going to go through anonymizing qualitative data before handing over to Anka to talk through quantitative data.

With anonymizing qualitative data, I'll start with just a few tips. Usually qualitative data is anonymized at the point of transcription; this way you have lots of opportunities to check the anonymization while editing transcripts. You then still have the audio files, which are usually unanonymized. There is one exception to this rule, and that is longitudinal studies, where linkages are needed between data types or across time; there it might be better to wait and anonymize at a later point to avoid losing any of those linkages. Coming up with a clear anonymization plan will help ensure consistency across the project. This would outline what you do with direct identifiers and what you do with indirect identifiers. You should also mark where something has been replaced; in qualitative data, this is usually done with square brackets, and a quick sketch of how such replacements can be scripted follows below.
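To make that concrete, here is a minimal, hypothetical Python sketch of scripted transcript pseudonymization. It is not a UK Data Service tool; the names, places, and the exact mapping are invented for illustration, and only the square-bracket convention comes from the discussion above.

```python
import re

# Hypothetical pseudonym mapping agreed in the anonymization plan.
# Reusing the same pseudonym for every occurrence of a name preserves
# the links between people that blanket redaction would destroy.
PSEUDONYMS = {
    "Lucas Miller": "[Ken]",        # invented full name for the example
    "Lucas": "[Ken]",
    "Ipswich": "[town in the East of England]",
}

def pseudonymize(text: str, mapping: dict) -> str:
    """Replace direct identifiers with bracketed pseudonyms.

    Longest keys are applied first, so "Lucas Miller" is matched
    before the shorter "Lucas" rule can split it.
    """
    for original in sorted(mapping, key=len, reverse=True):
        text = re.sub(re.escape(original), mapping[original], text)
    return text

print(pseudonymize("Lucas Miller has lived in Ipswich since 1972.", PSEUDONYMS))
# [Ken] has lived in [town in the East of England] since 1972.
```

The square brackets flag every edit for future users of the transcript, which is exactly the annotation point that comes next.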
But as long as whatever annotation you're using is clearly documented, literally anything can work; square brackets just work well because you're unlikely to use them in normal text, so they clearly denote that an edit or change has been made. If you are not annotating in the text, then keeping a record of those changes in a change log could be done instead. But above all, try to avoid blanking out completely. You lose a lot of context when you do that: you can't tell where the same name has come up again, or what the relationships between people are. All of that lovely context that makes qualitative data useful is lost with that kind of redaction. So don't just blank out names and details; use pseudonyms.

Finally, avoid over-anonymizing. I found in my time digitizing and ingesting qualitative data sets that it's really easy to over-anonymize the first few times you do it, and the more experience you get, the less you over-anonymize. You should be utilizing access and consent options to help ensure protection, so that anonymization can be as light-touch as possible in order to preserve the value of the qualitative data.

And on the next slide, we've got a really basic way of anonymizing an interview transcript; hopefully this doesn't look entirely boring to you. The obvious details are changed here: Lucas is given the pseudonym Ken, with no last name; the date of birth is aggregated to a wider category of just the year; and the location is given as a region rather than a very specific place. But the detail of what the interviewee has said is largely left alone. There are different approaches, obviously, but this gives an idea of the sort of things you might be looking for.

Okay. I mentioned that you should use consent options alongside the anonymization plan, so here's some wording that could be included in consent forms. It tells participants the exact wording that might be used when data is shared as part of the dissemination process, or that the data will be shared at the end of the project. For example, the bottom one says: I am asking for permission to use anonymized quotations and narrative themes; I will anonymize any identifying details such as your name and address. It also notes who might have access to the unanonymized versions of the data. It does take time to explain this to participants, and it can be challenging to write it in a way that makes clear to participants what is going to happen to their data. But going that extra step and ensuring that they actually know what these things mean and how their data will be used ensures that they're informed, that they can ask questions, and that you're acting with the permission of your participants.

So once you have a clear anonymization plan and consent in place, don't forget about access. Different types of data may also have different access conditions. For example, we have a collection at the UK Data Service on the foot and mouth disease outbreak, from a project conducted in 2001-2003. There was audio, transcriptions of the audio, and diaries kept by participants. The majority of the collection was available under a safeguarded license; however, some of the more sensitive interviews were embargoed, so they were not available for download until 15 years had passed from the end of the project.
So time is one of those things that helps when you have sensitive data. The audio files are only available by permission from the researchers. So you can control, at the level of a file or a data type, what is available to whom and when.

On the next slide, we've got a different example: the Pioneers of Social Research, which is 43 life history interviews with prominent social researchers. We did have an anonymization plan for this collection, which assessed the risk of ethical issues, such as discussion of the health conditions of non-participants (people not involved in the research) or of police and legal investigations not in the public domain. Beyond that, the participants themselves were so well known that there was very little point in trying to de-identify those life histories, and doing so would really have defeated the purpose of the research itself. So while we did apply some light editing where we felt there was an ethical issue, on the whole this collection was made available through open access, and that was a point of the consent form: Thompson, who did the research, got permission to make those interviews available openly. So that collection is freely downloadable by anyone. It is a rare thing to have a qualitative collection that's open, but there is very specific consent for it.

And on the next slide, we've got another collection, Managing Suffering at the End of Life, which is quite a challenging data set. These are interviews that were done with families and practitioners of patients who were put on palliative care involving deep sedation until death. There was a plan to gather consent initially, when the research was done, but the researchers ended up going back some months after each patient's death to ask for permission specifically to share and archive the data. What makes this collection really interesting is that it's sensitive data. We might describe data as sensitive because it touches on quite difficult topics, but, if you just click once, Anka, it's really important to remember that sensitive data in terms of legislation, the GDPR, means something very, very specific. When data does fit that definition of sensitive data, additional consideration should be given to it, and consent then becomes very important for ensuring your GDPR compliance. All right, I think it's back to you now, Anka.

Thank you. Okay, so next we have some practical steps for how to go about anonymizing. We're going to have three steps and talk through each of them. As Maureen said, I will be talking through quantitative data, but some, if not all, of these steps apply to qualitative data as well; that's definitely the case with the first step. Step one is to identify and remove all identifying information, the direct identifiers as we discussed earlier, in line, of course, with what participants agreed to. If participants were happy to share their data containing this information, if they were happy to have their identities made public, then that's fine; we just need to adapt the redaction in line with what they agreed to. This is, I would argue, easier to do for quantitative data, because it's just about identifying which variables fall into this category:
which variables would be considered identifying information, and then removing them. We can also recode them, perhaps if we have a smaller sample size, or replace them if we want to give pseudonyms; I have seen cases of that as well, but usually it's just about removing those variables. For qualitative data it can of course vary: we can replace with pseudonyms, or redact, or leave the information in, in line with what participants agreed to. So this is step one: identifying the personal identifying information, then removing it or, in some cases, replacing it with pseudonyms.

Step two, then: once we've been through step one, we have de-identified data. Thinking back to the arrow we saw earlier, moving from left to right, we had raw identifiable data; we've done step one and now we have de-identified data. But we still have some indirect identifiers present in the data, even though we've already taken out all the directly identifying information. We still have information about our research participants. Depending on what was collected, and these are just some examples, we might have age, or date of birth in different variations: just the month and year, or the full date of birth. And of course, the more detail there is, the more information: if we actually have the day, that makes it more disclosive. We might have gender, occupation, income, any type of geography (with different levels of aggregation there), ethnic background, and religion.

So step two is identifying what these indirect identifiers are and the level at which they're present in the data, that is, how specific they are. I gave the example of date of birth: do we need the exact day and month of birth, or can we just leave the year in? Taking out the day and the month reduces the level of detail. Same with geography: we can aggregate, so rather than giving the village, we can give just the county, or go even further and give just the country. Same for ethnicity: this would probably be a categorical variable, but it might have an "other" category with very low counts, and we can recode into fewer categories, so rather than eight categories we might have five. Or for income, we can round it: we don't have to keep a very specific number; we can round it, or put it into categories. All of these are methods of looking at our identifiers and thinking about the level of detail. And when we get to this point, I think we already need to keep in mind how we're planning to share this data, and under what access controls. Because, as Maureen touched on, we don't want to over-anonymize; we don't want to strip out too much information, because this would be very valuable information for future research. A small worked sketch of these recodes follows below.
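As a hedged illustration of this second step, here is a small Python/pandas sketch of the kinds of recoding just described: date of birth down to year, geography up a level, and income into bands. The column names, values, and band edges are made up for the example; they are not from any real collection.

```python
import pandas as pd

# Made-up survey extract; direct identifiers already removed (step one).
df = pd.DataFrame({
    "dob": pd.to_datetime(["1984-03-12", "1991-11-02", "1946-07-30"]),
    "village": ["Dedham", "Lavenham", "Orford"],
    "county": ["Essex", "Suffolk", "Suffolk"],
    "income": [31250, 48900, 19700],
})

# Date of birth: keep only the year, dropping the day and month.
df["birth_year"] = df["dob"].dt.year

# Geography: aggregate up from village to county, then drop the fine level.
df = df.drop(columns=["dob", "village"])

# Income: band into categories instead of keeping exact figures.
df["income_band"] = pd.cut(
    df["income"],
    bins=[0, 20_000, 40_000, 60_000],
    labels=["under 20k", "20k-40k", "40k-60k"],
)
df = df.drop(columns=["income"])

print(df)
#     county  birth_year income_band
# 0    Essex        1984     20k-40k
# 1  Suffolk        1991     40k-60k
# 2  Suffolk        1946   under 20k
```

Each recode trades precision for protection, which is exactly the disclosure-risk-versus-utility balance from the arrow slide.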
And just a little word on how important good-quality metadata is for this process. If we have a data set where the variable labels and the value labels are not very clear, that is very detrimental for future research. If ethnicity has, say, five categories, but the data just shows the numbers one to five and we don't know what one is, what two is, what three is, that's a problem. So it's very important that we have very good metadata, so that someone else can understand that data in the future. We also have another webinar on data documentation, so if you're interested in how to document your data, make sure to join that next time; we ran it a couple of weeks ago, but we're planning on doing it again soon.

And then we have step three. Thinking about step two and all the indirect identifiers that we identified, we need to check frequencies to spot potentially disclosive information, and by this we refer mostly to small counts. I already gave the example of an ethnicity variable where, when the data was collected, there was also an open-ended "other" field. It might be that that "other" category has just one, two, or three observations, and in combination with all the other information in that data set, the risk goes up: if we know that person lives in a small village in Essex, to pick the county where I live, and we know how much they make a year, then once we put all this information together, the risk of disclosure increases. That's what we mean by small counts. However, if we recode that variable into, say, five or six categories, and each category has at least ten or so observations, then that risk of disclosure reduces.

Checking outliers: for quantitative data, outliers are an issue, and again, they are usually very low counts; that's what they are. Going back to the previous slide, take age as an example. If we have a data set with only one or two or three people over the age of, say, 100, those observations would definitely stand out, especially combined with the rest of the information in the data set, such as geography, gender, or any other indirect identifier; the risk of disclosure increases. So check for outliers and then recode: for age, look at the frequencies and, depending on what we have in our data set, maybe recode to 85-plus or 90-plus; it depends on the frequencies, so just make sure the counts are there. And I would advise you to check with the archive or repository, keeping in mind where you're making your data available and where you're publishing it, because they might have a specific threshold for these small counts; they might say the lowest count can only be five, for example.
So when you check your frequencies, if you have a count of three, or if you have outliers, you need to recode to make sure nothing is lower than five. That's why I would recommend checking the archive's, publisher's, or repository's guidelines. They do differ: some have thresholds of five, some of ten; HMRC has a threshold of 30, for example. So make sure to check that.

And then the last thing: checking any string variables, like that open-text "other" I mentioned, to identify first of all whether they contain any personal or potentially disclosive information, something like "I worked for [a specific company] for 30 years", or "my brother has a rare type of disease", anything like that in open-text variables; and then possibly recoding them, especially if they relate to one of the indirect identifiers, into a bigger category.

Okay, so in terms of anonymization techniques in general, some techniques to address that risk of identification, again keeping in mind where we're publishing first. If we know the data is going to be available under safeguarded access, we don't have to over-anonymize, so I would recommend first checking with the archive or repository where we're planning to share the data, just to prevent over-anonymizing. But when you do come to it, some techniques are: reduce the level of detail, aggregate, or reduce the precision, as in the geography example I gave earlier; recode categorical variables, those indirect identifiers we spoke about, into fewer categories with higher counts; suppress specific values of indirect identifiers for some units, where by suppressing we just mean removing, since for a very low count that we cannot recode it might be that we have to remove the value; generalize the meaning of text variables, so as mentioned on the previous slide, check free-text responses and make sure there's nothing of concern there; and restrict the upper or lower ranges of continuous variables to hide outliers, which we already looked at with the age example, recoding depending on what we see when we run our frequencies.

How to decide? Check the frequencies of those indirect identifiers, see what you're working with, and then, once you have guidance on what the threshold is, apply it. If there isn't any guidance in particular, the threshold usually used is at least five; some would say at least ten. What I've seen most is five, but nothing lower.

Anonymizing georeferenced data means replacing point coordinates with non-disclosive variables. Of course, this depends on what those coordinates point to, but we need to be careful. I remember a few years back we got a data set collected using phones, with a lot of latitude and longitude data, looking at the places where people were getting food throughout the day.
But were some of those places people's homes? Is that a possibility? If it is, then of course we need to address it. So again, context is very important in that example: thinking about what the data is really telling us and whether that is problematic or not. And of course, if you have any doubts, we are here to advise, so if you have any specific questions, do get in touch with us.

Okay, moving on to a specific example. On the screen here we have a small mock data set, if you can even call it that; it's made up. We have four variables, four columns: age, gender, profession, ethnicity, so indirect identifiers, and we can see the information on the screen. The next slide shows some techniques for reducing the disclosure risk for this particular example. Looking at the age variable: we talked about top coding. We had some outliers, someone who was 118 and someone who was 89, so we top coded those to 80-plus. For ethnicity, we recoded into fewer categories, so we have less precision. This is just an example of how that works for a specific data set.

Okay, another example here. Think back to that arrow from a few slides back, running from raw source data all the way to anonymized data; here's an example with a piece of text. The raw source data, which is also made up, just to mention that, contains the name, the age, the exact date this person went for her chemotherapy treatment, and the exact hospital. Once we de-identified it, we of course removed the identifying information; we did, however, keep the gender, which replaces the name in a way that still provides some information. As you'll notice, we're trying to leave as much information as possible, but we need to keep in mind how we're making it available: this de-identified version of the data would probably go under some sort of safeguard, so it wouldn't be available openly. We still have indirect identifiers: we have the gender, we have the age, we still have the exact date, and we also still have the location. So this is de-identified, but it still has quite a lot of detail.

Then we move on to pseudonymizing it, reducing the detail even further. We still have some indirect identifiers, but their precision has been reduced: we no longer have the exact date (sorry, I don't know why the slide changed it from April to May; it should still say April, just the month and not the day), and we replaced the exact location, aggregating it to just "a hospital in" plus the county. And then, moving even further along, we put the age into an age bracket, 40 to 50, reducing the level of detail; kept "went to chemotherapy treatment"; and aggregated the geographic information even more, to just the country. A short code sketch of the small-count and top-coding checks behind the earlier mock table follows below.
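Before moving on to software, here is a rough Python/pandas sketch pulling together those step-three checks: frequencies against a small-count threshold, plus top coding the age outliers to 80-plus. The data, the threshold of five, and the 80-plus cut are just the illustrative values from the slides, not universal rules.

```python
import pandas as pd

# Mock data echoing the slide example: an age variable with outliers
# and an ethnicity variable with some very small categories.
df = pd.DataFrame({
    "age": [34, 41, 118, 52, 89, 47, 33, 41],
    "ethnicity": ["White", "White", "Asian", "White",
                  "Other", "Asian", "White", "White"],
})

THRESHOLD = 5  # check your archive's guidance: 5 and 10 are common, HMRC uses 30

# Step-three frequency check: flag any category below the threshold.
for col in ["ethnicity"]:
    counts = df[col].value_counts()
    small = counts[counts < THRESHOLD]
    if not small.empty:
        print(f"Small counts in '{col}':")
        print(small)

# Top coding: collapse the 89 and 118 outliers into a single 80+ band.
df["age_banded"] = df["age"].apply(lambda a: "80+" if a >= 80 else str(a))
print(df[["age", "age_banded"]])
```

After any recode, it is worth re-running the frequency check until nothing sits under the agreed threshold.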
So hopefully these examples are useful for seeing the transition from raw source data to anonymized data.

In terms of software that can be useful for SDC: there's sdcMicro, a free R package, which also has a friendly interface. It can be used to identify issues and then to carry out the anonymization, and the package, the materials, and the documentation are all available online, so it's a very useful tool. Then there's QAMyData, a tool that we developed at the UK Data Service. This is also an open-source tool, and it provides what we refer to as a health check for numeric data: it's free to install on your machine, you put your data set in, and it gives you a health check. It will tell you, for example, if you have any out-of-bounds values for categorical data, among a whole set of other checks it performs. I've included links, so there's more information on our website, and you can download it from GitHub. And then ARX and mu-Argus are also options for anonymizing, or redacting, data in an automated way, so we don't have to do it all by hand. We just wanted to present some options; I've included some information so you can decide which one you prefer to use. Okay, is this back to you, Maureen, or is it still me? I forget. Good question. Shall I talk about it?

So I mentioned anonymization plans a bit earlier, and how we create an anonymization plan for every collection we hold. Just to go through what that anonymization plan encompasses. We have a first section on the project background. This is normally just a quick paragraph, about 150 words or so: a little summary of what the project was and what the aims and objectives of the research were. We then have a section titled file management, which basically outlines the structure of the files holding your data. Do you sort by data type, perhaps, or by file type? We sort by file type, so you'll see directories called PDF versus Excel versus RTF; that's how the archive sorts some of those. It's just a quick overview of the overall file management, and you may also want to include who has access to which directories within that structure. And then we have mandatory anonymization. This usually covers direct identifiers. Now, I mentioned the example of the collection that was openly available; its plan simply said that explicit consent had been given by each participant for their information to be released openly. But that is a pretty specific example; most plans will say that direct identifiers like names and contact details will either be removed or given a pseudonym. And you can list, for example, places, ages, dates, and so on, and what the specific plan is for each. So names are given pseudonyms; contact details, it depends: things like phone numbers we usually replace with "[phone number]" in brackets, that sort of thing. It's just: what is the plan, and what are you going to replace these things with when you see them?
And then we also have possible anonymization. This usually covers sensitive information: potentially libelous statements about others, details of legal cases, medical information of those not involved in the study; just about anything you can think of that might be considered sensitive could fall under possible anonymization. Rather than listing everything out, it's about what is likely to come up, and when it does, what the plan is. Usually in our anonymization plans it's a quick statement that says that where these are identified, they're taken to the line manager for a decision about whether they will be edited, redacted, or left as is; so it usually goes up the chain, as it were. But you might have your own plan: if it's medical information, you may know that you want to redact it, or that you want to aggregate it to something wider. It's about what might come up based on your specific project, the kinds of questions you asked, and so on. So that's anonymization plans.

And then we have a kind of final overview. We keep saying that anonymization is one thing, but access controls are another, and getting consent is vital. So we just wanted to combine all of this and say there's a three-pronged approach to protecting participants. The first is ensuring that there is consent: consent to share the data in any kind of way, shape, or form, even if it's just for publication, which for researchers may seem obvious (of course we're going to publish this, why wouldn't we?). But you need to tell your participants exactly how you're going to disseminate even extracts of the data. So ask for consent to share, be really well informed yourself about the risks and benefits, and be realistic with your participants: this is what I'm going to do with the data, and here is the risk to you. Then they can make decisions about whether they want their data deposited, whether they want to be written up as a case study in your publication, whether they want specific types of details shared; it also helps them make decisions about what they're willing to tell you or not tell you. Second, you should of course anonymize, but think about minimizing any kind of damage to the data, making sure that the anonymization doesn't impede the integrity of the data. This is particularly difficult if you have image data, which we didn't discuss in detail; when thinking about the importance of avoiding data loss, audio, visual, and image data are particularly difficult. So with that kind of data, make sure you've got explicit consent to share those images or that audio, but otherwise anonymize as light-touch as possible to avoid damaging the data. And then finally, regulate access. If you're sharing the data with a place like the UK Data Service, there will be a kind of end user agreement, a license to be signed. You might also consider an embargo, keeping the data back for a while and releasing it at a later date. And if you've got sensitive or potentially disclosive data, you might require permission-only access via the data depositor. All of these strategies combine to enable data to be shared, even in very challenging and difficult research contexts.
So we do hold collections covering all kinds of really sensitive topics. All right, now we're going to go to Menti, so I'll hand back to you, Anka, for our Menti.

So if you could go to menti.com right now. As you go into menti.com, it'll ask you to input a code, which has just gone off the screen; I'll put it in, just give me one second. There we go. I'll share my screen again and pop the code in the chat. So when you get to menti.com, you don't need to register anything; you just type in our code and you'll be able to join our Menti. We're just going to wait a few moments. You have the information at the top as well: join at menti.com and use code 17288986. You can do this from a separate tab, or you can use your phone or a tablet, anything really. I can see we already have eight of you in, now ten. Okay, we'll just wait a couple more moments and then we'll start.

So we're going to talk about different types of identifying information. Of course, we've already gone through the slides and provided definitions and examples. This is not to test you in any way, so don't worry about that, and your answers are anonymous as well. This is really just to have a quick conversation around some of the concepts we covered today and look at some more examples. And even if you're not able to join, hopefully you can see my screen, so you'll see everything and can of course still participate. I see that number keeps going up; we have 21, but it's slowing down, so I'll probably make a start. 24, okay, three more. 25. Okay, let's start.

So, just a quick introductory question, for us to know what type of data the people who joined this workshop are interested in: what type of data are you looking to anonymize, or are working with at the moment and will think about anonymization techniques for? Okay, I see the most popular answer is both, followed by qualitative. Thank you; oh, we have 42 people now, and that's great. So the predominant answer is both; that's good, so we know who to design this for. Okay, I don't know why this is happening; Mentimeter sometimes does very strange things, and I did test this today. All right, that's fine; hopefully you can still see my screen. We've already done this slide, so I'm just going to move on to the next one.

Okay, so the question is: what information would you think about when talking about anonymizing data? Thinking of a data set or data collection you're working with, what is the information in it that you would think, "I need to address this"? Okay, so we have personal data, yes: we need to de-identify that; that was the first step we looked at. Jobs and roles, yes: that can be an indirect identifier as well, helping narrow down who someone might be and increasing the disclosure risk. Age, geographical information, small numbers, small Ns, yes, very good: that's about looking at frequencies so we can aggregate and apply some anonymization techniques. Location, yes, very good. Income, occupation, facial images and MRI scans, yes. Age, security number, address, salary: this is all very good.
I saw pet names; that's a funny one, I don't think I've seen that before, and I'm not sure whether they'd count. It's a good question: do you mean an animal's name, or a pet name for someone, which might be a specific name for that person? I don't know. Medical conditions, yes; I hadn't thought of that one. Medical information falls under the special categories in GDPR Article 9, so we need to treat it accordingly, and it's also an indirect identifier. Okay, this is all very good; these are actually more examples than we covered in the slides, so that is great, thank you very much. Reducing impact in case of disclosure, yes. Demographics, yes, very good. Let's move on to the next slide.

Examples of direct identifiers. We had a slide on this, where we looked at direct and indirect identifiers and the difference between them. So what comes to mind now? This one is a little bit of a test. Okay, so name, yes, very good. IP address, very good. Age: age is not a direct identifier, because it doesn't point to someone in particular; there's no knowing how many people out there in the world have the same exact age as me, so it's not a direct identifier. Postal address, yes, very good; that would lead us to someone. National insurance number, very good. Religion: no, that is not a direct identifier, because it doesn't point to one person; I might be a certain religion, and there are countless other people out there with the exact same religion. I see an NHS number, though; that's very good. What else? Insurance ID, yes, very good. Social security number. Date of birth: date of birth is not a direct identifier; there are plenty of people out there with the exact same date of birth as me. But it is something that increases the disclosure risk a lot. I see car registration number; I'm not sure about that one. Maureen, this is a very good question: would a car registration number point to someone? I don't have a car, so I suppose it depends; it points to the car itself, which you could map to the owner. Right, so yes, that's a very good example. Unique job title, yes; we actually have a question on that coming up, so whoever said that was ahead of the game here. I also see some examples on the screen that are not direct identifiers but indirect identifiers: age, date of birth, sexual orientation, and, oh gosh, religion. We're going to look at those on the next slide, because that covers indirect identifiers.

So, what are examples of indirect identifiers? I think we already had some on the previous slides, but that's fine; this is an exercise, and it's good to go through them. Date of birth, age, religion: yes, very good. Occupation. Postcode: postcode keeps coming up, and I didn't touch on it on the previous slide, because postcode is very disclosive; you wouldn't usually have it.
I also see some examples on the screen that are not direct identifiers but are indirect identifiers: age, date of birth, sexual orientation, and, let me see, religion. We're going to look at those on the next slide, because next up is indirect identifiers. So what are examples of indirect identifiers? I think we already had some on the previous slides, but that's fine, this is an exercise and it's good to go through them. Date of birth, age, religion, yes, very good. Occupation, yes. Postcode also keeps coming up, it was on the previous slide too, and I didn't touch on it then because it's a tricky one: it can be very disclosive. You often wouldn't have the full postcode at all; sometimes you have it in controlled data, data held in a secure lab, but often, depending on the collection, you would only have the first half of the postcode. A postcode can lead to multiple addresses, so technically it isn't necessarily pointing to one address. However, that can also happen: some postcodes apply to only two houses, for example, so it can be quite disclosive from that point of view. There's a discussion to be had and it depends on the context, but it is a very disclosive identifier, okay? Diagnosis, yes. Occupation, yes, this is all very good. Political views, yes. What else? Health conditions, nationality, village, yes. Job location, club affiliation; I'm not sure club affiliation would be an indirect identifier, it's a bit like supermarket preference. If by club affiliation you mean political affiliation, then yes, but I'm not sure being a member of a football club would necessarily qualify, just thinking about the risk of someone knowing which team I support, and I don't even want to give an example. But it's a good one, because in combination with other information it potentially could be. Region, date of birth, okay, yes, this is all very good, thank you very much. Okay, moving on: which is most disclosive, IP address, postcode, gender, or religious belief? I see IP address is the most common answer, and yes, that is correct: it's a direct identifier and the most disclosive in this list, followed closely by postcode. This is very good, thank you. Okay: is someone's job title personal information? This relates to the previous slide, and that person really was ahead of the game. I see both yes and no, and it's not quite a trick question, but it is interesting to talk about, because it depends how precise the title is. I know in a previous job I worked with someone who said they googled their exact job title and it led straight to them, because there was simply no other job title like it. So it can be, but not always: if it's a job title like, I don't know, research assistant, that's general enough, how many other people in the world have that job title? But if it's very precise, then it could be. Okay, select the direct identifiers from this list: job title, email address, gender, age, party affiliation, and national insurance number. Thank you, this is very good, that's what we were going for. Notice that job title is not one, although as we just discussed it could be, so still good for those of you who did click job title. Gender and age are not direct identifiers, they are indirect identifiers: there are countless other people out there in the world with the same gender as me, and the same goes for age, so on their own they don't directly identify anyone.
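Pulling together the postcode point and the remark that indirect identifiers matter in combination, here is a sketch that truncates UK postcodes to the outward code (the first half) and then counts how many records share each combination of the remaining indirect identifiers; a count of 1 means that combination is unique in the data and therefore disclosive. All values are invented for illustration.

```python
import pandas as pd

# Illustrative extract after the direct identifiers have been removed.
df = pd.DataFrame({
    "postcode": ["CB2 1TN", "CB2 3QF", "M13 9PL", "M13 0NG", "M13 2AB"],
    "gender":   ["F", "F", "M", "M", "F"],
    "age":      [34, 34, 67, 67, 29],
})

# Keep only the outward code (the part before the space), which covers many
# addresses rather than one or two.
df["postcode_area"] = df["postcode"].str.split().str[0]
df = df.drop(columns=["postcode"])

# Even with no direct identifiers left, combinations of indirect identifiers
# can single someone out: count how many records share each combination.
combo_counts = df.groupby(["postcode_area", "gender", "age"]).size()
print(combo_counts[combo_counts == 1])  # unique, hence risky, combinations
```

In this made-up extract the combination (M13, F, 29) appears only once, which is exactly the situation the final Menti question below is getting at.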
Indirect identifiers: we have geographic coordinates, date of birth, gender, supermarket preference, sexual preference, ethnic background. Very good, I see the most popular answers are ethnic background, sexual preference, gender and date of birth, and those are the ones we were going for. Geographic coordinates are a conversation, because they could be a direct identifier; it depends whether they lead to a specific location such as a home address, or just record, say, where people go on holiday. It depends what the data is for and on the context, so they can be a direct identifier, an indirect identifier, or neither, just additional information. Supermarket preference is not considered an indirect identifier either: the fact that I prefer one supermarket over another, I wouldn't mind someone knowing that, and I guess it's subjective, but it's not considered something that would add to identifying someone in particular, because it's just a personal preference. But yes, date of birth, gender, sexual preference and ethnic background are the answers we were going for in this case. We have one last question for you: de-identification is redacting the direct identifiers, as we've seen; is this enough to consider the data anonymized? Anka, just to let you know, we've got three minutes left and a couple of questions still in the Q&A that need answers, so should we stop there? Let's just answer this question and then, yeah, okay. So the answer was no, thank you very much, that's the correct answer: as the exercise showed, combinations of indirect identifiers can still single someone out, so removing the direct identifiers is only the first step. I think we're done, and we're all out of time now, so I hope you've found this useful. If you do have follow-up questions, or if something comes to you later, please do feel free to pop us an email; you can do so through the Contact Us page on our website, and we can carry on the conversation there too. Thank you all for coming, and we hope to see you at one of our future events.