So hello everyone. Welcome to this online workshop on anonymizing qualitative and quantitative data. My name is Maureen Haker. I am a lecturer at the University of Suffolk and I also work with the UK Data Service on a whole range of projects. So I've done everything from digitization to reuse projects. And I'm here with my colleague, Anka. Did you want to introduce yourself? Hello everyone. My name is Anka and I work at the UK Data Service as well. My focus is on quantitative data, and I also work at Cancer Research UK, where I manage their secure environment. So if you've heard of the ONS SRS, that's the equivalent of what we're doing at CRUK, and at the UK Data Service we have the SecureLab as well. So I mostly look at quant data, which means we have both quant and qual represented here today. So back to you, Maureen. We're going to begin by talking. I think we aim to talk for around 45 minutes to an hour, and then there are a couple of exercises as well to help you get started thinking about anonymization in your own work. Anka and I will be monitoring the questions throughout the webinar to try and answer as many as possible as we go along, but we should also have some time, hopefully, at the end to talk through any of the frequently asked questions. This is an overview of what we're planning on doing today. We'll give you a bit of background on anonymization: why it's important, what the theoretical underpinning is, and what your legal responsibilities are. We'll then go into a short exercise later on using Mentimeter, for those of you who are familiar with it, and hopefully give you a bit of a practical overview of anonymization and what some of the tips are for quant and qual data specifically. And then we'll give you some further resources at the end, a bit of signposting, so you can continue your exploration of anonymization. Okay, so we'll get started with a brief overview of anonymization theory.
So the UK Anonymization Network has published an openly available Anonymization Decision-Making Framework, and the National Centre for Research Methods has posted extensive tutorials on this, so I didn't want to go into too much detail here. But I thought this framework would be a really good place to start talking about anonymization. The framework starts with the basic idea that anonymization is not just a single process that's done at any one point in time. Instead, it outlines three key aspects of the decisions you make on how to anonymize. The first of these is the data situation audit, which means specifically considering: where do you want to present the data? What is your role and responsibility with that data? And what are the specifics of that data, so what variables have been collected and where are they stored? The next stage is to actually do a risk analysis: what are the actual chances of there being a disclosure? And just to say, a disclosure is basically where you are able to connect a specific detail to a specific person. So that's what we mean by disclosure. The final stage is impact management: if there is a disclosure, what are the plans for what happens then? There are just a couple of points to make here. This framework comprehensively points out that analyzing the risk of disclosure should be iterative, not linear. There's no single point where this should be assessed. Rather, you should think about all the places that data are stored or presented. The risk of publishing data extracts, for example, should be considered along with sharing data with colleagues across institutions or storing data on a computer. The other point I want to reiterate here is that it's not possible to, quote unquote, fully anonymize data. Guidance from the ICO within the UK currently talks about "effective anonymization" or data being "effectively anonymous".
But fully anonymizing would mean that even a participant themselves, looking at the data, would not be able to identify their own answers. And stripping data down to this point reduces its value for analysis. So to avoid depleting the value of data, even comprehensive anonymization would still have to leave at least some, albeit theoretical, space for re-identification. The idea is to balance the risk of disclosure: what is the probability of re-identification, and can it be mitigated to a point where any disclosure can be dealt with? So fully anonymizing, or guaranteeing confidentiality, I would say isn't really possible, and possibly not even desirable, as it suggests stripping data down to something that is actually rather worthless. So what we're going to go through today will hopefully give you a few tools to think about the choices you might make in this process of anonymization. So Anka, did you want to expand a little bit more here on disclosure? Thank you, Maureen. Sorry, I was muted; I hope you can all hear me. Okay, so thank you, Maureen. Maureen has given us the big picture already, but I think it's important to take a few steps back and make sure that we understand all the concepts involved, and that we don't just throw in terms like disclosure and anonymization without making sure we understand what we're referring to. Okay, so what is disclosure, and why do we need anonymization? First off, we have disclosure. I'm sure you've heard about it before, but when talking about anonymization, disclosure simply means identification: when someone is able to identify a data subject from data or information they have access to, be it from one or multiple sources. There are a few types of disclosure, the difference being in the type of information we learn about that unit of observation.
Notice I say unit of observation, so it might be a person or it might not, of course, depending on the data; it might be a company, it might be a household, a shop on the high street, etc. So depending on what information we learn about that unit of observation, that's what makes the difference in the type of disclosure we are talking about. And this is also implied by the names: we have identity, attribute, and inferential disclosure. Briefly, identity disclosure occurs when it's possible to associate a known individual with a released data record. Attribute disclosure occurs when it's possible to determine some of the characteristics of an individual based on the information available in the released data. And inferential disclosure occurs when it's possible to determine the value of some characteristics of an individual more accurately with the released data than would otherwise have been possible without it. Now, this is likely beyond the purposes of this workshop, but if you're interested in the different types of disclosure, there are multiple sources of information where you can learn more, and we've linked some of these at the end of this presentation if you'd like to read further. For the purposes of today's workshop, we will just refer to disclosure in general without making any distinctions. Okay, so now that we know what disclosure is: anonymization is simply a process by which we try to prevent disclosure. And as I'm sure you know, there are multiple levels to this game, so to say. For a dataset to be, as Maureen already mentioned, truly anonymous, its value for research would arguably be very low, if any. That is why we have different levels of anonymization, as we're going to see further in this presentation as well, one of which is pseudonymization, and we will look at pseudonymization a bit more in future slides.
For now, it's important to mention that both anonymization and pseudonymization are part of SDC, or statistical disclosure control. As the name implies, SDC aims to minimize or mitigate the risk of identification to an acceptable level that will still allow researchers to maximize data use. And the last point on this slide is probably one of the most important aspects to remember, especially if you're preparing or planning to share data for reuse: the more we reduce disclosure risk, so the more we anonymize, the more information we lose from the data. Okay. We've mentioned both anonymization and pseudonymization, and for the sake of clarification, we wanted to include some definitions here for you. I will read them out as you can see them on the screen, and of course you will have the slides. It's important to remember that from a dataset which is classified as anonymized, we will not be able to re-identify data subjects, right? Not even the data owner. So if I collected, say, a particular dataset, after anonymization I would not be able to re-identify anyone, or be able to point out an observation and think, okay, I think that is X or that is Y. Also, if I was a research participant who agreed to having my data collected for a study and I looked at a fully anonymized version, I would not be able to tell which one is me. So that is what we mean when we refer to fully anonymized data: disclosure risk is extremely low or zero. Okay. Now moving on to pseudonymized: this is data that is not anonymized. This is data that still contains, more or less, some details that could potentially, especially perhaps combined with other sources of information, allow re-identification. So disclosure risk is present with pseudonymized data, to varying degrees depending on the level of pseudonymization. And we'll see that there are different levels as well.
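To make pseudonymization a bit more concrete, here is a minimal sketch in Python. The record fields and the pseudonym format are illustrative assumptions, not a prescribed scheme; the key point is that the lookup table linking real names to codes is kept separately, and more securely, than the shared data.

```python
# Minimal pseudonymization sketch: direct identifiers (here, "name")
# are replaced with stable codes, and the lookup key is kept apart
# from the shared dataset. Field names are illustrative.

def pseudonymize(records, id_field="name", prefix="P"):
    key = {}   # real identifier -> pseudonym (store securely, never share)
    out = []
    for rec in records:
        real = rec[id_field]
        if real not in key:
            key[real] = f"{prefix}{len(key) + 1:03d}"  # stable per participant
        shared = dict(rec)
        shared[id_field] = key[real]
        out.append(shared)
    return out, key

interviews = [
    {"name": "Lucas", "age": 34, "quote": "I moved here in 2001."},
    {"name": "Ana",   "age": 41, "quote": "Work was hard to find."},
    {"name": "Lucas", "age": 34, "quote": "My sister stayed behind."},
]
shared, key = pseudonymize(interviews)
```

Because the same code is reused for the same participant, relationships within the data stay intact, which is exactly why pseudonyms are preferred over blanket redaction.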
Pseudonymization also depends on, for example, what access level we're going to place the data under, but we'll see that later on. Okay. So now someone might ask: why would we ever want pseudonymized data, then? And of course, we need to think back to the previous slide, because the more disclosure risk goes down, so approaches zero, the more information we lose, and therefore our analysis would not be possible or, if possible, would not be very valuable. So we'll keep looking at this in future slides. Okay. A few more points here to mention, as I think they're important to flag when we're talking about identification and anonymization. The Data Protection Act does not prohibit the disclosure of personal data, but any disclosure needs to be fair, lawful, and in compliance with data protection principles. There are, of course, situations where disclosing certain information is absolutely fine, even personal data, for example in cases where research participants do want their data to be attributed to them. And there are multiple factors or aspects we might want to consider, or that might apply, depending on our projects and the data we collect. So we need to think about how our data is going to age. Here at the UKDA, we archive data for the long term, and we hold data which was collected decades ago. So consider here retention policies and applicable access restrictions and embargoes; for example, we have cases where data had to be embargoed for the first year, or the first five years, after publication. Second is the level of detail, and the reason it's included here is because this is the aspect we're mostly focusing on when talking about anonymization: the level of detail in the data, the degree to which the data has been redacted, anonymized, etc. Then we have context.
And this is something that we need to consider when putting together our anonymization plan. Certain aspects of our lives are, of course, more public than others, and therefore we wouldn't necessarily need to address them when thinking about anonymization. A general rule of thumb for data creators is to think about the effect a potential disclosure would have on the research participants involved. And finally, it's important to mention the role of the data environment. To illustrate this, we can think of a dataset or data collection, be it qualitative or quantitative, that is available both in an archive and openly online, let's say on a researcher's website. So we have the same data, but available in two separate environments: one that is secure and one that is not. A data archive or an accredited repository would have several safeguards in place to manage any disclosure risk present in the data, such as access controls, terms and conditions of access which users would need to agree to via licence agreements, metrics to keep a record of who accessed the data, and so on, while the data available on a website would have none of the above. So really, this is what the whole concept of data environments refers to. Perhaps you'll also find the term "functional anonymization" out there. Okay. Moving on to what is probably one of the most important slides in today's presentation: thinking about the different information that we might find in our data and what category it falls into. The slide says variables, as for quantitative data, but this applies to both quant and qual data. First we have identifying variables, and we have sensitive variables. For identifying variables, we have two different categories: direct and indirect. Indirect identifiers are also referred to as key identifiers in SDC, but today we'll mostly just refer to them as indirect identifiers.
So for direct identifiers, this is information that directly identifies data subjects. We usually think of it as personal data, personal information, and we have some examples here on the screen: social insurance number, name, address, national insurance number, IP address, NHS number, etc. Then we have indirect identifiers. This is information that still refers to individuals, but is not specific to those individuals. So we have gender, age, geography, occupation, income. This is data that, put together, can potentially be linked to other sources of data, such as the electoral register, and therefore lead to identification. And then we have sensitive variables. This is information that is subject to legal or ethical concerns. Examples here are criminal history, sexual preferences and behavior, political affiliation, religion, medical records, and income, although income is not so much subject to legal concerns; it's more of an ethical concern. And we can see that one variable can be both identifying and sensitive; again, the example there is income. Sensitive variables can also lead to secondary or output disclosure, even if identity disclosure is prevented. What that means is that they can help us learn new information about entire segments of the population. I said we're not going to make the distinction between different types of disclosure very much today, but this is a key point to mention, and if you're interested in learning more about this, we have linked some references at the end of the presentation. I also added here a link called "You're not so anonymous". It's a very interesting piece about anonymization and what potential linkages can be made between different data sources to lead to identification. So I added that for you if you'd like to read it in your spare time.
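One practical way to work with this classification is to write it down explicitly before anonymizing anything. Here is a minimal sketch; the variable names and groupings are illustrative, not a standard list, and note income appearing as both indirect and sensitive, as on the slide:

```python
# Illustrative classification of variables, following the slide's three
# categories. The names and groupings are examples only.
VARIABLE_CLASSES = {
    "direct":    {"name", "address", "nhs_number", "ip_address"},
    "indirect":  {"gender", "age", "region", "occupation", "income"},
    "sensitive": {"criminal_history", "religion", "medical_records", "income"},
}

def drop_direct_identifiers(record):
    """First step towards de-identified data: remove direct identifiers."""
    return {k: v for k, v in record.items()
            if k not in VARIABLE_CLASSES["direct"]}

row = {"name": "A. Smith", "nhs_number": "EXAMPLE-ONLY",
       "age": 52, "region": "East of England", "income": 31000}
deidentified = drop_direct_identifiers(row)
```

Removing the direct identifiers is only the first step; the indirect and sensitive variables that remain are what the rest of the anonymization plan has to deal with.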
Okay, so this is just a classification of all the different types of information that we can find in data and that we'd be concerned with in the process of anonymization, when we put together a data anonymization plan. And we are going to have examples, and we're going to look at this further. Okay, moving on. I also think this slide is very important, because it helps to put a few concepts in context and understand how they change together and affect each other. On the first arrow at the top, we have a data spectrum, if you will: different types of data with different levels of anonymization applied. From raw or source data on the far left, to de-identified data, which has had personal information taken out or redacted. Then we have anonymized data, which is one step further than de-identified data, in the sense that there has been further redaction done to some or all of the indirect identifiers, all the way to completely anonymized data. Then we have data utility, so how valuable the data is for research: the more detail in the data, the higher the utility. Then we have access controls applied by archives or repositories: of course, the more detail present in the data, the stricter the access level the data will be available under. And finally we have information loss, which we've already seen on a previous slide: the more we move from left to right on that first arrow of data anonymization, the more information we lose. Okay, so I thought this would be a good idea, to put all of these concepts on one slide and see how they affect each other. Okay, I think this is back to you, Maureen. Yes it is. So on the point you've just made about trying to balance all of these, information loss and utility and anonymization and access, it's probably worth mentioning that researchers still fall under some legal obligations to disclose.
So within the UK specifically, researchers have a duty of confidentiality. However, the duty of confidentiality is not absolute, and it's not protected by legal privilege in the same way that it might be for other professions. Specifically, researchers are still obliged to, for example, inform appropriate authorities where there is abuse of children or vulnerable adults, or where there are crimes committed that are covered under terrorism prevention legislation, and that includes things like money laundering, for example. So my own experience of this is from my PhD research: I went through the Health Research Authority for ethical approval, and they specifically scrutinized the consent forms for clear statements which clarified my limits of confidentiality. This included thinking about the possibilities where raw data would be presented, such as publications, but also informing participants that I could actually break confidentiality. Including all of that information in my information sheets of course meant that my information sheets and consent forms were quite comprehensive in terms of what I had to actually discuss with participants so that they were fully informed, and you can see a sample of some of the text here from one of my information sheets. So the point that anonymization is a bit of an art, and that it's not absolute, is one of the points that I hope you take away from this workshop. Okay, on to the next slide. The other point that I wanted to make, around other aspects which can impact anonymization, is access levels. Here is an outline of the different access levels that are used at the UK Data Service, but you may need to think about who has access to the data. When you share data, it doesn't mean it's necessarily just openly available.
If you've shared it through the UK Data Service, most of the data is what we call safeguarded, which means that users need to sign our terms and conditions in order to reuse the data. And those terms and conditions stipulate things like: even in the unlikely event of re-identification, reusers will not share the identity of participants. There are other levels as well, which are more restricted, including permission-only access, where the depositor needs to approve any reuse of the data, or even controlled data, where you have to use the data within our controlled environments, and we then perform checks on the outputs to make sure there are no disclosures. So there are different levels of access even when you are sharing data through an archive. But even within your project, you should also be thinking about the different places where you're sharing data, who has access to that, and what kind of conditions there might be in order to grant access. Okay, the next slide shows you the cover of one of our books. There are more details about the access levels, legal obligations to disclose, and anonymization in our Managing and Sharing Research Data book, which is published by SAGE. The UK Data Service also publishes resources online, so there are lots of things that you can find, and we have social media as well, where we'll tweet updates and other events of interest. But now we're going to move on to some of the practicalities of anonymizing. Qualitative data can be particularly tricky to anonymize. The very nature of it means that it's full of indirect identifiers, particularly when you have rich biographical data. So the tips that are listed here are much more focused on direct identifiers, the kind of personal data we normally think of, but a full risk assessment should be done any time that you're sharing the data, including excerpts of data.
So when you're anonymizing qualitative data, you should, for example, anonymize at the time of transcription, unless you need to link your data or you have explicit permission from your participants to use unanonymized data. You should aim for some consistency across your dataset, so consider writing up an anonymization plan, and I'll show you an anonymization plan a little bit later on, which details the broader strategies that you're using to anonymize. You should identify any replacements that you make in the text. This is usually notated with brackets; I have seen notations done in other ways, like color coding, but normally when you see a transcript and there are some brackets in there, it means that something has either been anonymized or edited within the transcript. You may also want to consider using an anonymization log, especially if you do not plan on keeping the original unanonymized data anywhere. If you ever do need to go back to check, the log would make sure that you have the original information somewhere. Within the archive, an unanonymized version is usually kept, but never released. I've heard it called a shadow collection; we call it the "no issue" folder, which holds anything that might be unanonymized, and we keep raw data as well. If you have appropriate security for that unanonymized data, it may be an option to keep it; but if not, consider an anonymization log. You should also avoid redacting. I've seen a lot of collections where it looks like one of those comical Ministry of Justice kind of redactions, or something out of the Pentagon, where they've just blacked out everything, and it does render the data very difficult to reuse. It really minimizes any kind of value. So use pseudonyms where possible to help keep the relationships within the data intact. And then the last one is to avoid overanonymizing, and Anka will talk a little bit more about this later as well.
But think about aggregating variables like towns to larger areas like regions. Or it may not be necessary to give, for example, an entire date; perhaps you can keep the month and year and just drop the actual day. That's a way of aggregating to a larger level to help hide some of that disclosive detail. One point we'll come back to later is that, particularly for qualitative data, it's better to control the access than it is to overanonymize. The detail within qualitative data is where the value of the data comes from, so it's really important to find the balance there, and controlling who can see the data is a better option than just getting rid of some of that detail. Okay. So on the next slide, we've got an example where you can see that the name is replaced with a pseudonym, and that pseudonym would then be used in place of Lucas throughout the entire text. The date and town have been aggregated to a larger level. And you can see, once the interview starts, some of the biographical details are left in. Those are potential indirect identifiers, but they may be important to keep for the research itself. So instead of taking out those details, a better approach is shown on the next slide, which is where participants are informed about how the data will be used. For example, the top text for consent talks about using extracts of interviews and photographs in various outputs, and how the interviews will be archived. And the example below is from a participant information sheet from one of my research projects, which says I'll use quotations and narrative themes in outputs, as well as who will have access to the data. Since my data went through ethical review with the Health Research Authority for this project, that meant the NHS could run audits on my data. That's not an unusual condition for ethical review.
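The transcript edits described above (consistent pseudonyms, bracketed replacements, and an anonymization log) can be sketched in a few lines of Python. The names, date, and replacement choices here are purely illustrative, not a prescribed format:

```python
import re

# Sketch of consistent transcript anonymization with a log.
# Replacements use the bracket notation described above; the
# specific names and substitutions are illustrative only.

def anonymize_transcript(text, replacements, log):
    """Apply replacements consistently and record originals in a log."""
    for original, substitute in replacements.items():
        if re.search(re.escape(original), text):
            log.append((original, substitute))            # keep the original somewhere
            text = re.sub(re.escape(original), f"[{substitute}]", text)
    return text

replacements = {
    "Lucas": "Daniel",              # pseudonym, used throughout the text
    "14 March 2003": "March 2003",  # drop the day, keep month and year
    "Ipswich": "East of England",   # aggregate town to region
}
log = []
raw = "Lucas: I moved to Ipswich on 14 March 2003."
anon = anonymize_transcript(raw, replacements, log)
```

Because the same replacement table is applied across the whole dataset, the consistency tip above comes for free, and the log preserves the originals if the unanonymized version is not being kept.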
So it's worth checking whether or not ethics panels are able to run those audits, so that your participants are informed that that could be checked. Another way to protect participants is shown on the next slide, which is where you control access conditions. The excerpt shown earlier of the anonymized interview, where details had been taken out and pseudonyms put in their place, actually had quite specific access conditions for the data. The collection combined interviews and diaries, and only some of those interviews were available to registered users at the UK Data Service. There were another few that were embargoed, which means that they were held back and inaccessible for a few years, and then released after some time had passed. Data can become less sensitive as time passes, so you may want to think about keeping it under embargo until sufficient time has passed that it could be released. Audio for this collection is also available, but it's under permission-only access, so the original investigators have to approve any use of the audio before it can be released. So you can see the different types of data have different levels of potential disclosure, and consequently there are different access conditions for that data. The next slide shows a very different kind of example, which is our collection Pioneers of Social Research. Paul Thompson conducted oral history interviews with leading social scientists, and this included extensive details about their childhoods, their educational backgrounds, and their careers. And because of the relative fame of the participants (these are leading sociologists, historians, and anthropologists, well published and well known within their disciplines), anonymization was a bit of a fruitless task, and potentially even problematic for the collection and the research purposes. You needed to know who these people were, basically, in order to get something out of the data.
So instead, Paul Thompson sought explicit permission from his participants to use their real names. Having said that, that doesn't mean we didn't have a clear anonymization strategy in place. When digitizing and reviewing the transcripts for this collection, we still looked out for issues where participants talked about the details of, for example, closed court cases, medical conditions of others who were not involved in the study, or potential reputational damage. So we still felt there were clear ethical boundaries requiring some level of anonymization and editing, even though we had consent in place that basically covered our legal use of any of their personal data. Okay, the final example that we've got here is on the next slide, and this is Jane Seymour's Managing Suffering at the End of Life. In this study, Jane Seymour interviewed families and carers who had experienced a loved one going through long-term sedation as a palliative care measure to manage pain and anxiety at the end of life. This was an ESRC-funded project, and so there was a mandate to archive the data. Because the data covered extensive sensitive personal data, which is, remember, a specific category under GDPR laws, you not only have the direct and indirect identifiers but also the sensitive data, and the consent for the use of that data had to be managed very carefully. So the UK Data Service worked with the Health Research Authority, who gave ethical approval for this project, to collect consent to share the data a few months after the death of the individual. It was deemed too difficult at the time to give informed consent about the long-term preservation of the data, so instead participants were given the opportunity to have some time away and then make a decision. I did want to point out this collection specifically because we often get queries about sensitive data, which is seen as problematic to share.
And I wanted to point out that sensitive data means something very specific in legislation, and often data isn't actually sensitive in that respect. Data protection laws within the UK define sensitive data as relating to a very specific set of characteristics, and that includes things like racial origin, sexual life, political opinion, religious beliefs, trade union membership, and such like. So there's a specific list of what is considered sensitive data. However, that doesn't mean that data you collect that doesn't technically qualify as sensitive doesn't feel intimate to participants, right? You can have a sensitive topic that isn't technically sensitive data. Even in cases where there is sensitive data, though, like this collection, it's still possible to anonymize effectively and share the data when consent and access to the data are also considered. So I think this is a great example of where they've used all of these strategies together to be able to share the data onward and make full use of the data within their project. All right, I think it's back to you, Anka, now. Thank you, Maureen. Right. So we've mentioned an anonymization plan and different levels of anonymization quite a few times. Now we're going to have a look at how to actually go about anonymization: how do we reach these different levels? We have a few steps, four to be exact. Some of them will apply to both quantitative and qualitative data, such as the first step. So first we have to identify and remove or redact identifying information, that is, the direct identifiers. Of course, this is in line with what the participants have agreed to. The examples here would potentially look different for quantitative and qualitative data. For quantitative, if we have a column with names, we will just remove it completely, right?
If we have a column with addresses, we might just aggregate it to a city or to a country. For qualitative data, it might be that we choose to replace names with pseudonyms rather than removing them altogether. And finally, if we recall the slide with all the arrows that we looked at earlier, the arrow at the top had the different levels, the different types of data. Completing this step one would produce the de-identified data, and we'll also point out where we've reached different levels of anonymization as we go through the steps. So this step would result in de-identified data. Moving on to step two. Once we've completed step one, we currently have the de-identified data, but there's still information in the data which can potentially be used to identify someone, or to learn attributes about a specific segment of the population. So we need to look for any indirect identifiers as well as sensitive information. If we remember from a few slides ago what indirect identifiers are, we have some examples here on the screen: it's mostly demographic information about participants, as well as information that would fall under special category data of the GDPR. I think I've already listed a few: racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data, data concerning health, or data concerning a person's sex life or sexual orientation. In addition, we need to think about any other information present in the data that might be of a sensitive nature, such as income, or anything really that Maureen already mentioned that might be considered sensitive but doesn't necessarily fall under legal concerns. Finally, just to mention here the importance of good metadata at this stage.
So having good, clear variable and value labels matters both for data producers and for archives or repositories when they curate these data sets. We have data sets that come in that might still have issues to address from an anonymization point of view before publication, so good metadata is very important for us as archives to be able to carry out our ingest and processing of that data before we can publish it. Okay, moving on to the next step. At this point we're talking about quantitative data only. Once we have narrowed down our indirect identifiers, we need to check frequencies — we need to look for small counts. Checking for outliers is also part of the same process here. So we're just making sure that there are no small counts in the data. Archives would have rules and guidance around what threshold to use for these small counts — it might be a count of five or a count of ten; HMRC's threshold, for example, is 30. This threshold can also depend on the access level that the data will be available under. So our recommendation here is to always liaise with the archive or repository where you're planning to publish data. We always encourage our depositors to get in touch with us about this so we can help and advise on the threshold, depending on the access level the data will be available under. It's also important to look at string variables to ensure they don't contain any problematic information — meaning any personal information, or any possibly litigious or commercially sensitive information, that we would need to address before we can publish. Okay, so finally, step four. Once we've identified the indirect identifiers and any sensitive information in the data, what do we do about it? What are our anonymization techniques?
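Before looking at the techniques, here is a minimal sketch of the frequency check from step three: counting each category of an indirect identifier and flagging anything below a threshold. The records, variable names, and the threshold of 5 are all assumptions for the sketch — as noted above, your archive may mandate a different threshold:

```python
from collections import Counter

# Illustrative records -- variable names and values are invented.
records = [
    {"region": "East", "ethnicity": "White"},
    {"region": "East", "ethnicity": "White"},
    {"region": "East", "ethnicity": "White"},
    {"region": "East", "ethnicity": "White"},
    {"region": "East", "ethnicity": "Black Caribbean"},
    {"region": "North", "ethnicity": "White"},
    {"region": "North", "ethnicity": "White"},
]
THRESHOLD = 5  # assumed threshold -- check with your archive

def small_counts(records, variable, threshold=THRESHOLD):
    """Return categories of `variable` whose frequency falls below the threshold."""
    freq = Counter(r[variable] for r in records)
    return {value: n for value, n in freq.items() if n < threshold}

for var in ("region", "ethnicity"):
    flagged = small_counts(records, var)
    if flagged:
        print(f"{var}: small counts {flagged}")
```

Here the single "Black Caribbean" respondent and the two "North" respondents would be flagged for one of the treatments described next (recoding, suppression, and so on).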
So Maureen already mentioned aggregation — this is to reduce precision, for example from a very small village with maybe a few hundred people to a town or city. We also have the option to recode categorical indirect identifiers into fewer categories, and we're actually going to have an example of this on the next slide. Then there's suppressing specific values of indirect identifiers for some units. Suppression is an SDC (statistical disclosure control) method of addressing those small counts. It's quite — I wouldn't say advanced, but we don't necessarily have the time to go into it in a lot of detail. If you are interested in SDC there are multiple sources of information out there if you want to read more, but we can't go into so much detail about it here. Next, generalizing the meaning of text variables: replacing potentially disclosive free-text responses with more general text. We mentioned string variables on the previous slide — wherever we have open-ended answers in surveys, we need to make sure there's nothing there that is problematic. And restricting the upper or lower ranges of continuous variables to hide potential outliers — age is a very good example here. On the screen we have recoding into "over 70", but this of course depends on each individual data set; we need to check frequencies to see what that recoding would look like. So for example, if we only have two people who are over 70 — say one is 74 and one is 75, or both are 74 — then we'll probably just recode that into "70 plus" to hide those outliers, depending on the threshold we've already discussed. How to decide? By checking frequencies for all the indirect identifiers that we mentioned on the previous slide.
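The suppression technique mentioned above can be sketched very simply: any value of an indirect identifier that occurs fewer than `threshold` times is set to missing. The variable name "occupation", the data, and the threshold of 3 are all illustrative assumptions, not a prescription:

```python
from collections import Counter

# Minimal sketch of value suppression for small counts: rare values of an
# indirect identifier are replaced with None (i.e. set to missing).
def suppress_rare(values, threshold=3):
    freq = Counter(values)
    return [v if freq[v] >= threshold else None for v in values]

occupations = ["teacher", "teacher", "teacher",
               "astronaut",              # a rare, potentially identifying value
               "nurse", "nurse", "nurse"]
print(suppress_rare(occupations))
```

Real SDC tools are considerably more sophisticated than this — they weigh suppression against information loss and combinations of identifiers — which is why the advice here is to read the SDC literature or liaise with the archive.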
Then anonymizing geo-referenced data — of course, point coordinates can be problematic, especially for example with someone's house — so we replace these with non-disclosive variables. And this step would usually take us to having a pseudonymized or an anonymized data set. Of course, there are different levels to pseudonymization and anonymization depending on the access level that we're going to publish that data under, but that is why we advise data producers to get in touch with us, because depending on what the access level will be, we can pseudonymize more or less — so we can advise on that. Okay, as I mentioned, we have an example. In terms of recoding categorical variables, we have four variables here on the screen: age, gender, profession and ethnicity. Just looking at it, there are some obvious issues there — for example, just looking at the age, we have one potential outlier, right? So what do we do in this case? We can top-code to hide potential outliers: as you can see, we've replaced 118 and 89 with "80 plus". And we can also recode the ethnicity variable into fewer categories, so there will be less precision, less detail — so "Black Caribbean" became just "Black" and "White Irish" became just "White". Less precision. And then I think this is back to you, Maureen, for this example.
Yeah, so I think this is a good example of thinking about the different levels of anonymization with qual data. You can see what the raw source looks like and the level of detail that's in there — in the greater context these would be really important details to know. But then of course you have the issue that if you're sharing the data, or if you're trying to publish an extract, some of these details will be important, but you need to minimize the possibility of disclosure. So you might think about what level of anonymization is needed for the specific kind of output. You might look to de-identify, so you take away those direct identifiers such as the name — we changed the name in that example — or you might need to think about the pseudonyms that you would use. Then, thinking about the next layer: do you need to reduce other kinds of disclosive details? For example, the specific date the chemotherapy treatment was on — you could take out the specific day, or you could even just give the year if needed. And then you might think about an even more robust kind of anonymization if needed: even taking out the specific age and putting in an age range, or taking out the gender. When Anka and I were looking at this slide we actually had a bit of a debate about at what point do we take out, for example, their identified gender, and at what point do we take out the age and put in an age range. So different people might make slightly different decisions about what details to share and at what points. But I think it's a really good example of how you could think about the different levels of anonymization, and how there may be some kind of output where a higher level of anonymization is needed, or you may be able to retain some of that information because it's really important. So yeah, thank you, Maureen. Okay, so moving on. So far we've gone through a few steps we can take when we're trying to anonymize data, but there's also
software that we can use to make our lives easier. We have a few options on the screen, including one that we have developed at the UK Data Service called QAMyData. This is an open-source tool, so it is easily available online. It produces a "health check" for numeric data: it uses automated methods to detect and report on some of the most common problems that we have seen at the UK Data Archive when we receive numeric data, such as missingness, duplication, outliers and direct identifiers — so what we've already looked at in previous slides. There's also sdcMicro, ARX and μ-ARGUS. We don't necessarily advise for one over the other — if you'd like to use software, it's your choice which one. But sdcMicro is a very useful tool; personally I use it, it's very easy to use, it has a very friendly interface, so really minimal coding skills are needed, and it's very useful when looking at anonymization. So I just wanted to give you some options if this is something you'd consider. And I think this is back to you, Maureen. Yeah, so I mentioned anonymization plans earlier, and I can see there's already a little bit of discussion about the previous example — where, for example, to take out gender, at what point is that a disclosive piece of information versus something that you can share. And I think this all points to, like I said earlier, the art of anonymization. So it's useful to set out from the get-go what your plan for anonymization is.
Gail or Anka — not the worksheet, sorry, the handout of the anonymization plan — could one of you quickly pop that into the chat? Actually, Gail, if you could, because Anka, I'm sharing the slides. Thank you, Gail, that's lovely. So you can see an example of an anonymization plan — Gail's just popped the link into the chat. This is for Pioneers of Social Research, which I mentioned earlier is a kind of unique collection because there was consent from participants to share their real names. So you can see in this example what our plan for anonymization was, devised before setting out to digitize and prepare the collection for dissemination. There are some broad categories: a bit of background about the project, how the data files are to be managed, what the potential direct identifiers are, what our potential indirect identifiers are, and what the procedure would be for changing those. And I do like this example because it's one where participants have given their consent to share personal data, but nevertheless we did still have to come up with a plan to review any areas we thought could be deemed problematic or sensitive, and what kind of approach we would use. In this case we tended to do a little bit of redaction — usually no more than a few lines, and just in a few places throughout the entire collection. I'm trying to remember now — was it 43 life history interviews? — and it was only a few interviews that were actually affected by this. We would have opted for slight editing over redaction if the situation allowed for it, but in the few places where we did have to edit the transcript for purposes of anonymization, they were sharing details that simply couldn't be shared, so in this case we did just have to redact. But it was a very light touch,
only a few places, and it was always notated within the transcript with brackets saying, you know, "this transcript has been edited to remove some details". So hopefully this gives you a clear visual of what you might put together before actually carrying out your anonymization, so you have a clear idea of what choices you want to make. For example, I could see in the discussion people talking about the redaction of gender. Perhaps you feel that gender is actually really important to the value of the data. So if you feel it is a disclosive detail, what is the plan? Is there another way of indicating what you need to convey about the data? Or do you keep it in and ensure that other details — which, combined with gender, could be disclosive — are anonymized more? So do you look for other indirect identifiers and ensure that those are the ones you either use pseudonyms for or edit slightly? It just gives you an idea of how you can set that plan out so you have a consistent approach across your data sets. All right, and on the next slide I've got the summary, hopefully, of what you can gather from what we've talked about so far, which is to take what we call a three-pronged approach to protecting participants: to think about consent and access in conjunction with anonymization. So if you are planning on sharing the data — whether that is sharing excerpts within publications or sharing the data sets themselves, and I know a lot of publishers now are requiring researchers to share data sets before publication — make sure you are asking for consent to share the data. Researchers must inform their participants about the risks and the benefits of data sharing, and they must think about any obligations they have, whether to have audits run on the data or obligations to share if they're going to publish. Then, try to anonymize
only if the damage to the data is minimal. This includes really thinking hard about, for example, audio clips or images or video — I know somebody in the Q&A has asked specifically about audio. Those can be very difficult to anonymize simply because so much damage is done to the data, so really think about what your strategy is, then, to ensure participants are protected. And then think about regulating access. If you archive the data with a trusted repository like the UK Data Archive, there is an end user agreement in place, and there are options like an embargo, or requiring permission from the data depositor in order to access the data. Using all three of those strategies will enable you to share most data. Perhaps there are some very select cases where there is a high risk to participants, but most data can be shared if you have consent and access in place alongside anonymization. So, yeah, I think it's back to you, Anka, now. And I think we've just reached pretty much the end of the slides, and now we just have a few tools and resources to mention to you. On this slide we have tools and templates. We have several templates on our website that you can just download and adapt to your project, such as a model consent form, a survey consent statement, a transcription template, a data list template — that's the type of documentation that is uploaded with qualitative data — and so on. So do have a look at these in your own time. Further resources: as I mentioned, if you're interested in reading more about aspects that we didn't have time to cover in today's session, there are some resources for you. Of course, get in touch if you have any questions. We also have a YouTube channel where you can see recordings of past workshops, you can tweet us, and you can visit our website — and of course the slides will be available on our
website as well. Upcoming events: we have recurring workshops every spring and every autumn, and some of the topics are on the screen. If there's something that you're interested in, we've linked the events page at the bottom, where you can learn more and also register for these workshops. And now we've reached the exercise. We have a quick Mentimeter exercise — don't worry, it is anonymous and we're not trying to test you in any way. It's just to start a conversation around some of the topics we've already looked at today, and add to those. It usually ends up being a really interesting conversation for the next 10 minutes or so. Okay, so the first question is really just for us to get an idea of what types of data you are working with. What type of data are you looking to anonymize, or have collected, in your current project? Is it quantitative, qualitative, both — or maybe it's too early on and you're not sure, which is really fine. I see we have over 100 people now, so that's great. I see the most popular answer is "both", so that's great, although I see the qualitative box is significantly higher than the quantitative. But yeah, that's good, thank you very much — it's just good for us to have an idea. So this next one is really just for you to submit your responses: what information would you think about when talking about anonymizing data? Okay — how to manage gender and ethnicity, okay. Personal information, that's very good. Names — I see names. Who am I talking to. Ensuring participant safety — right, that's very important, I'm so glad someone put that there. Risk of re-identification, yeah. Organizations, yeah, very good — so we're not just talking about individuals' personal information. Personal information, data uses and data
sharing — I saw that, that's very good. Workplaces, yeah. Reassuring participants — yes, that is very important. Data protection impact assessments — yeah, DPIAs, that's very good. What has been consented to — yeah, very good. Company information — right, these are great. We weren't expecting this much detail, but this is great. Okay, so I'm just going to move on to the next one, thank you very much — this all looks very good. Consent, aggregation, not identifying individuals, how data can be used when output, protected characteristics — all right, very good, thank you very much everyone. Okay, thinking about direct identifiers — we have just done this in the presentation, we had examples — and I see names as the first example. Yes, names, that is great, thank you very much. IP addresses, yes. Gender is not a direct identifier — gender is an indirect identifier. Social security number, yes. I see quite a few mentions on the screen of things that are not direct identifiers. So age — age is not a direct identifier; there are so many other people out there with the same age as me, so that is not pointing to me. Bank details — yes, that's a good example. Religion — that is not a direct identifier. But most of the ones I see on the screen are very good: email address, yeah; home address, very good; address, national insurance number, NHS number — very good, thank you very much. And I think the next one is indirect identifiers. Some very good examples on the screen: gender, age, occupation, religion, race, geography, place of work, date of birth, yes; employer, yes — these are very good examples; job role, yeah. Okay, this is great, thank you. Now some more specific questions, if you will: which is the most disclosive? We have IP address, postcode, gender and religious belief.
Okay, so by "most disclosive" here, what we were really aiming for was: which are the direct identifiers? In this example that would be IP address, as well as postcode — because the postcode, especially the full postcode, can narrow things down quite a lot in terms of identification. I see the most popular answer is IP address, so that's great, followed by postcode — very good, thank you. Let's move on. Is someone's job title personal information — so, a direct identifier? This is a yes or no answer. I see the most popular answer is no. This is not a trick question, but it's an interesting thing to think about: whether someone's job title can point to an individual. And the answer here is that it depends, right? It depends on whether someone else out there has the same job title. We actually came onto this because I had a colleague at some point who googled their job title, and Google actually spat out their name. So in that situation, the job title would be considered personal information. So again, thinking about context, and everything else we know about someone — or about a unit of observation — putting it in context can tell us whether it's personal information or not. Okay, thank you very much. Select the direct identifiers: we have job title, email address, gender, age, party affiliation and national insurance number. I see the two most popular answers were email address and national insurance number. We also have a few of you who selected job title, probably because of the previous question — that's very good. So yeah, we were going for email address and national insurance number. Of course, job title can potentially be a direct identifier depending on context, but for this specific example the clear direct identifiers are email address and national insurance number. Okay,
thank you. Select the indirect identifiers: we have geographic coordinates, date of birth, gender, supermarket preference, sexual preference and ethnic background. There's a bit of debate going on in the chat about these, which I think just reinforces how anonymization can be a bit of an art form. When we select these answers, we're doing so from our perspective as archivists, looking generally across collections, but there's always going to be an exception to the rule, isn't there? So it's important to take your whole project into account — but generally speaking. Okay, thank you. I see the most popular answers were ethnic background and gender. What we were really going for was date of birth, gender, sexual preference and ethnic background, and I'll explain why the other two were not selected here. Geographic coordinates would actually be a direct identifier — especially if we have the latitude and longitude of someone's house, for example — so that wouldn't be an indirect identifier. And then supermarket preference: whether I prefer one supermarket over another is not necessarily something that could help identify me or piece together who I am. These sorts of lifestyle preferences are not, in our experience, considered identifiers — but again, it depends on the context. I'm sure there's some conversation in the Q&A that for some projects that might be something you consider to be an identifier, in which case, yeah, that's fine. But in our experience, supermarket preference is not something we have pointed to as an indirect identifier. Okay: de-identification is redacting direct identifiers — okay, so we
have seen this in the slides. Is this enough to consider the data anonymized? Yeah, the correct answer here was no. De-identification just takes out the direct personal information, but there's still quite a lot of detail in there — indirect identifiers, sensitive information — still present in the data. So no, that would not make the data anonymized. Okay, moving on — and I think this is you taking over, Maureen. Yeah. Data collected on a sensitive topic is always sensitive data: true or false? Most of you are saying false, some of you are not sure, a few of you say true. And the answer, I would say, is false. I talked about that Managing Suffering at the End of Life collection, where it was sensitive data because they're talking about a medical condition. But when we talk about sensitive data, it is a very specific list within the law — and that doesn't mean your data may not be on a sensitive topic. So I think it's important to distinguish data from topic, and I think that clarifies what some of your legal obligations may be to anonymize, versus what some of your ethical decisions may be for anonymization. Okay, the next one, another true or false: it is good practice to annotate anonymization in qualitative research, for example with square brackets or an anonymization log. And thank goodness, all of you are answering true. Excellent. Yes — if you are depositing your data somewhere for others to reuse or to see, it's really important that you have annotated wherever you have either edited or anonymized certain details. Although anonymization may lead to information loss, if you at least point to where information has been lost, it helps contextualize some of that. So it's a really important thing to annotate. Excellent. Which of the following strategies can be used to protect participants'
identities — tick all that apply. This is a multiple choice, and we've got: gaining informed consent, controlling access to the data, and anonymizing the data as soon as possible. Most of you have said anonymizing and controlling access, with a few more of you adding in the consent. Okay, so the answer to this one — Anka, I don't think we have an automated answer for this one; for some reason it doesn't show the percentages, just your results — but we were going for all three, right? The answer is all three. I know some of you have not ticked informed consent. Just to say: regardless of anonymization, to collect data on people — particularly if it's personal data of any kind — you do need consent; it's a legal obligation. So you have to have consent, and when you get the consent, it's really important to design your consent form so that you're thinking about all of the places where you might share data, or bits of data, and what kind of formats those would take, so that your participants are fully informed. So you should tick all three, definitely. And then once the data has been collected, you can also control access to the data and work through your anonymization strategies. And what would be one of the first things you do when anonymizing data? I think this is back to you, Anka, isn't it — thinking about the steps of anonymization. So: assessing the audience — thinking in advance about who is going to see it, am I going to share it with someone, am I going to share the data at the end of the project. Identifying my role — as data owner, as data controller, am I also going to share the data, am I going to be the person who prepares it, etc. Considering access control options. Identifying direct identifiers and beginning to change or redact them. Or mapping the data — identifying what is personal data, sensitive data, etc. Okay, so I see the most popular answer is mapping the data: identifying what is personal
data, sensitive data, and so on. Also identifying my role — being aware of the fact that I'm probably going to be a data controller as the person collecting the data, and having responsibilities very clear, especially if we have, for example, a data management plan where we list the responsibilities: am I going to be the person who deposits the data, the person who anonymizes the data, who transcribes the data, etc. Okay, so thank you — I see the most popular answer was mapping the data. If there are any further questions, we did include our email in the slides — we didn't actually get through the last slide, but our emails are attached there. So if there are any questions that we didn't get to, or any clarifications you'd like to go through, please do email us and we're happy to have a chat. But yeah, I think we can say thank you very much for joining, and we hope this session was useful, and we'll see you in another workshop in the future, hopefully. Thank you everyone, thank you.