Hi, everyone. I'm Maureen Haker. Shall we just introduce ourselves, Anka? I'll go first. I'm Maureen, and I've worked with the UK Data Service for a long time now, about 10 years or so. I specialize in working with qualitative data, so I do a lot of training around this. In my time with the archive I've done everything from the ingest and digitization of qualitative data to handling queries; if you've ever sent a query about qualitative data, I might have been the one who answered it. Anka?

Thank you, Maureen. Hi, everyone. My name is Anka. Same here, I've been with the archive for a long time, about seven or eight years now. I used to work on the ingest side of things, managing our self-deposit repository, ReShare. I still do quite a lot of that, and also training. I think that's it for me. Should we start with the presentation?

Yeah, we'll go ahead and start. I'm just going to turn off my video since I'm sharing my screen. Here we go. So, this is what we're going to do today. We're going to give you a little background on anonymization: why it's important, what some of the theoretical underpinning is, and what your legal responsibilities are. We'll then give a short overview of the practicalities of anonymizing each of qualitative and quantitative data, with some tips specific to each type. And we'll end with an exercise and discussion about de-identifying information, plus some further resources. We've also got a handout to share with you later in this webinar. So, to begin, here is a very brief overview of anonymization theory.
Mark Elliott at the UK Anonymisation Network has published an openly available Anonymisation Decision-Making Framework, and the National Centre for Research Methods has posted extensive tutorials on it. It's not something we're going to go into in much detail, but we wanted to highlight it as a good starting place for talking about anonymization. The framework starts by rejecting the idea that anonymization is a single process done at one point in time, and it outlines three key aspects of deciding how to anonymize, which is the part I think is really important.

The first of these is the data situation audit: considering where you want to present data, what your role and responsibility with the data is, and what the specifics of the data are, that is, what variables have been collected and where they are stored. The next stage is the risk analysis: what are the actual chances that there will be a disclosure? A disclosure is where someone is able to re-identify, that is, attribute a particular characteristic to a specific person. The final stage is impact management: if there is a disclosure, what are the plans for what happens then?

There are a couple of key points to make here. The framework points out that analyzing the risk of disclosure should be iterative, not linear. There is no single point at which it should be assessed; rather, you should think about all of the places where data are stored or presented. The risk of publishing a data extract should be considered alongside sharing data with colleagues across institutions or storing data on your computer. The other point I want to reiterate here is that Elliott stresses that it's not possible to fully anonymize data.
Fully anonymizing would mean that even a participant looking at the data would not be able to identify their own answers. Stripping the data down to that point really reduces its value for any kind of analysis, so it's not something we actually want to do within research. To avoid depleting the value of the data, even comprehensive anonymization will still leave at least some, albeit theoretical, space for re-identification. I know there have been some newer movements toward things like synthetic data, but here we're thinking of datasets where participants' actual answers are present. The idea is to balance the risk of disclosure, the probability of re-identification, to a point where a disclosure can be mitigated and dealt with should it happen. So fully anonymizing, or guaranteeing confidentiality, isn't really possible, or sometimes even desirable, within research, as it means stripping data down to something that is close to worthless. I'm going to hand over to Anka now, who's going to talk a little more about what anonymization is and what the different approaches are.

Thank you, Maureen. Maureen already mentioned the idea of full anonymization and of completely stripping data of all valuable information to reach that fully anonymized stage. So I think it's important to take a step back, discuss what we mean by anonymization, and introduce the concept of disclosure. By disclosure, we mean identification of a research participant, be it a human participant or a company, etc. It happens when someone is able to identify a data subject from the information they have access to, be it from one source or from multiple sources. Here we're also referring to data linkage. There are, of course, different types of disclosure.
The real distinction there is the source, and what we need in order to identify a data subject. Are we able to identify them from a single data source, where it's very clear that this observation is person X? Or are we able to link that data with another set of information out there that, in conjunction, provides enough information to identify someone?

So, what is anonymization? It's a process that attempts to prevent the disclosure, or identification, of any data subject from a specific dataset. Alongside anonymization we're also introducing the concept of pseudonymization, and we're going to look at the difference between the two on the next slide. They're both part of statistical disclosure control, which we'll refer to as SDC. The aim of SDC is to minimize or mitigate the risk of identification to an acceptable level while still maintaining some data utility: we still want to be able to use the data to its full potential, or as close to it as possible. So really we're trying to balance the act of removing detail from the data against maintaining data utility. That brings us to the last point: as the disclosure risk goes down, because we are removing information from the data in the process of anonymization, the more information loss we have, and the more data utility is affected. Next slide. Also, I saw a comment in the chat about speaking a bit slower; I tend to speak very fast, but I will try.
I mentioned on the previous slide that we'd touch on the difference between anonymization and pseudonymization, so here are some definitions from the GDPR. Anonymized data is information that does not relate to an identified or identifiable natural person, or personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. That is, as Maureen mentioned earlier, even the participants themselves cannot re-identify their own answers in the data. That is the full anonymization stage, the one we said is not very desirable from a data utility point of view.

With that in mind, we move to pseudonymized data. We've included the definition here, and I won't read it out since you can see it on the screen. This is data that still contains some of that valuable information: it hasn't been anonymized to the point where it's no longer usable, or reusable to its full potential in research. It would still have some of the indirect identifiers, which we'll see in a second: demographic information about the research participants, for example, or information about their educational background or economic status. I see there's another comment about slowing down a bit; I will try. So, pseudonymized data still has some of that detail left in, so that we can maximize its use, and we'll see some examples further into the presentation. Next slide, Maureen, please.

Okay. Again, looking at this difference between anonymization and pseudonymization: according to the ICO, re-identification is the process of turning anonymized data back into personal data through the use of data matching or similar techniques. All right.
We were talking about identification earlier, and about how identification or re-identification is achieved. When we are able to identify someone, that data automatically becomes personal data, even if it had previously been redacted to some degree in an attempt to reduce the disclosure risk. Now, the DPA, the Data Protection Act 2018, which is the UK legislation implementing the GDPR, does not prohibit the disclosure of personal data, but any disclosure has to be fair, lawful, and in compliance with data protection principles. What this means is that it's not the case that personal data can never be disclosed; rather, if it is, it needs to be under certain conditions, for example, if we have consent from the participants. I'm not sure if I should be answering questions now, but I can see them on the screen and I didn't want to move on without clarifying that abbreviation.

So, the example of consent. We have data collections at the UK Data Archive where personal data is present in the data: you would have names in the data, and it would be very simple to identify a person. But that data has been shared in a fair and lawful way, and in line with data protection principles, because there is consent in place. So if we are sharing personal data, and I know that for some research it is important to share that information, it is possible to do it, but we need to do it in a fair and lawful way.

Things to consider here include the age of the information. Over time, data generally becomes less sensitive, but we still need to consider the ethical implications.
For example, if we're accessing data that was collected, say, 30 years ago, and someone in the dataset is already deceased, then identifying them is, of course, less of an issue. However, there might be relatives of that person who could be affected by someone being able to identify them in the dataset. So we need to consider all the aspects of identification. We need to consider the level of detail, which brings us back to the difference between anonymization and pseudonymization and how much detail is left in the data that we share. And we need to consider the context: is the information about someone's private life, or about more public matters such as their working life or life satisfaction? Certain aspects of someone's working life may not be considered disclosive or problematic, so we need to put everything in context. That's why, when we receive data at the UK Data Archive, we need the metadata, the information about the data, the documentation, to better understand the context in which the data was collected and everything that comes along with it.

And then the rule of thumb: try to assess the effect that a disclosure would have on any individual concerned. I think this is pretty self-explanatory. If you were in that situation, if you could be recognized from the data, how would you feel?
Try to use your best judgment, and of course there's always guidance; if you're unsure about something in the process of anonymization or pseudonymization, do get in touch and we can advise. Next slide, please, Maureen.

Okay, moving on to classifying information and the different types of variables and information we can find in data. I mentioned on the previous slide that pseudonymized data still has some information remaining, information that will be valuable for future reuse, and I said that indirect identifiers would be discussed in a later slide; here it is. These are the different types of information that may still be present in the data, or that may be taken out or reduced by redaction, a process of reducing detail. We'll look at how that can be done in a few examples later on.

For now, to classify this information: first, there is what we refer to as direct identifiers. This is information that directly identifies data subjects: names, addresses, social insurance or national insurance numbers, IP addresses, NHS numbers, etc. Then we have indirect identifiers, which in SDC are also called key identifiers. This is information that in combination may uniquely identify subjects. It's not that any one of them points to someone directly, but if a dataset has the examples here, gender, age, religion, occupation, income, and a couple of others, then in conjunction with, say, the electoral register, we might be able to identify someone. That is why these are called indirect identifiers: they don't point to someone in particular, but all put together they can potentially help identify someone.
Pseudonymized data often includes these key identifiers; exactly which ones varies with the information collected across different projects. Then we have sensitive variables: information that is subject to legal and ethical concerns. Examples here are criminal history, sexual preferences and behavior, political affiliation, medical records, and income. Notice that income appears both under indirect identifiers and under sensitive variables; a variable can sit in more than one category. Someone might well consider income a sensitive variable, particularly if we have the exact amount rather than a rounded number: 32,157, for example, is a very exact figure. I mentioned redaction earlier; by redaction here I mean that we would round that number, reducing the level of detail. We'll have another example later. I've also linked an article called You're Not So Anonymous, which I think would be a very good piece for you to read in your own time. Maureen, the next slide, please.

Okay. To further understand this spectrum of data reduction and the different levels of anonymization, we put together this slide with several arrows on the screen. Each arrow refers to a different concept we might encounter when thinking about data anonymization. On the first arrow, reading left to right, we start on the left with raw identifiable data. This is the data as just collected; it contains all the information, including personal data. As I'm sure you know, we should only collect the data we actually need, especially when it comes to personal data: if it's only for administrative purposes, collect only what is absolutely necessary.
So don't collect personal data that you don't need. Raw data, then, is the data that contains all of that information. As we move to the right, we have de-identified, pseudonymized, and anonymized. De-identified refers to data that has had the identifying information removed, and by identifying information I mean personal data, data that points directly to someone: the examples from the previous slide, so names, addresses, email addresses, IP addresses, etc. With all of those taken out, we have de-identified data. Moving further along, we have pseudonymized data. As we discussed, this is data that has been further redacted from de-identified data, to the point where you'd have the rounding I gave as an example for income, or, instead of exact ages, banded ages. By banded, I mean recoding the variable into categories, say 5 to 10, 10 to 15, 15 to 20, reducing the detail in the data. We'll have more examples later. After pseudonymized, we move on to anonymized, where we strip out even more detail. Once we reach full anonymization, as we discussed earlier, there isn't really much data left, and the data utility, as we see on the second arrow, is close to depleted.

Then we move on to the second, third, and fourth arrows: data utility, access controls, and information loss. We've already discussed data utility: as we move back from fully anonymized towards pseudonymized and de-identified, data utility goes up. The more detail there is in the data, the more useful that data will be.
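As a concrete illustration of the banding and rounding just described, here is a minimal sketch in pandas. The column names, band boundaries, and values are invented for illustration; an archive may specify different bands or rounding rules.

```python
import pandas as pd

# Hypothetical survey extract; all values and column names are invented.
df = pd.DataFrame({
    "age": [7, 12, 19, 44],
    "income": [32157, 18940, 51200, 27005],
})

# Band exact ages into categories such as 5-10, 10-15, 15-20.
df["age_band"] = pd.cut(
    df["age"],
    bins=[5, 10, 15, 20, 65],
    labels=["5-10", "10-15", "15-20", "20-65"],
)

# Round exact incomes to the nearest thousand to reduce detail.
df["income_rounded"] = (df["income"] / 1000).round() * 1000

# Drop the exact values so only the redacted versions remain in the shared file.
df = df.drop(columns=["age", "income"])
```

Note that `pd.cut` uses intervals closed on the right by default, so an age of exactly 10 falls into the 5-10 band; a real banding scheme should state its boundary convention explicitly.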
The same direction applies to access controls, because the more detail there is in the data, the more access controls we will need. Anonymized data can probably be made available under open access, and we'll look at the different access levels later on. But as we move from anonymized towards re-identifiable data, the more access controls we need in place. And for information loss, as we discussed earlier, the more information we strip out of the data, the higher the information loss. I hope this has been a useful slide for understanding these concepts, and I think we're ready to move on to the next slide, Maureen.

Okay. Sorry, I have to double click every time I want to change a slide; I don't know why. Anyway. On that point, it's probably worth also mentioning that researchers still fall under some legal obligations to disclose. In the UK, researchers have a duty of confidentiality. However, that duty is not absolute, and it's not protected by legal privilege in the way it might be for, for example, solicitors or medical doctors. Specifically, researchers are still obliged to inform appropriate authorities where there is abuse of children or vulnerable adults, or crimes covered under terrorism prevention legislation, such as money laundering. From my own experience: for my PhD research, I went through the Health Research Authority for ethical approval, and they specifically scrutinized the consent forms for clear statements clarifying the limits of confidentiality.
This included thinking about the places where raw data would be presented, such as publications, and also informing participants that I could actually break confidentiality. Including all of that information, of course, meant that my information sheets and consent forms were quite comprehensive in terms of what I had to discuss with my participants, and here you can see a sample of the text I used. So the point that anonymization is a bit of an art, and not absolute, is one of the things I hope you take away from this.

The other point I want to make is about other aspects that can affect anonymization. Here is an outline of the different access levels we use at the archive, something Anka mentioned on the previous slide with the arrows. You may need to think about who has access to the data. Sharing data doesn't mean it's just openly available. Most of our data is safeguarded, meaning users have to agree to our terms and conditions in order to reuse the data, and those terms and conditions stipulate that even in the unlikely event of re-identification, reusers will not share the identities of participants. There are other, more restricted levels as well. These include permission-only access, where the depositor needs to approve the reuse of the data, and controlled data, where you have to use the data within our controlled environment and we then perform checks on your outputs. There is a little more detail about access levels, legal obligations to disclose, and anonymization in our Managing and Sharing Research Data book, published by SAGE. The UK Data Service also publishes resources online, and you can find us on social media, where we tweet updates and other events of interest. But moving on to the practicalities of anonymizing.
Qualitative data can be particularly tricky to anonymize. Its very nature means it's full of indirect identifiers, particularly when you have rich biographical data. The tips here are more focused on direct identifiers, like personal data, but a full risk assessment should be done any time you share the data or excerpts of it.

When you are anonymizing qualitative data, you should ideally anonymize at the time of transcription, unless you need to link your data or you have explicit permission from your participants to use unanonymized data. You should aim for consistency: consider writing an anonymization plan which details the broader areas you're seeking to anonymize or check. We'll see an example of an anonymization plan later. You should identify replacements in the text, and usually this is notated with square brackets. I have seen other notation, like color coding, but brackets are what's most often used: whatever text you edit or take out is notated in that way, and even a pseudonym would go in square brackets. You may want to consider using an anonymization log, especially if you don't plan on keeping the original unanonymized data, so that if you ever need to go back and check what the original said, the log preserves that information. Within the archive, the unanonymized version is kept, but it's never released; we hold it on protected servers. I've heard it called a shadow collection. If you have appropriate security for unanonymized data, keeping it may also be an option for you. You should avoid redacting: use pseudonyms where possible, as this helps keep the relationships within the data intact. And you should avoid overanonymizing.
So think about aggregating variables: towns to larger areas like regions, for example. Or it may not be necessary to remove an entire date; perhaps you can keep the month and year. Try to keep some level of detail where possible. One point we'll come back to later is that it's better to control access than to overanonymize. The detail of qualitative data is where its value comes from, so it's really important to find the balance, and controlling who can see which data is potentially a better option than removing the detail altogether.

Here's an example of a transcript. You can see the name is replaced with a pseudonym, and that pseudonym is then used in place of Lucas throughout the text. The date and town are aggregated to a larger level. And you can see that once the interview starts, some of the biographical details are left in. Those are potential indirect identifiers, but it may be important to keep that detail for research, and for this particular collection it was quite important to have some of those biographical details. So instead of taking those details out, a better approach is to ensure that your participants are informed about how the data will be used. For example, the top text for consent talks about using extracts of interviews and photographs in various outputs and about how the interviews would be archived. The example below is from the participant information sheet for my PhD research, which explains that I would use quotations and narrative themes in outputs, and describes who has access to the data. Again, since I got my ethical review from the Health Research Authority, the HRA could run audits on my data, so I had to keep participants informed that the data could potentially be checked by other organizations. And another way to protect participants is to control access conditions.
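The transcript edits just described, bracketed pseudonyms plus an anonymization log, can be sketched in a few lines of Python. The names, replacements, and transcript text below are all invented for illustration.

```python
# A minimal sketch of consistent pseudonym replacement with an anonymization log.
# All names and replacement values here are invented for illustration.
replacements = {
    "Lucas": "[Peter]",               # pseudonym, notated in square brackets
    "Colchester": "[town in Essex]",  # town aggregated to a larger area
}

transcript = "Lucas moved to Colchester in 1989, where Lucas opened a shop."

log = []  # records each original value and what it was replaced with
anonymized = transcript
for original, replacement in replacements.items():
    if original in anonymized:
        log.append((original, replacement))
        anonymized = anonymized.replace(original, replacement)
```

A real plan would match whole words only (for example with regular expressions) and would keep the log, like the unanonymized original, on protected storage separate from the shared transcript.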
The excerpt of the anonymized interview shown earlier, this one, comes from a collection with quite specific access conditions. The collection contained interviews and diaries; some of those interviews were available as standard safeguarded data to our registered users, while a few others were embargoed, held back and inaccessible until 2015, so until a certain length of time had passed. Anka mentioned that data can become less sensitive as time passes, so you might want to think about keeping data under embargo until sufficient time has passed.

This next slide shows a very different example, a collection called Pioneers of Social Research. Paul Thompson conducted oral history interviews with leading social scientists, which included extensive details about their childhoods, education backgrounds, and careers. Because of the relative fame of his participants, leading sociologists, historians, and anthropologists, all well published and well known within their disciplines, anonymization was really a fruitless exercise and actually even problematic in this case: you needed to know who they were. So instead of anonymizing, Paul Thompson sought explicit permission from his participants to use their names. Having said that, it doesn't mean we didn't have a clear anonymization strategy in place. When digitizing and reviewing the transcripts, we still looked for issues where participants talked about the details of things like closed court cases, medical conditions of others not involved in the study, or potentially reputationally damaging matters. So we still felt there were clear ethical boundaries, even though we had consent in place covering our lawful use of personal data. One final example is Jane Seymour's Managing Suffering at the End of Life.
In this study, Seymour interviewed family members and carers who had experienced a loved one going through long-term sedation as a palliative care measure to manage pain and anxiety at the end of life. This was an ESRC-funded project, so there was a mandate to archive the data where possible. Because this data included extensive sensitive personal data, a specific category under data protection law, the consent for the use of the data had to be managed quite carefully, and the UK Data Service worked on this with the Health Research Authority, who gave the ethical approval for the project. We worked together to collect consent to share the data a few months after the death of the loved one, as it would have been too difficult to give informed consent at the time of the interview; participants were given the opportunity to take some time away and then make a decision.

I wanted to point this one out specifically because we often get queries about sensitive data, which can be quite problematic to share. Sensitive data means something very specific in legislation, and often the data in question isn't actually sensitive under that definition, though that doesn't mean it doesn't feel intimate to participants. But even in cases of genuinely sensitive data, it's still possible to anonymize effectively and share the data when consent and access are also considered, and Seymour's project is a great example of where this has happened. All right, and I think we're back to Anka.

Thank you, Maureen. Okay. So Maureen focused on anonymizing qualitative data, and I will be focusing on quantitative data in the next few slides.
If we think of anonymization in steps, from where you start to delivering a final dataset, certain aspects can be similar for both quantitative and qualitative data. The first step is always to identify and remove or redact identifying information: the direct identifiers we spoke about, or personal data. Of course, this should be in line with what the participants agreed to in their consent forms, if you're using consent as a processing ground. This step is arguably easier for quantitative data, because it's a matter of recoding variables or removing them completely; it can vary more for qualitative data, as Maureen mentioned, especially when it comes to indirect identifiers. Next slide, Maureen, please.

Okay. Once we've tackled the identifying information, the next step is to identify the indirect identifiers. Depending on your project and what information was collected, this would be age, the exact date of birth if that was collected, gender, occupation, income, geography, ethnic background, religion, et cetera. We need to identify these, and then, already at this step, think about how we're going to share the data. If we're planning on sharing it in an archive which is going to set up appropriate access restrictions, we need to keep that in mind, because if we go ahead and start removing and redacting, we might end up overanonymizing. And that is something, again, we don't want to do, thinking back to the concept of data utility we saw earlier.
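Step one for a quantitative file, stripping out direct identifiers, might look like this minimal pandas sketch. The column names and ID scheme are hypothetical.

```python
import pandas as pd

# Hypothetical raw file containing direct identifiers alongside survey answers.
df = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "email": ["a@example.org", "b@example.org"],
    "age": [34, 51],
})

direct_identifiers = ["name", "email"]

# Replace the identifying columns with an arbitrary participant ID;
# any name-to-ID key, if one is needed for linkage, stays on protected storage.
df["participant_id"] = [f"P{i:03d}" for i in range(1, len(df) + 1)]
df = df.drop(columns=direct_identifiers)
```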
So ideally, really, at this point, if we're planning to share the data and we are, you know, very much concerned with maintaining that data utility, then we'd recommend, if you're not sure, just contact the archive or the repository you're planning to share with, and consult with them as to how the data would be shared and, you know, what their access restrictions are, so that you can anonymize and redact accordingly. Right, so here it's very important to have good quality metadata for this process. So, you know, having variable labels and value labels that are comprehensible and intuitive is very important in the data as well. Okay, moving on, Maureen. I think step two, okay. And then in step three, again, this is going further. After we've identified those indirect identifiers, we need to check the frequencies to identify potentially disclosive information. So what we mean by this — we probably can't go into as much detail as would be needed, because we would need to go into, you know, the details of statistical disclosure control and so on — but the idea here is to avoid having small counts in your data, right? So thinking back to that situation where you have indirect identifiers in a dataset: you know, the age, the income, the gender, the educational background, et cetera. And you have certain categories when you group all this information together — people from certain ethnic backgrounds, for example. And I'll just give an example: if you only have one person from a specific ethnic background, who lives in London, who has, you know, a dummy variable for diabetes that is positive — if you have that combination and you have only that one person, that is disclosive and potentially identifiable, because there might be someone out there that thinks, oh yes, I think I know this person, right? So that is an issue.
That is why we are checking frequencies — to make sure that there are no small counts, and if we do see any small counts, then we should redact accordingly. And this is again going back to that, you know, advice of consulting with the archive, because they would be able to tell you what threshold they're working with. For example, they would tell you there shouldn't be any counts lower than 10, or there shouldn't be any counts lower than 20. There are different sorts of thresholds that are used, so that is again important to check with the archive. Then checking outliers, if any. So again, these could be disclosive. Outliers, just to explain the concept — an outlier is just a value in the data that sits at the extreme, towards the min or the max, right? So for example, take age. Say we have someone in the data set who is 115 years old. That would be an outlier, because it's just so far from the mean, from what you'd expect. It's just something that stands out, right? That's what an outlier refers to. So these could be potentially identifiable if they are outliers. And finally, check string variables. So if you have in your data set any sort of open-ended questions — so people, you know, insert text — or if you have an 'other' option in your multiple choice answers, check that, because it might be that someone includes some disclosive or sensitive information in there. So for example, 'I worked for a specific company for 30 years' or 'my brother has a rare type of disease'. These are examples — I have seen names included in open-ended string variables in data sets. So please do have a look at that and make sure that there's nothing there that can cause issues. Okay, next slide, Maureen, please. Okay, so now thinking of anonymization techniques.
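The small-count check described above can be sketched in a few lines of Python: cross-tabulate the indirect identifiers and flag any combination that falls below the chosen threshold. The column names and the threshold of 10 are illustrative — as noted, archives use different thresholds, so check with yours.

```python
# Sketch of a small-count (k-anonymity style) check over indirect
# identifiers. Field names and threshold are hypothetical examples.
from collections import Counter

QUASI_IDENTIFIERS = ("age_band", "ethnicity", "region")
THRESHOLD = 10

def small_count_cells(records, threshold=THRESHOLD):
    """Return every combination of indirect identifiers whose count
    falls below the threshold, i.e. the potentially disclosive cells."""
    counts = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS)
                     for r in records)
    return {cell: n for cell, n in counts.items() if n < threshold}

records = (
    [{"age_band": "30-39", "ethnicity": "White", "region": "London"}] * 25
    + [{"age_band": "80+", "ethnicity": "Other", "region": "London"}]
)
risky = small_count_cells(records)
print(risky)  # {('80+', 'Other', 'London'): 1} -> suppress or recode
```

Any cell this flags is a candidate for the suppression or recoding techniques discussed next; an outlier check is just the same idea applied to the tails of a continuous variable.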
So, you know, I kind of gave some examples already in the previous slides in terms of reducing that level of detail. And I surely recommend, again, contacting the archive beforehand, because we've seen quite a few cases of over-anonymization. It is just so unfortunate and sad when you get a data set that has already been, you know, redacted to a point where so much detail has been taken out that really didn't have to be, because we were going to make it available under access restrictions anyway. But when it comes to redacting the data, having said that, these are some techniques, and Maureen also mentioned the first one: aggregating or reducing the precision, right? So, obviously, if we have a village — thinking about the number of inhabitants — we just aggregate that up to a city, where you have higher counts and bigger populations, so less risk of disclosure. Recoding categorical variables — so those indirect identifiers — into fewer categories. So, I gave the example earlier of ethnic backgrounds, and in some data sets these are just restricted to four or five, for example, so that you have fewer categories and higher counts, so that we reduce that risk of having really small counts for certain categories of the ethnicity variable, right? Suppressing specific values of indirect identifiers for some units — so this might be the case if, again, in an SDC, a statistical disclosure control, sort of setting, we see small counts; it might be that we need to suppress some of these values, and by suppressing we either mean taking the value out or, for example, replacing it with 'smaller than five' or 'smaller than 10', so that we reduce that level of detail — we're not saying it's two, we're not saying it's three, we're not saying it's six, we're just saying it's lower than five or lower than 10. Generalizing the meaning of text variables, so replacing potentially disclosive free-text responses with more general text.
This is pretty self-explanatory. And then restricting the upper or lower ranges of a continuous variable to hide the outliers. I gave earlier the example of age, so, you know, we would look at the frequencies for an age variable in this case. We see what our outliers are — if, say, over the age of 85 we only have two or three observations, then we would cap that at 85. So it really depends on your dataset and obviously the population that you have. So how to decide? Yeah, check the frequencies. And then anonymizing your georeferenced data. So, again, latitude and longitude can be problematic. I remember we had a dataset where, you know, they were using phones to record someone's location. And when we got the dataset, the latitude and longitude were still in there, and we went back to the researchers and asked, is there a possibility that this could be, you know, the latitude and longitude of someone's home? Because they were interested in where people eat, or something like that, I can't remember exactly. So they said, yes, yes, it can be, right? They eat at home. So that, again, is an issue — if they were only eating in different restaurants, that might not be. But, again, this is where that context comes in, because we discussed earlier about context and, you know, understanding what the data is about and whether this is a problem or not in that situation. So, yeah, if it is problematic, then that needs to be replaced with a different type of variable — a different geographical variable. Thank you, Maureen. So here we move on to that example. We have a data set. This is a mock data set. We have four different variables. And in the next slide — if we can already move to the next slide, Maureen — there'll be a bit of back and forth between these two just to see the differences. So here's an example of how we would reduce that amount of detail, right? So for the age variable, as we discussed earlier, we top-coded this to 80 plus to hide the outliers.
As we saw in the previous slide, we had someone who was 118. Also, for the ethnicity variable, we recoded that into fewer categories so we have less precision. And that is another example of something that we can do. Again, we would need to look at the counts, and that would be specific for each data set. So the message there is to look at the frequencies. Okay, moving on... Okay, so here we just have a slide of useful software. So sdcMicro, when we're thinking about statistical disclosure control, which we mentioned — this is a very useful tool to use. It's a package in R, so if you're familiar with R, I'm sure you've heard of it. It is very useful; we use it as well. QAMyData is a tool that we developed at the UK Data Service. It's also open source, and you can download it, and under this link you should find all the information that you need. It also helps with other issues, sort of data quality and data integrity checks, so not just anonymization. It also looks, for example, for duplicate observations and out-of-bounds values. And then we have ARX and µ-Argus, and I'll let you explore those. Again, these are tools that can be useful to anonymize data using software, not just our own judgment. Okay, next slide, Maureen, please. And I think I hand over to you. Yeah, this is me. So, Gail, if you wouldn't mind just popping in the chat that handout that we've got. That would be really helpful. Thank you. So I mentioned earlier about coming up with an anonymization plan. And before we do a quick exercise and answer some questions with Mentimeter, I just wanted to bring up an example of one. So Gail's going to pop in the chat a link to an example of an anonymization plan from the Pioneers of Social Research. And on it, you'll find there are some broad categories.
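The mock-dataset edits just described — top-coding age at 80+ and recoding a detailed ethnicity variable into broader groups — might look like this in plain Python. The cut-off and the recode map are illustrative assumptions, not the coding used in any real collection.

```python
# Sketch of two anonymization techniques from the mock dataset:
# top-coding a continuous variable and recoding a categorical one.
# The cap and the category map below are hypothetical.

TOP_CODE = 80
BROAD_GROUPS = {  # detailed category -> broader category (example only)
    "White British": "White", "White Irish": "White",
    "Indian": "Asian", "Pakistani": "Asian",
    "Caribbean": "Black", "African": "Black",
}

def top_code_age(age, cap=TOP_CODE):
    """Collapse ages at or above the cap into a single '80+' band."""
    return f"{cap}+" if age >= cap else str(age)

def recode_ethnicity(value):
    """Map a detailed code to a broad group; unknowns fall into 'Other'."""
    return BROAD_GROUPS.get(value, "Other")

row = {"age": 118, "ethnicity": "White Irish"}
row["age"] = top_code_age(row["age"])
row["ethnicity"] = recode_ethnicity(row["ethnicity"])
print(row)  # {'age': '80+', 'ethnicity': 'White'}
```

As the speakers stress, where you set the cap and how far you collapse categories should be driven by the frequencies in your own data, ideally after checking the archive's thresholds.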
So there's a bit of background about the project, how the data files are managed, what the potential direct identifiers and potential indirect identifiers are, and what the procedure is for changing any data. And I like this example because it's one where participants have given their consent to use their personal data. Nevertheless, we still put a plan in place to review any areas that we thought could be deemed problematic or sensitive and what kind of approach would be used. So in these cases, it tended to be a little bit of redaction, which normally we don't do, but we try to redact as little as possible, no more than a few lines. And we would have opted for slight editing if the situation allowed for that instead. When we did have to implement this anonymization plan, it was usually around details of closed court cases being shared, that kind of thing. So we did have to redact completely. It wasn't something where we could just use pseudonyms and get away with that. It was details that we actually legally couldn't share. There were some quite significant issues with it, so we did have to redact in just a couple of instances across the entire collection. But hopefully this gives you a clear visual of what you might put together when planning anonymization. And then, to summarize the points that we've made before, I also just want to emphasize this three-pronged approach to anonymization. So rather than thinking about anonymization as something you do once, instead think about how it can affect all stages of research, including carefully considering the grounds on which you're processing the data. So often this is consent. So this should include some sort of discussion on sharing data with your participants. But if you're claiming different grounds for processing, this should also ideally be clearly relayed as well. It should include anonymizing data files, and then it should also consider the access to those files once they are created.
So all three of those aspects combine to form a much stronger approach to protecting participant identities than just one alone. Yeah, and I think it's time to go to Menti. So I'm going to hand over to Anka for these questions. And we do have a few questions coming in, so I'm going to try and answer some of them while we're doing the Menti, but we'll go through them as well verbally at the end with the time that we have left. Thank you, Maureen. I'll just quickly share my screen now. One second. Yeah, I think we should also have the Mentimeter link in the chat. I haven't checked the chat, but I think we're planning to put it there as well. Yeah, thank you, Gail. Okay. Hopefully you can see my screen. Maureen, can you see my screen? Yeah, sorry, my screen changed, so I couldn't find my audience. But yeah. Great. Thank you. Right. I don't know why this is showing up, because I did just remove all the previous answers, but it's asking again. That's fine. Okay. I have a few questions, and hopefully you can join using the information in the chat. I'm going to wait a few seconds, maybe 20, as I'm aware of time. So we'll give you a few seconds to join. These are just a few questions, but it will be really interesting to just discuss some of the issues that come up. Even if you're not able to join, I think it'll be interesting for you to see it on the screen. You know, there's no pressure, there are no scores, and we don't know who answers what, so don't worry about that. It's really just about discussing some of the concepts we've touched upon in the presentation. Okay. So I see we have about 20 participants now, out of the 100 or so that I know have signed into the webinar. So I'm going to wait a bit longer; I see that people are still joining. I never know when to move on, because I don't want to exclude anybody, but I feel like at some point it slows down.
So I think I'll start now. Yeah, I think just go ahead, because I think some people have joined but just aren't answering. All right. So the first question is, what type of data are you looking to anonymize? And we have here: quantitative, qualitative, both, or maybe you're not sure yet. Okay. So mostly qualitative, Maureen. That is an overwhelming majority. Okay. All right. And both. Okay. So we do have that. All right. Okay. Thank you. Moving on to the next question. What information would you think about when talking about anonymizing data? So, you know, what would you think: yes, I need to tackle this; yes, I do have that in my data; I need to think about a solution, too. And I think you can name a few options here. Yes. Demographic data, job titles, names. Is it sensitive data or identifiable? Yes, that is a good question. Geographic locations, location of case study sites, organization name, clinical data. Yes. Date of birth, demographics, addresses. Okay. This is very good. Political parties. Yes, that is something we need to think about how we manage. Personal sensitive data, yes; postcode, yes. This is very good. Personal beliefs. Identifiable. Okay. Very good. Medical conditions. Yes. Very good. Okay. Let's move on to the next question. Examples of direct identifiers. So, you know, thinking back to the presentation when we discussed direct identifiers, what comes to your mind now? We're testing you a little bit. Date of birth. Date of birth is not considered a direct identifier, because it doesn't necessarily point to a specific person. Even the full date of birth, of course, is often not necessary in the data, but, you know, how many people are there? It depends how much other information you have in the dataset, right? If you just have one person who is from this place, who is born on this date, then yes, it is. But in those cases we would, you know, just include the year.
And in my experience, whenever I raised this with a depositor, you know, I would ask, is it important to have the full date of birth in your dataset? And they would just say, no, I just collected it. I left it there. I didn't realize. But we would just, you know, keep the year. But if it's relevant and it must stay in the dataset, it's not necessarily considered a direct identifier, but we need to consider what other information there is in the data. And then if it is problematic, then we would need to think of solutions. I see everything else. Postcode, organization name, yes. Client or patient ID, passport number, yes. Very good. Very good. Email address, yes. Very good. Let's move on to the next. Examples of indirect identifiers. So now, again, going back to the slides to test your memory a little bit. Okay. Occupation, ethnicity, yes. Date of birth, yes. Religion, gender, job role, address. Address is not supposed to be here — well, it depends what you mean by address. If it's just, you know, the city where you live, then that's fine. But if you mean the full address, then that is a direct identifier, so it should have been in the previous question. Okay. So I think everything else I see is good. So let's move on. Okay. So which is most disclosive: IP address, postcode, gender, or religious belief? Tick all that apply. And now we're, you know, thinking back to those steps in the anonymization process, the value of information, and what we're trying to prevent, right? So identification or re-identification. Which are the most problematic or most disclosive? Okay. So most of the answers are IP address and postcode. And yes, that's very good. That's what we were going for — in comparison, relative to the other options, those are the most disclosive. Okay. Next question. Is someone's job title personal information or a direct identifier? This is a yes or no question, or you're not sure, you don't know. Okay.
Interesting — it's very close there between yes and no. Okay. So this is a bit of a trick question. It depends, right? So if the job title, you know, points you to someone in particular, then yes, it is personal information. Just to give an example, I used to work with someone who at some point Googled their exact job title, and, you know, Google spat out her name. So in that case, it would be personal information. But if it's something like, you know, research assistant, then it's not personal information. Okay. All right. What are the direct identifiers out of the list here? So we have job title, email address, gender, age, party affiliation, and national insurance number. Okay. I like that job title is being selected now, based on the previous question. But the most frequently selected are the email address and national insurance number, and yes, that is what we were going for. Those are the direct identifiers for sure; job title is questionable. Okay. What are indirect — so not personal — identifiers? Tick all that apply. So we have geographic coordinates, date of birth, gender, supermarket preference, sexual preference, or ethnic background. So the only answer here that is not considered to be an indirect identifier is supermarket preference. That is not necessarily something that would play a role in identifying someone, just because someone says in a survey that they prefer to shop in Tesco or Asda — I probably shouldn't mention names, but oh well. Yeah. That wouldn't necessarily weigh in the decision of identifying someone, compared to geographic coordinates, date of birth, gender, et cetera. Okay. De-identification is anonymizing direct identifiers — so name, contact details. Is this enough to consider the data anonymized? Yes, no, or not sure. Okay. I'm glad that the majority answered no, because that is correct. Yeah.
So if we think of that arrow that we discussed in the slides, right? We have raw data, we have de-identified data, we have pseudonymized data, and then we have anonymized data. So, yeah, de-identification does not mean that the data is anonymized. It just means that you took out the personal information — the directly identifying information — but you still have, you know, those indirect identifiers. So the data is not anonymized. Okay. True or false? Data collected on a sensitive topic is always sensitive data. I think this one is a little bit of a tricky one, because it depends on how sensitive data is defined. So we have: true, false, don't know. And I'll let Maureen comment on this. Maureen, if you're up for it. Sorry, I'm just rapidly trying to answer some of these Q&As, so some of you will have gotten that. Yeah. There's a couple of people answering. Okay. So this is a little bit of a tricky one, because it depends on how you define sensitive. So I would say false to this. Is there an answer? Yes — well, again, it depends. Like you said, I think it's one of those points where often we talk to researchers who know that their data feels sensitive — it feels intimate to participants — but there is a legal definition of what sensitive data is. So, you know, a sensitive topic may not necessarily be considered sensitive data by data protection law. It's just something to be aware of. There is a sort of deviation, I think, here between legal and ethical responsibilities, where you may feel an ethical responsibility because it's a sensitive topic, but it may not legally be sensitive data. Just because it's a sensitive topic doesn't necessarily always mean it's sensitive data. Great. All right. Let's move on to the next slide. Oh, I think I moved too quickly.
True or false: is it good practice to annotate anonymization in qualitative data? Actually, Maureen, I think these are your questions. I'll just put a pause on my answers in the Q&A, then. So, is it good practice to annotate anonymization in qualitative data? True or false? Oh, good. We're getting all trues. Yes, it is very true. So there are a couple of different ways that you can annotate that. The typical one is square brackets, but whatever system you want to use — whatever feels natural to you and works with the data set — you can go ahead, marking wherever you change the original data set, the raw data, into something else. Okay. Moving on. Which of the following strategies can be used to help protect participants' identities? And we've got three options there: consent, controlling access to the data, and anonymizing the data as soon as possible. You should be able to click more than one option here, hopefully. And, interesting. Okay. So the answer is all three — all three of them can be used. And there does seem to be a bit of separation there, where some are saying consent is not a strategy to be used to protect participants' identities. But I think it is, in the sense of whatever grounds you're processing the data on. And I think most researchers view informed consent as an integral part of the research process — you need consent in order to do anything with that data. So I would say it is. But I know there are some arguments under current data protection legislation that you can process it because it's research and this is exempt. You know, in any case, you should be considering what grounds you're processing the data on, and if you're using consent, that can absolutely be a discussion with your participants — you know, that you're going to share the data, this is how it's going to be used. And certainly if you plan on sharing data extracts or anything in publications or presentations or any other outputs from the research, your participants should probably know about that.
So, excellent. Okay. We did have some further resources. So what I'm going to do is just share my screen, and we can start looking at some of those questions while I share this. So give me just a moment to share those last couple of slides. It's links for further resources. Let me just put it in presentation mode. So we've got some more tools, and we've got a transcription template and transcription instructions. There's a confidentiality agreement for transcription — so if you're outsourcing the transcription elsewhere, we've got a template for the confidentiality agreement with your transcriber. And there's also a data list template to help you keep track of your data. So all of those are available online. Then these slides are going to be made available, so if you want those quick links, go back to our events page and the slides will be there. There are also some further resources. So there is an ESRC working paper on anonymizing research data. There's a guide to social science data preparation and archiving from the Inter-university Consortium for Political and Social Research (ICPSR). Ruth has an anonymization and social research publication. Timescapes, which is another archive, located at the University of Leeds, has also published anonymization guidelines, and the Information Commissioner's Office, the ICO, has a code of practice. If you're interested in Mark Elliott's anonymization theory, his book about that framework is openly published online. There's also guidance on anonymizing data and some advice from med.data.edu. So again, this is all going to be available on our events page. And then you can stay connected with us via our social media. We're on Facebook and Twitter; there's a JiscMail email list; there's YouTube, as well as our website. And then we have more upcoming events. So if you're interested in these sorts of discussions, we've got some more on data management coming up later, as well as dealing with specific types of data.
There's also a session on ethical and legal issues in data sharing in mid-November, if you're interested in that. And if you're interested in reusing data, you can come back for my next workshop on getting started with secondary analysis. So all of those are listed on our events page.