Okay. Hello everybody, and welcome to Synthetic Data Part 2: Redaction and Masking. My name is Jace Fallon, I'm a research associate at the University of Manchester, and thank you all for coming. In our last webinar, which is now available on the UK Data Service YouTube channel, we covered what synthetic data is, why we should even bother trying to make synthetic data, what the benefits and purposes of the different types of synthetic data are, and what the features of those different forms of synthetic data are as well. So if all that sounds interesting and you missed it, go check it out on the YouTube channel. In this webinar, we'll be covering two common categories of disclosure control. We covered this briefly in the first webinar: disclosure control refers to methods that allow us to protect the confidentiality of the subjects of research. This is not necessarily synthetic data, though it can be; the two are often very closely linked, thematically solve similar problems, and can be misinterpreted as one another. If you aren't familiar with synthetic data, as I've said, you can watch that introduction over on the YouTube channel, where we go into a lot more detail. Redaction, as a definition, just refers to removing data deemed too sensitive. It could mean removing an entire row or an entire column, but you could argue it also covers removing parts of individual rows, individual entities, anything like that. So it's very fluffy. Masking instead refers to replacing parts of the data that are deemed too sensitive: for example, replacing my name with a synthetic name we generate, or some initials, or an empty string. You'll notice the differences, or similarities, between these techniques are incredibly vague, and in some methods we could be using both at the same time. For example, if we replace names with an empty string, would you classify that as redaction, because we're removing data?
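To make that distinction concrete, here is a minimal sketch in plain Python; the record and the replacement names are made up for illustration. Redaction drops a field entirely, while masking keeps the field but swaps in a substitute value:

```python
import random

record = {"name": "Graphia Midden", "email": "gmidden@163.com", "city": "Manchester"}

def redact(rec, field):
    # Redaction: the sensitive field is removed entirely.
    rec = dict(rec)  # copy, so the original record stays untouched
    del rec[field]
    return rec

def mask(rec, field, replacements, rng=random):
    # Masking: the field is kept, but its value is replaced.
    rec = dict(rec)
    rec[field] = rng.choice(replacements)
    return rec

redacted = redact(record, "name")   # no 'name' key at all any more
masked = mask(record, "name", ["Jace Fallon", "Georgina Shiki"])
```

Either way the original record is left alone; only the copy we intend to share is changed.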
Or would you call that masking, because we're removing it but replacing it with an empty string? I don't really see much difference there beyond semantics, but we've got to remember we're on the fringe of data technology and techniques when we talk about synthetic data, and the surrounding areas do get a little bit fluffy. As I've just said, masking is about replacing data, or parts of the data, with some generated information. Some forms of masking could be considered a form of data synthesis, which is a larger classification that encompasses both synthetic data and these disclosure control methods. This masking could be done manually, by generating your own context-inspired data. Alternatively, you could make use of a data generation tool; for example, there's a really good one called Mockaroo, which is all web-based, so you don't need to know how to code or anything like that. If you are slightly more technical, you could search for particular data generation libraries: there's synthpop if you're into R, and there's Faker for Python. These can generate connected data such as names, emails and more. In this example dataset here, we could mask every aspect of the data. We could start by masking the name Graphia Midden with the name Jace Fallon. I've just pulled that from my head, but that is my name. Although we have masked this data, there are still multiple ways this individual could be identified. For example, their email address still seems to contain hints of who they were: gmidden, with Midden being their surname and G being their initial. This email address is enough to uniquely identify this individual anyway, so we don't even have GDPR compliance here, let alone the much more nuanced definition of confidentiality we need when we're dealing with synthetic data. Even in this simple case, we can still make quite dangerous mistakes, and we'll go into more detail with this example shortly.
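As a rough sketch of that curated-source approach (the name lists here are invented stand-ins; a library like Faker or a tool like Mockaroo would give you far richer sources), notice how masking the name alone still leaves the real surname leaking through the email:

```python
import random

# A tiny curated source. In practice this might be a text file of
# realistic names, or a data generation library such as Faker.
FIRST_NAMES = ["Georgina", "Jace", "Amara"]
LAST_NAMES = ["Shiki", "Fallon", "Okafor"]

def mask_name(person, rng=random):
    masked = dict(person)
    masked["first_name"] = rng.choice(FIRST_NAMES)
    masked["last_name"] = rng.choice(LAST_NAMES)
    return masked

person = {"first_name": "Graphia", "last_name": "Midden",
          "email": "gmidden@163.com"}
masked = mask_name(person)
# The email was not masked, so it still contains the real surname
# and remains a unique identifier under GDPR.
assert "midden" in masked["email"]
```

This is exactly the mistake described above: every field that can identify the person has to be masked, not just the obvious one.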
As I've said a few times, masking isn't a method of data synthesis; simulation, on the other hand, is. We cover simulation in a lot more detail in webinar three, but let's just break down some of the differences between simulation and masking now. Simulation will use modern machine learning algorithms or data generation libraries, whereas for masking all we need is a curated source of data, or some randomness we can pull from, to create new names or ages or genders or things like that. Simulation is usually irreversible, whereas masking methods might be irreversible, but they might also be intentionally reversible, as a form of encryption. Simulation is generally very difficult, whereas masking in comparison is quite trivial. Simulation can represent the full statistical spectrum of the data, whereas masking is more of a low-effort sampling exercise. But let's go into these in more detail. Quickly, let's just define the definition of simulation; sorry, not the definition, the differences from synthesis here. Again, because of the ambiguity of these disclosure control methods, there's a very subtle difference between what we're justifying as masking and what is called simulation. The main difference, to me, is that simulation comes from the application of those modern algorithms, machine learning or data synthesis packages, to generate new data, whereas masking is a simpler find-and-replace from a curated source of data. Simulation techniques often use some random noise to generate new data, whereas masking will ignore the real data input and output a new or sampled result. Some of these techniques may be reversible. For example, we could map all names to an uppercase version of themselves. It's not particularly useful, and the personal data is still there, even though the data has technically been changed; the reversal of that method would simply be to lowercase those names. Simulation, on the other hand, is very difficult.
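Going back to that reversibility point for a second, here is a tiny illustration of the uppercase example just mentioned, next to an irreversible substitution; the names are hypothetical:

```python
import random

def mask_upper(name: str) -> str:
    # A trivially reversible "mask": the original is fully recoverable,
    # so the personal data is still effectively there.
    return name.upper()

def unmask_upper(masked: str) -> str:
    # The reversal is simply to lowercase (and re-capitalise) the name.
    return masked.lower().title()

def mask_substitute(name: str, pool=("Jace Fallon", "Georgina Shiki")) -> str:
    # An irreversible mask: the output does not depend on the input at
    # all, so the original name cannot be recovered from it.
    return random.choice(pool)

assert unmask_upper(mask_upper("Graphia Midden")) == "Graphia Midden"
```

The round trip in the last line is exactly why a reversible mask on its own offers essentially no protection.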
And depending on your dataset, simulating new data might take a couple of weeks, or it might take years and a full research team to generate something reliable, to fully understand the context in which that data is used, and to be able to claim that it protects confidentiality in some regard as well. It may be more accurate and interesting, but usually that's a huge amount of investment and not really applicable to a lot of projects. Masking, instead, can be done relatively quickly with a few lines of code, or it can even be done in something like Excel. Simulation may learn the nuances and statistical distributions of the real data. For example, if we had a list of genders and heights, we might notice that, generally, males are taller, and our generated data should also reflect this. Masking, on the other hand, would simply pull from a list of random but realistic values of our choosing, and this can be a quicker solution for potential GDPR compliance. Some general best practice advice. Explore your data thoroughly before claiming you've discovered all special category data; maybe I'll even go as far as to say don't ever claim you've discovered all special category data, because it's unlikely to be true. It's possible something will be missed, so document everything you can: cover your exploration, cover your disclosure control methods and synthesis in detail, and, yeah, just cover your back there really. Understand your data before choosing any masking techniques; this will help ensure confidentiality. Where possible, use irreversible methods. It may be that you need a tool to encrypt and decrypt data at a later stage, but personally I find this just opens a lot of doors for malicious use, so be careful. And finally, try to preserve the structure of your data: keep the columns, the order of columns, and the types of data held within them similar.
This means that whatever analysis you apply to your synthetic data, hopefully you can just lift those methods and apply them to the real data itself. Businesses are encouraged to use encryption and pseudonymisation, which are both recognised safeguards within GDPR. Pseudonymisation refers to the process of mapping data to new, nonsense synthetic data with a reversible key, but I would suggest, again, going a step further: unless you really need that reversible step, make it irreversible. Remember that you might not only be beholden to GDPR, depending on the data you're using, and that using encryption or pseudonymisation does not guarantee you compliance or permission to share your newly generated data. It's important to be aware of the legal implications of synthetic data as well; depending on the data you make use of, you may not have the right to distribute it. Consider whether you can legally share any data you've created, and the impact of that data, and if in doubt, ask your data provider about the data you've created; the responsibility might fall on them to verify and distribute your methods and the data itself. So, I have managed to get access to a fake 2020 census dataset, and our goal is to explore and analyse the people of Manchester, UK. It's based on this fake 2020 census, but luckily there are only two records there, so the census isn't going to be too hard. It wasn't really a good census, not as good as the recent 2021 census. But yeah, it's feasible we can generate new synthetic data for our use case by hand rather than writing some complicated code. We don't need any Python here, we don't need anything more technical than that, and hopefully we won't need more than a couple of minutes. So, we have a first row here: an ID of 2999, a first name Graphia, a last name Midden, and an email gmidden@163.com, which appears to be a combination of Graphia's first initial and last name.
We have a gender denoted with an F. We think this implies female; we think M might imply male; maybe there's an O for other, but we just don't know yet. Then we have an address. And then we have a second row with an ID of 3,000, a first name Andy, a last name Defriends, and an email adefriends@123.com, which again follows that exact same pattern of first initial plus last name, and a gender M. So we can maybe safely assume that the M stands for male now, but, you know, that might hurt us later on. Let's assume we need to protect the identity of both these individuals as well as apply some preprocessing, and our research question is: what kind of people live in Manchester? We're not going to have a very deep answer here, because we've not got a lot of data and we've not really got much granularity within that data, but let's give it a go. To start with, let's look at that ID. Does the ID 2999 imply we've missed 2,998 rows? Could this ID be used internally by the organisation that provided this data to us to identify that individual? Our second record has the ID 3,000, so this seems to imply that the IDs are incremented, but it doesn't really give us any information about what happens after 3,000 or before 2,999. Without that domain knowledge, it's really difficult to know just how dangerous this ID could be. And there may be restrictions on this data: maybe an API will fail down the line if it's not a four-digit number. So we need to be very careful with the assumptions we make, and just lay them all out really. Now, let's generate two new sequential four-digit numbers, and let's also note that we've done this. So we're now masking those values with 9,000 and 9,001: still four digits, still incrementing, but no longer the same IDs as before. Next we get onto what are called free-text fields.
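Before moving on, that ID masking step could be sketched like this, assuming, as we did, that IDs must stay four-digit and incrementing (the function name and starting value are illustrative):

```python
def mask_ids(records, start=9000):
    """Replace the original incrementing IDs with a fresh four-digit
    sequence, preserving order but breaking any link to the source."""
    out = []
    for offset, rec in enumerate(sorted(records, key=lambda r: r["id"])):
        new_id = start + offset
        # Our (documented!) assumption: a downstream API needs 4 digits.
        assert 1000 <= new_id <= 9999
        out.append({**rec, "id": new_id})
    return out

rows = [{"id": 2999}, {"id": 3000}]
masked_rows = mask_ids(rows)  # IDs become 9000 and 9001
```

The point is less the code than the documented assumption inside it: if the four-digit constraint turns out to be wrong, we know exactly where we baked it in.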
Often they're not useful in analysis, but with some preprocessing we could make some huge assumptions from these names about race or age, though it's always going to be better to get an alternative and more reliable source of data for these. We could remove them altogether, but for now let's just try and maintain the shape of the original data. So, the name Graphia Midden might not be unique to this individual worldwide, but she might be the only Graphia Midden in the UK, and it's even more likely she's the only Graphia Midden in Manchester. It's for this reason that these need to be removed, especially as we reduce the scope of our research. We should generate a new first name and last name through a masking method, such as just pulling a random name from a curated source. So I've replaced Graphia Midden with Georgina Shiki, and Andy Defriends with Jace Fallon. We could have made use of a data generation library or tool here, like Faker or Mockaroo, but I've just used my head to make up some names. Cool. As a test, we should move along this record and check that each field still gels with the rest of the data we have so far, and actually, it doesn't. For example, it seems kind of odd to me that Georgina Shiki's email address contains "midden". Perhaps that's a previous last name, or a maiden name; it's not impossible. But while we're here, we could change this to something we can quickly synthesise, since we've already recognised the rule in the emails that we think makes sense. Further to this, these emails are also unique identifiers under GDPR, so this is something we would have to change anyway, as gmidden@163.com is still Graphia's original, identifiable email. So we need to replace both of these emails while maintaining their connection to the columns before, and that's where it gets a little bit more complicated. Generating completely random emails wouldn't work here; I can't just make something up.
It has to be connected to the names, Georgina and Shiki potentially; I think you'll find, on average, that that pattern holds up. So instead we need to generate an email related to our names, and those good generation libraries like Faker will handle this for us, as will Mockaroo: they'll make realistic email addresses for us. We could argue that this is the first part of the process that really involves synthetic data. We have a model, which happens to be: take the first letter of the first name, append the lowercase surname to it, and generate a random three-digit email provider. Sorry, just a clarification there: until this point, everything has been masking, just with a little bit of logic. So while it's not entirely realistic, we have matched the format of the emails we've seen so far, and that's really the best we can do with this dataset. Next we look at the gender column. Again, we've made some big assumptions. We've assumed that F refers to female, which feels like a safe assumption, but we never know with business logic, and that M refers to male. This assumption may be false, and hopefully there's some documentation we can check, but in this case there isn't. Luckily, in this case those genders appear to correspond with the names provided. There may also be options beyond male and female, but we just don't know. It is possible that our synthesis of a new name could have contradicted these genders as well; gender may obviously be connected to a first name column, but in some cases it isn't. If we make use of a data mocking tool or a library, we can generate whole, coherent individuals all at once, making sure all of these fields are consistent. And depending on our research question, we might not even care about the gender, the names, or the emails at all. So with some redaction or preprocessing we could really get some more useful information out of these fields. Finally, we're going to look at the address column.
This, again, is GDPR-protected information. It does identify the individual; we shouldn't be able to find this individual even if we don't know their name. Removing the house or apartment number might be sufficient here, and we'll cover a lot more techniques like this, something called coarsening, in the third webinar. But for the sake of this project, we know that we want to analyse the people of Manchester, so perhaps reducing the address to just the city, or the top-level postcode, of this resident is enough: enough, say, to hand it to somebody else who could confirm whether this person is or isn't in Manchester. So let's just generate, for now, a new and synthetic address for each user, preserving the city they live in. At this point we have two entirely non-identifiable records. There may be real people named Georgina Shiki and Jace Fallon who do happen to live at these addresses, but these entire individuals, with their emails, genders and addresses, were generated using random data or data synthesis, and that is really enough. We have to be able to show that we've made up a new individual, that it's statistically unlikely they exist, and that if they did, it would be a coincidence; we've documented all of that, and it can be verified. Depending on our data provider, this could be data we could share alongside our research. We go a long way towards making our research a lot more reproducible through these methods, but perhaps there is more we can do. Remember, our research question for now is only focused on Manchester, so there are some aspects of this data we don't need. Really, we don't need any data on the users outside of Manchester, so we can filter all non-Manchester users out by their postcode or city and make a new dataset that contains only Manchester-based data.
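Those two steps, coarsening the address and filtering to Manchester, could be sketched like this. The address format and the helper names are assumptions for illustration, not a fixed recipe:

```python
def coarsen_address(address):
    """Keep only the city and the outward (top-level) part of the
    postcode, e.g. 'M1' from 'M1 4BT'. Assumes addresses look like
    'house and street, City, Postcode'."""
    parts = [p.strip() for p in address.split(",")]
    city, postcode = parts[-2], parts[-1]
    return {"city": city, "outward_postcode": postcode.split()[0]}

def manchester_only(records):
    # Redact every row whose coarsened address is outside Manchester.
    return [r for r in records
            if coarsen_address(r["address"])["city"] == "Manchester"]

rows = [
    {"name": "Georgina Shiki", "address": "12 Example St, Manchester, M1 4BT"},
    {"name": "Jace Fallon", "address": "3 Sample Rd, Leeds, LS1 2AB"},
]
kept = manchester_only(rows)  # only the Manchester record remains
```

Real address data is far messier than this, which is exactly why the assumption about its shape needs documenting before you rely on a rule like this one.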
Also, this free-text data, the first name, last name and email: while it's useful for creating that indistinguishable dataset, and might be really useful for a machine learning model, we're not really going to gain anything from a high-level analytic view by knowing their emails and names, so we can redact those as well. So, to summarise that entire process. We've generated a new ID of 9,000; it might increment, and our assumption is that it has to be four digits. We've generated a new first name and last name using a curated source, though that curated source happened to be my head. We could use an internal file of realistic names, or names sorted by religion or ethnicity, or all sorts, depending on what we actually need, or we could make use of a data generation library. We've made use of those generated first names and last names to generate a realistic email address, and we've created what we think is a realistic gender. I'd recommend using a data generation library here, but again, be very aware of the implications of assuming the first name is connected to a gender: in some cases that is the case, but in many cases it isn't. And then finally, we generated a new, realistic address for this user, restricted to the same top-level postcode. We could perhaps imply that a first name implies an age range, an ethnicity and more; just having knowledge of a user having an email could, for example, elicit information that they have access to technology, or some details about their age. These are very bold assumptions to make, and many analysts refuse to do so, rightfully so. Make it clear in your documentation if you do anything like this, and why you did it. Personally, my goal would be to work with this data to showcase those assumptions, and use that to get access to the reliable data: if you think you've found something there, you need to prove it with something real; you can't just impute everything you didn't understand.
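Pulling the whole process together, here is a sketch of generating one coherent masked individual. The name lists, the gender-to-name mapping, and the email rule are all assumptions we documented above, and a library like Faker would do this far more thoroughly:

```python
import random

# Curated, gender-keyed name lists. Remember that the name-to-gender
# link is itself an assumption that will not hold for everyone.
FIRST = {"F": ["Georgina", "Amara"], "M": ["Jace", "Andy"]}
SURNAMES = ["Shiki", "Fallon", "Okafor"]

def synth_record(new_id, gender, outward_postcode, rng=random):
    first = rng.choice(FIRST[gender])
    last = rng.choice(SURNAMES)
    return {
        "id": new_id,                          # four digits, incrementing
        "first_name": first,
        "last_name": last,
        "gender": gender,
        # Email follows the observed rule: initial + surname @ 3-digit host.
        "email": f"{first[0].lower()}{last.lower()}@{rng.randint(100, 999)}.com",
        "outward_postcode": outward_postcode,  # coarsened, city preserved
    }

rec = synth_record(9000, "F", "M1")
```

Because the email is derived from the synthetic name rather than the real one, nothing in the output record links back to the original individual.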
Using historic data, for example, we could imply that the name Georgina has an Italian heritage; it might, however, just be a nice name somebody had. We can actually use the Office for National Statistics baby name explorer to see the rise in popularity of different names over time, and we can see that the name Georgina peaked in use around 20 to 30 years ago. So we could assume that Georgina must be between 20 and 30 years old, even though that's obviously not necessarily true; that's just an imputation. We could showcase these assumptions and use those findings to try and elicit more reliable data. However, the more we make these assumptions, the less precise our results are going to be, and any policy changes informed by this research would really be made in less good faith. I'd just say: let's speak to what we know. We finally have the following. We've created a synthetic, non-identifiable record which represents, and in this case entirely matches, the distribution we saw in the real data: 50 percent of the people in our dataset were female, 100 percent of the people in Manchester were female, and we've not lost that finding. The ID won't really serve any purpose in our research. As I said, we've claimed that 100 percent of the people in Manchester are female; this is a limitation of the poorly collected fake 2020 census more than anything else. And I've coarsened those addresses to a top-level postcode to make sure we're only collecting those Manchester individuals. We could plot a population distribution if we wanted to, but it's not going to be a very interesting one. As I keep saying, we've been forced to make a huge number of assumptions, even though this is such a simple use case. There wasn't really any documentation, so there are many things we don't know and are simply assuming: that F means female and M means male; that IDs must have four digits for some reason, but that there are no rows missing (in some cases, seeing those IDs 2999 and 3000 might make me think, okay, have I not got the full dataset here, have I downloaded something wrong by mistake? There are only two people in that entire census, which is very, very odd); and that only the ID, gender and the coarsened address are needed for our study, another huge assumption. The responsibility is on the person generating that synthetic data to document and share all of these assumptions. Without the context of the original data, it's really difficult for us to ethically generate more, or to claim that we protect confidentiality, or anything like that. But to summarise the process of redaction: we've redacted all rows not containing Manchester data, because they're not relevant to our research question; we've redacted the free-text columns, first name, last name and email, because while we could imply countless things from them, we just can't justify keeping them for the study; and finally, we've redacted parts of the address, or you could say we've masked parts of the address with nothing, keeping only the top level of the postcode. So, let's discuss what masking is in more detail, and its advantages and disadvantages, through a variety of very strange masking methods. The masking methods can be summarised as follows. Substitution involves replacing data with data from a curated source; this might be a simple text file of names, randomly generated numbers, or even use of a sophisticated data generation library (I guess we could pull numbers from our heads as well if you want to). Shuffling involves randomly changing the order of values appearing in columns, so using real user data but mixing it up; this is very similar to what mimicking is in the third webinar as well. Variance is adding or removing a small amount of noise to numerical or ordinal values. Encryption is using any modern encryption technique to translate data into nonsense values, so that only the person
with the decryption key can return to those original values. Scrambling refers to rearranging the order of characters in real data. Nulling out refers to replacing data with clear null values. And masking out is replacing data with clearly fake values, such as the words "first name" or X'd-out characters, things like that. Of these techniques, I'd really only say that encryption, nulling out, masking out and substitution are the ones that might ever be appropriate to use. I'm wondering why I included the other ones, but they're quite fun to walk through very quickly. They all depend, again, on your data provider: very often you can't re-share data just because you've done something to it; you have to go back through the process of depositing that data and getting it verified. But let's go into more detail on each of these techniques, starting with substitution. Personally, I believe this is the best form of masking you could do. When done right, there should be no personal data, and your data should still be indistinguishable from real data, because you're still using realistic names and realistic emails; they're just not real names and real emails. And while it's a technical process, you'll probably get the most value for your time doing this. You could realistically do it in Excel, but personally I'd go for a programming approach, using Mockaroo or Faker from before; though, you know, you could get away with it in Excel, with a curated list of random names that you pull from, things like that. So, in this example table, we again have a Graphia Midden, who is 24 years old, and we have an Andy Otto, who is 45 years old. In substitution, we use a curated source of realistic names, connected data; for simplicity's sake, let's ignore the correlation between names and ages, even though it is there, and substitute in two new names that have been generated using a data generation library, one of which is Jace Fallon.
Well, they weren't really generated by a library; those obviously came from my brain, because that's my name. The ages are still there, but realistically these ages should blend in with the thousands of similar individuals in a full dataset. Used in combination with a variance-based method to adjust those numeric values, we could have quite a good disclosure control solution here. Shuffling, instead, refers to changing the order of values in columns. Because we're using real data here, we aren't really doing much to protect an individual under GDPR: a real name alone is personal and identifiable, even if the remaining data attached to it isn't correct, so just using somebody's real data like this isn't really useful. Personally, I believe that if you're capable of shuffling random values in a column, you're capable of doing everything we just showed before, so use substitution instead. In our example, this means we just swap the two names around. I mean, the shuffling could end up producing a new dataset identical to the previous one, but in this case they swap, so now we have an Andy Otto who's 24 and a Graphia Midden who's 45. Neither is a real individual, but we've used their real names, and those are still identifiable, so we're still just showing that we hold Graphia Midden's and Andy Otto's data. Variance-based methods instead involve adding or removing random values to numeric or ordinal data, with a little bit more logic. We lose precision here, and that will have knock-on effects on any analysis or machine learning you actually apply to this data, but if we do it to a small degree, the general trend should stay there and appear the same. As always, though, we're going to have these trade-offs between confidentiality and the benefits of the analysis we can perform. We could apply this to any numerical or ordinal data, but it makes less sense to try it with free-text or categorical columns, which often contain that more dangerous personal data. In this example, we're
simply adding a number from minus two to plus two to each age. While we still know the names, and could still make true statements like "Graphia is younger than Andy", we're not really protecting much about the individuals themselves; they're still identifiable. I would suggest that doing this in combination with what we did in substitution would resolve most of the privacy concerns with this dataset. We have some relevant questions in the Q&A, so I'll go through them. "Could you not just redact the email address?" Looking at the previous example: yes, we could just redact the email addresses, and we kind of do at the end of that case study. My first pass-through is just to see if we can make everything in the dataset indistinguishable, can we make it look like real data, and then we can apply our research question. Obviously, if we're analysing the people of Manchester and all we know is that all of them have an email address, it doesn't really add much to the analysis to just say they all have an email address; in reality, a column that said "do they have an email address, true or false" would give us the exact same potential for analysis as the email addresses themselves. Next, scrambling and encryption methods. We can use any modern encryption algorithm to convert our data into an unreadable form; this will normally turn our data into some unreadable nonsense. Sorry, it's not scrambling, it's encryption. We do have to ask: what is the point of doing this, why are you using encrypted data when substitution could be reasonable and have more benefits? One main benefit here is just for compliance's sake: if we did get a request to delete this data, simply deleting the decryption key, rather than tracking down all the files that contain this data, could be enough to ensure this data is now valueless. And in some respects, just having that reversible method is a benefit as well: we could give our data to third parties, they can analyse the encrypted data, give us the
analysis, and we can apply it to the decrypted data, being the only ones who can actually gain anything from that dataset. In this example, we're applying some encryption to our names, and they don't really look like names any more; it's not indistinguishable, and we as humans could identify which one was synthetic and which one was real. I would also worry that these encryption methods can sometimes produce special characters, for example quote characters, the pound sign, the exclamation mark, and some of these might break APIs or training we do down the line. So yeah, something to be careful with. Scrambling involves randomly rearranging the characters in a string, so, similar to shuffling, there are cases where your data could come out of this process identical. "Joseph" could just become "J-o-s-e-h-p", which would still almost look like "Joseph", and your brain might interpret it that way anyway; you could probably crack that scramble. Also, there are countless anagram solvers online that could make really short work of these names. It's potentially reversible, the data doesn't really look real, and it requires some real technical skill, so I would really suggest, again, just use substitution instead. In our example below, you can see that while the names are scrambled, capital letters and space characters give us extra clues about the name. Putting the scrambled first name into an anagram solver gave me ten results, and one of them was the correct name, Graphia Midden. So, not great, and again I just recommend substitution over this. Nulling out refers to replacing entire pieces of data with a clear null value, showing that this data exists in the full dataset but has been intentionally removed for this dataset's purpose. A benefit here is that an analyst could perhaps speculate on what these columns could add to their research, and use this as a foundation for expanding their research question; you can see this as a slight benefit over full redaction, for example
removing the name column. This can be done in Excel, but may require something more powerful if the dataset is quite large. Again, in our example you can see we've replaced names with a large and obvious null value. Via documentation, I would try and make it clear that we have removed this, not that these are just null values that happened to be in the dataset. A potential problem with this method is that we are signalling with these null values while assuming that there aren't actually null values in the dataset, and those of you that have dived into any large industry dataset will know that there are frequently mistakes, there is frequently missing data, and our null values might not stand out as proudly as we're hoping. This is why I prefer the next method, masking out. Masking out refers to replacing parts of the data with obviously fake data. This could mean replacing every bit of data with a nonsense string, but often we use some clearly fake data, allowing us to show that this data is not null, it's just been intentionally removed and replaced with something else. For example, we might simply use the name of the field itself, so we might just have, in all capital letters, AGE, NAME, GENDER, to show that something was there; but it's also important that we document that we have done this. A relatable example you might have seen: if you ever look at your credit or debit card details on a bank statement, or pay with them online, when the card number is shown back to you later they'll often display three groups of four X's, mimicking the structure of the card but still revealing the last four digits so you can check it's your card. This is safe enough to display without risking a bank account. And in our example dataset, we can simply replace our names with just the large word NAME, or something like SYNTHETIC NAME or FAKE NAME, whatever you think will make it obvious to the users of the data. Personally, I prefer this option to flat-out redaction or
nulling out, because we still retain some of that context; we still give researchers the ability to speculate on what was there, why it might be useful, and potentially how they could get it. So here we recap all of these methods. I won't go through them all again, but I'll reiterate: if you're going to go to the extent of trying to provide confidentiality, avoid the methods that don't give you unidentifiable records. Try to keep the context where you can, and use methods which generate indistinguishable data. If you're proficient in Excel or a programming language, my recommendation for this data set would be to apply substitution or masking out to all protected fields, or fields connected to protected fields, and to add a light amount of variance to any numeric data, and perhaps ordinal data if you carefully understand it. Keep this amount small, obviously, because it will have a knock-on effect on any analysis you do: if you're adding five percent to all ages, all of your age-based analysis is going to be off slightly. If you're not too confident here, I'd suggest nulling out protected and special category data. And I strongly suggest you avoid scrambling and shuffling; they're just quite silly, and the more I talk about them the less I think they belong in this presentation. Encryption can be a really good option for GDPR compliance when sharing with a third party, while still allowing you to make use of that data, but be aware, again, that if you're beholden to any legally binding data providers you probably can't do this; you have to check with them first. So whatever process you do choose, please document all of your methods, protect yourself as much as you can, and always check with your data provider whether you're allowed to do this. They might be grateful, but they might also be concerned about the terms you agreed to. So yeah, be careful. Cool, and that is us done with masking, so next we'll move on to redaction. You may be
familiar with redaction as presented by the media: documents are manually reviewed and restricted information is painted over with a black pen. Often entire sentences are redacted and the general context of the document can be lost. When we talk about redaction as a method of disclosure control, we refer generally to any process where we remove data from our data set. This could be removing sensitive records or a set of columns. You might want to remove particular rows from your data set because they may be particularly dangerous; for example, we may not want to reveal the locations of military bases or hospitals in a data set from an active warzone. We also might need to redact a record because the nature of an outlier might be so extreme that the outlier itself is a personal identifier, and therefore protected under GDPR. There are now online tools and published methods to automate the redaction process. These involve uploading our documents, running a model trained on the specific language of that document, and returning the redacted document. We could replace the redactions with a relevant term, in that similar masking-out style we just talked about, so that context is not entirely lost. We can also review these redacted documents to see if our models need extra training, or if there are any edge cases we've missed. Here we have an example output from such an online tool. We can see the bank name, logo, customer details, dates and amounts have all been redacted. The customer's privacy here is well protected, but what's the point of holding this document with almost no information? As we discussed in the last webinar, what we choose to redact is heavily context dependent. For example, the research question "how do personal savings change over 2020?" might warrant us storing the end balance of this account, but we don't need to know this customer's name or address. So, Jill, it's time for the next poll. But before we get into that redaction example, I want to ask you all: how many
sisters do you have? We're not going to be holding this data outside of this session; you're welcome to lie if that makes you comfortable, and this is an optional survey, so if you don't wish to answer you don't have to. I don't think you can all see the poll results, so I'll let it run to the end and then read it out. So: 55 percent of you have zero sisters, 35 percent of you have one sister, five percent of you, only one person, has two sisters, and five percent of you have three sisters. So if you are the person right now with two or three sisters, you know who you are; you might feel like I've identified you, and that could be kind of scary. That's the danger of being an outlier in this small data set, even though in the global UK data set you wouldn't think that having two sisters would be something to be scared of. So as an example, let's say we have a table of data on four students in a class. Our first student has an average grade of B and one sister. Our second has an average grade of A with two sisters. Our third has an average grade of A and 11 sisters, which I believe was close to the UK record for number of sisters, which means we could identify this person if we knew they were in the UK. Our fourth has an average B grade and one sister. The question here is: which of these students could potentially be uniquely identified, even with this limited data set? And yes, it's that third student; it's that world record, basically. So if you answered student three, you're correct; I kind of gave that one to you. In fact, this is a UK-based student, and there's only a handful of students in the entire UK this could ever have been. Particularly if we knew this was from a recent year, or how old these students were, it would be so easy to identify this person, because we're very close to that record number of sisters, which might even be published, and, you know, it
might be in the news for a particular town. So the point here is that having a high number of sisters could be enough to identify an individual in a particular school with confidence. That's quite a scary thought. What about student two? They only have two sisters. If we knew more about this data set, for example which school and which class we were looking at, we could probably also identify student two. If the class was larger than these four students, it's likely those two sisters would be less significant; even in our class here of 26 participants watching, two sisters wasn't quite enough to identify that person. Whereas the students, well, participants, with one or zero sisters: you've all blended in with the crowd, and I can't identify any of you. That's not the case for student three. By the same logic, student one and student four are somewhat protected just because they share the same average grade and the same number of sisters; if I went to this class, I couldn't point to the individual who has an average B grade and one sister. So context, again, is incredibly important here. What if our study question was, for example, "what's the connection between the number of sisters a student has and the grades they obtain?" The outlier with 11 sisters could be the most important data point we could ever find in our project, and redacting it might not make any sense; we might want to put the whole focus of the study on that outlier. In many cases, machine learning is applied to predict exactly these rare outcomes: when a medical scan contains a tumour, when a car crashes, when credit card fraud takes place. Redacting rows because they contain outliers might be the worst thing you could do for that analysis. This is where other synthesis techniques, or I suppose disclosure control methods, need to come in to protect the confidentiality of those individuals. And it's only by knowing that context that we can decide whether that redaction
or disclosure control method is even appropriate. To protect our individual with 11 sisters, we could remove them entirely. We don't have anything telling us which district, school or class we're looking at, so among the thousands of students in the UK this record could realistically have come from almost any school. Even in this small data set there seems to be an implied correlation between grade attainment and the number of sisters a student has, so that deleted row might be very, very useful, but we can't really use it for this analysis. We could perform our analysis on the full data set but only release the redacted data set for verification, and outline the steps needed to obtain the full data set, to make our results reproducible in that regard. Depending on the problem we're trying to solve, removing columns might be more appropriate than removing those outliers or sensitive rows. Free-text names, emails and the like may not be useful in their raw forms in a machine learning context, but with some pre-processing, as we talked about before, we could potentially impute ages, genders, religions, races and more. These could be more useful features, and we could better protect privacy in this regard. That being said, inferring those protected fields from other fields must be stated very clearly; it's a huge, and potentially very dangerous, assumption to make, and I would suggest that if you ever do that, make sure you're only doing it to make the case for access to better data, not because you want that to actually be your final answer.
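The two redaction choices just discussed, dropping the outlier row versus dropping a whole column, can be sketched in a few lines of Python. This is a toy illustration, not a tool shown in the webinar: the table below stands in for the four-student example, and the helper names are mine.

```python
# Toy student table standing in for the webinar's four-student example.
students = [
    {"name": "Student One", "grade": "B", "sisters": 1},
    {"name": "Student Two", "grade": "A", "sisters": 2},
    {"name": "Student Three", "grade": "A", "sisters": 11},
    {"name": "Student Four", "grade": "B", "sisters": 1},
]

def redact_columns(rows, columns):
    """Column redaction: return the table with the named columns removed."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]

def redact_outlier_rows(rows, field, threshold):
    """Row redaction: drop any record whose value is an extreme outlier."""
    return [row for row in rows if row[field] < threshold]

# Dropping the name column keeps the interesting outlier for analysis...
released = redact_columns(students, {"name"})
assert all("name" not in row for row in released)
assert released[2] == {"grade": "A", "sisters": 11}

# ...while dropping the outlier row protects the near-record individual
# but loses what may be the most informative data point.
safer = redact_outlier_rows(students, "sisters", threshold=10)
assert len(safer) == 3
```

Which of the two is appropriate depends entirely on the research question, as the transcript stresses: the 11-sister record is either a disclosure risk or the whole point of the study.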
A research question covering, say, how age is distributed across Manchester warrants us having some sort of age and location data, but the names of the individuals are completely unnecessary here, and in this case redacting that column is entirely acceptable. In this example the names column doesn't help with our hypothesis, so we could remove the column altogether, creating an age data set with no other details. Alternatively, we could try and process this column: for example, if we did take the approach of trying to assume genders based on these first names, we could do some high-level analysis here, report the difference in these age distributions between males and females, and then seek out further data to validate that. Redaction can also refer to removing only part of the data. For example, we could remove the last name of every name, but again, that could also be considered coarsening, which we'll be covering in the next webinar. Removing the house number in a street address, or adding those X's in the credit card example, could all be considered redaction, and arguably we could say we're merely redacting the precision of the data. Yet again, what we're seeing here is that these methods are quite vague; these categories don't really classify or add much to what disclosure control and synthetic data can be. It's purely a matter of semantics, and I wouldn't worry about it beyond using them as vocabulary to define what you're doing. So to conclude: redaction is a quick fix to remove protected and personally identifiable data. Masking can provide confidentiality, but further to this it can provide some context in the absence of that data, which I think is much more important, though it is a lot more technical, obviously. My suggestion would be to use the substitution method throughout data generation, and use a data generation library such as Faker if you're very good with Python, or Mockaroo if you're not very
good at any programming language; that's a nice web-based tool that we'll probably cover in the code demo, actually. And finally, use variance on any numerical or ordinal fields, but be very careful how much you add, because it will harm your analysis almost immediately. Next time on Synthetic Data we'll talk about coarsening, which is what we did with those addresses there: randomizing the house number, reducing the address to just "Manchester", things like that. We'll cover mimicking as a form of disclosure control as well; I'm not going to cover that now, we'll do it next webinar. And finally we'll look at simulation, which is the really cool, meaty part: actually generating new data from those distributions. If what you've heard interests you, there's some further reading here. The Data in Government blog is where all these categories of disclosure control come from, and it's really good; there's a Research AI blog post on the differences between data masking and simulation; and there's a blog post covering, I think, a master's student who made their own automated redaction tool, which is very good as well.
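The substitution-plus-small-variance recommendation above can be sketched as follows. This is a minimal illustration using only Python's standard library: the small name pool stands in for a Faker-style generator, and the field names and two-percent variance figure are my own illustrative choices, not values from the webinar.

```python
import random

# Stand-in for a Faker/Mockaroo-style name generator.
NAME_POOL = ["Jase Fallon", "Alex Smith", "Sam Patel", "Ria Khan"]

def mask_record(record, rng):
    """Substitute the protected name field and add a light variance to age.

    The variance is deliberately tiny (here +/-2%): as the talk warns,
    anything larger starts to distort age-based analysis.
    """
    masked = dict(record)
    masked["name"] = rng.choice(NAME_POOL)                           # substitution
    masked["age"] = round(record["age"] * rng.uniform(0.98, 1.02))   # light variance
    return masked

rng = random.Random(0)  # seeded so the masking run is reproducible/documentable
out = mask_record({"name": "Graphium Midden", "age": 40}, rng)
assert out["name"] in NAME_POOL
assert abs(out["age"] - 40) <= 1  # a 40-year-old stays within 39-41
```

Seeding the random generator is worth noting: it lets you document and reproduce exactly which masking run produced a released data set, in line with the "document all of your methods" advice above.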