Hello everybody and welcome to this UK Data Service webinar, part three of synthetic data, where we cover coarsening, mimicking and simulation. I am Joseph Allen, a research associate with the UK Data Service here at the University of Manchester, and thank you all for coming. So in our last webinar, we covered redaction, which is a method of disclosure control that basically involves removing entire rows, columns, or parts of the data, and why that might be appropriate. And also masking, again another method of disclosure control, where we sort of replace data with new and hopefully clearly faked data, but it doesn't have to be faked data. And in this webinar, we're going to finally get to touch on sort of the cooler parts of this topic. So we'll start by finishing off those disclosure control methods. We're going to cover something called coarsening and something called mimicking. And then we're going to dive into case studies on simulation. Then we will whiz through a spectrum of classifications for different types of synthetic data by the Office for National Statistics. And finally, we'll cover the actual process of how to synthesize a dataset, what tooling we should make use of, and things like that. So in this webinar, we're going to cover the following methods. Coarsening: this refers to reducing the precision of data to protect an individual. An example here might be to reduce somebody's birth date to just their current age, or maybe an age range, for example 20 to 29. The more we coarsen, the less correct or precise our analysis will become. Next, we have mimicking, which is generating a dataset that closely matches the real dataset without containing the same entries. This could be sampling from a dataset and adding some noise, for example. Then we have simulation. So this is finally data synthesis and synthetic data. Here we're going to build a model of what we know about a real dataset and use this to generate a new dataset, while dealing with the sensitive data separately if we want to. We don't necessarily have to do that. This obviously requires a very deep understanding of a dataset and its nuances, or at least some very confident machine learning skills if you think you can ignore the context of a dataset. As usual, I'm going to take an opportunity here just to point out how similar mimicking and simulation may seem. We could obviously combine and mix these methods together. There's sort of no strict classification in these senses. These are just us trying to make sense of what's quite a complicated topic. As usual, we're on the very fringes of an emerging topic of synthetic datasets, so these quirks have yet to be ironed out. So if you think any of these definitions might be a bit fluffy, don't worry, because I totally agree with you. And before we discuss these various methods, I'd like to introduce the synthetic data spectrum as provided by the Office for National Statistics. Apologies, I've just seen the question from Michael. This PowerPoint isn't up yet because I only just finished it yesterday, but it's on my to-do list to get it up almost immediately after this, so it'll be up shortly. But yeah, the general gist of this diagram is that as a dataset becomes more realistic, it becomes more risky to make use of. Also, as it becomes more realistic, it has more analytic value.
We've discussed these trade-offs in depth in the last two webinars. There are always going to be trade-offs between the precision and the safety of a dataset. It is only with the context of the dataset we are working with, and the skills, money, and time that we are willing to invest to develop a synthetic dataset, that we can really begin to make any decisions about where on the spectrum our data needs to be. But again, we're gonna get into the specific definitions of what a synthetically augmented replica dataset is and so on later. So now that we're three webinars in, I think it's worth going back to those initial definitions of what synthetic data and disclosure control are. So our definition of synthetic data at the moment is data generated by a computer simulation that does not directly measure a real-world event. In the ONS proposition, synthetic data likely does not contain any personal data, as it's based on a public understanding of the dataset structure and realistic guesses at the distributions. I largely disagree with that claim from the ONS, but that's what they go into. And so a lot of the themes in this talk will lean into that terminology a bit more than we have in the last two webinars. Let's also look again at the definition of disclosure control. So these are methods which protect the confidentiality of any subjects of research. Again, I somewhat disagree with this because I think disclosure control has potential far beyond research. It's not just sort of an academic definition. There's a much, much larger scope that's appropriate here. This ONS data spectrum doesn't cover dealing with disclosure control. Really, it does sort of imply that some methods are needed with different datasets, but I think these two topics are gonna move really tightly together, and I think they both work together to solve the problem of confidentiality in useful data. So next let's cover coarsening as a method of disclosure control. So coarsening can be defined as reducing the precision of data to protect individuals. So again, as in that example before, you might have an actual birthday, or even a birth time, in a dataset, but really age might be enough for your research question or whatever you are trying to do with the dataset. So say we have a research question: what gender are the individuals of Manchester? To some people, the postcode in that address there, M3 6GA, doesn't even obviously mean Manchester. We really don't need this level of precision. We certainly don't need a street name or a street number to tell if they're in Manchester. Obviously, this information is also quite dangerous. Even if we masked everything else about this individual, which I've for some reason chosen to do with my own name, we still appear to know the actual address of this individual. If this dataset contains criminal activity or certain health data, sexual attitudes data, finance data, we could be putting the person at this address in some real danger. So we really need to keep the notion of this location, but again, for our research question, to what extent do we actually need to keep that? So we have multiple coarsening options here. And again, it all depends on that research question. We need to choose a small enough area that if there were other records, they could sort of blend in together to protect each individual. But we also need enough granularity that we can answer our research question, compare with other cities and things like that.
So some options we have here: we could start out just by randomizing that street or apartment number. We could also argue that this is masking; we're replacing that number with another realistic number. We could remove the street number altogether. Why do we actually need the street number if we're only interested in a city? Here, this is coarsening: we've reduced the precision from a postcode with a street name and a street number to just a postcode and a street name. We could actually remove the street name altogether, just leaving us with a postcode like M3 6GA. We could remove parts of that postcode, keeping only the sort of top level of the postcode. Again, it's a little more complicated than that. We can't just take the first two characters because, in Manchester, I think the postcodes go up to something like M20, so reducing to only the first two characters would leave M2 and M20 being grouped together. So we need to understand a little bit of postcode logic there. It does get a bit more complicated. But at the end of the day, we could just remove the postcode entirely. We only care for this study that this participant is in the city of Manchester. So really, as long as we can process the postcodes in some regard, we don't need to worry about that complicated format. And obviously we could go further than this. We could aggregate by the county, country, continent or even planet. Maybe instead of Manchester, we say the North West, the United Kingdom, the British Isles, Europe, Earth. Obviously at some point this gets ridiculous, but it's only within the context of that research question that we can justify having this potentially dangerous address at all. So for the purposes of this study, all we need to know is which rows of our data are in Manchester and which rows aren't. There's only one row, and that one row is in Manchester. So now we can answer our research question, what gender are the individuals of Manchester? And we get 100% male. It's not a great study. It's not a great dataset, but we've somewhat protected this individual. We don't know their address now, which we would have done before. We could kind of make this research reproducible in that regard as well. In another quick coarsening example, let's assume we have a small dataset of student IDs and their corresponding weights and heights. Our goal with this dataset is going to be simply to answer: does weight correlate with height in the student population? What I've done here is rounded each of those weights to the nearest five kilograms and the heights to the nearest five centimetres. It's definitely better in terms of protecting the individuals, as they may know their own height to the precision that we had in the original dataset. But if the school or the class information is known, anyone with access to this data could still identify those students. For example, the lightest person in this class might recognize their own record, and that's student three, or similarly student four might know that they're the shortest in the class but not the lightest. So in a larger dataset, this sort of coarsening might be enough to protect each individual. And obviously it's unlikely that we're gonna have an individual class; we're probably gonna have a much larger dataset of all students in a school or all students in a county or something like that.
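Just to make that concrete, here's a minimal Python sketch of what this kind of coarsening might look like in code. Everything here, the column names, the values and the rounding choices, is invented for illustration; it's not from the webinar's materials, and real postcode handling needs more care than a simple split.

```python
import pandas as pd

# Hypothetical records; the columns and values are made up for illustration
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "postcode":   ["M3 6GA", "M20 4BX", "M14 5RB", "M3 1AB"],
    "weight_kg":  [62.3, 71.8, 48.9, 55.4],
    "height_cm":  [171.2, 180.6, 158.1, 149.7],
})

# Coarsen the postcode down to its outward code only ("M3", "M20", ...)
df["postcode_area"] = df["postcode"].str.split().str[0]
df = df.drop(columns=["postcode"])  # drop the precise value once coarsened

# Coarsen weight and height by rounding to the nearest 5 units
df["weight_kg"] = (df["weight_kg"] / 5).round() * 5
df["height_cm"] = (df["height_cm"] / 5).round() * 5

print(df)
```

The idea is simply that once the precise values have been replaced by coarser ones, the precise column is dropped so it can't leak back out.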
And when we plot this data, we can see the impact that even very subtle coarsening is already going to have on a line of best fit or a linear regression. We've lost precision. We can see the points in the coarsened dataset on the right are landing exactly on all those multiples of five kilograms and five centimetres. We can also see the gradient of that line has decreased very slightly. So yeah, our predictive power there is weakened. We can still at least claim that weight and height have a positive correlation, though it's not as strong as it was before. And depending on the algorithm we use, if we do go down sort of a machine learning route here, we could have really harmed our analysis. Certain algorithms might even start to predict only multiples of five, for example, for each of those weights and heights. Regardless, we can still answer our question, does weight correlate with height in the student population? And the answer is yes, yes it does. We can also aggregate columns or group similar rows where it makes sense to do so. So one way of doing this is taking means over columns. We could instead just broadcast the average weights and average heights for this class, or for each class in various different districts, and use that for useful comparisons. And it's also quite useful just to broadcast totals. We see the total number of vaccinations at the moment, the total number of people in the UK. These are all still very useful aggregations, and that's still coarsening because we're still reducing precision to get that metadata. Okay, great, so that's coarsening. Next, we're moving on to mimicking. Again, this is a method of disclosure control as outlined in that Data in Government blog that'll be in the links at the end. Mimicking is very closely related to simulation and sort of the traditional perspective we have of synthetic data, but it's much more reliant on real data. So mimicking can be defined as generating a dataset that closely matches the real dataset without containing the same entries. This could, for example, be sampling from a similar dataset and applying other disclosure control methods. Again, this is not data synthesis, perhaps, because we're making use of real data and piecing that together like a jigsaw rather than understanding its distributions and truly synthesizing something new. As one example, we could just mimic a row by adding a small number to all the ages or something; that would still be mimicking because it doesn't contain exactly the same entries. But I'd argue it's not really useful. There's not really a reason you would want to do that other than conforming to this definition. So in our last webinar, we had a dataset of students, their average grades and the number of sisters they each had. And with this dataset, we explored the question: does the number of sisters a student has influence their average grade? To mimic this dataset, we can basically repeatedly sample these rows, and we can add a random amount to each grade and each number of sisters. And through this process, similar to coarsening, we might lose a little bit of precision. So I've taken an aggregation of the mean number of sisters per grade, and we can see that there does seem to be a positive correlation between these variables. The average grade B student has two sisters and the average grade A student has nine sisters. Obviously, we've really weighted that with our student with a world record number of sisters. And what I've done here is I've sampled the entire dataset, I've just done it one row at a time. So we've got the exact same four rows in order.
And I've added some noise of up to plus or minus one grade and plus or minus one sister. So you can see, for example, student one has gone down a grade from B to C and they've also lost a sister, going from one to zero. I've kept these real student IDs just so we can see how the sampling works, but in a real application we'd need to re-index these quite clearly. Also, looking at that outlier student who had 16 sisters, this variance of plus or minus one sister hasn't really done much. They're still clearly the outlier, they're still clearly the student with the most sisters by far, and arguably that makes them identifiable if we know enough about this dataset. And when we compare the sampled values here, we've sort of flipped the relationship because of how powerful that outlier is. At the higher grades, there appears to be that correlation between the As and the A stars, but the student with the most sisters happened to have just dropped from an A grade to a B grade, and our student with two sisters got an A. So this has completely flipped the relationship, because of how powerful that outlier is. This raises a complicated question of how much randomness is too much. In this case, changing an entire grade can randomly flip our entire relationship. It's just too much. Maybe we need some sort of normal weighting, maybe we should have removed that outlier, things like that. So perhaps, instead, only adjusting the number of sisters would be good enough here. And now that we're talking about mimicking, we can do something that we couldn't really do with any of the other disclosure control methods, and that's that we can now grow our dataset. So nothing says we could only sample our four rows four times. We could really keep sampling this thousands of times if we wanted to. It might not necessarily be useful, but we can do it. So instead, we can continue sampling and we get sort of repeat student IDs, but we shuffle around those points using the randomness we're adding. And what that means is we end up getting clusters of similar students around the students in our existing dataset. And in a large dataset where we didn't only have four students, this might be quite useful. But we can see, for example, the second time we sampled our student two, the randomness just didn't change their record at all. It picked zero for both the average grade change and the number of sisters change. In a larger dataset, that might be okay, but what we've done here is just kept that exact data the same. The chance of this kind of identification has to be negligible, but it's always a case-by-case situation depending on the data. So this time I've oversampled the data and I've only added noise to the number of sisters and not the grades, just because that gave us a bit more wiggle room. We can see here what is essentially those clusters of random results around each data point we did have. Our relationship here looks somewhat clearer, but at the same time, all we've really done is take that relationship and exaggerate how real it looks. I would suggest, if you do want to do something like this, make sure you have a much larger dataset before you apply it. The purpose of mimicking here isn't to create more data and to oversample around our points, but to make that private data safer to use. Simulation, on the other hand, will try to solve both of these problems.
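Before we move on, here's a rough Python sketch of that sampling-plus-noise style of mimicking, including the oversampling idea. The tiny dataset, the column names and the noise range are all invented for illustration; they're not the webinar's actual data or code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical original data: grades encoded as numbers (2 = C, 3 = B, 4 = A, 5 = A*)
real = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "grade":      [3, 4, 5, 2],
    "sisters":    [1, 2, 16, 0],
})

# Mimic by sampling whole rows with replacement, then adding a little noise
n_synthetic = 20
mimic = (real.sample(n=n_synthetic, replace=True, random_state=42)
             .drop(columns=["student_id"])   # drop the real IDs
             .reset_index(drop=True))
mimic["sisters"] = (mimic["sisters"] + rng.integers(-1, 2, size=n_synthetic)).clip(lower=0)

# Fresh IDs so synthetic rows can't be traced back through the originals
mimic.insert(0, "synthetic_id", list(range(1, n_synthetic + 1)))
print(mimic.head())
```

Note that whatever outliers happen to be sampled come along for the ride, so a record like the 16-sister student can still dominate, or be missed entirely, exactly the issues described above.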
So that leads us on very nicely to simulation. I'm just gonna have some more water. So finally, three webinars in, we're finally talking about the creation of new data rather than these disclosure control methods. And what we're really talking about here is synthetic data and data synthesis. So simulation can be defined as generating part or all of a dataset that is similar in essential ways to the real data, but is different with regard to sensitive information. Yeah, I mean, again, I could argue that the sensitive information part of that isn't really necessarily part of the definition of simulation, outside of simulation as a method of disclosure control or data synthesis. But yeah, it's all very closely coupled. And again, I think it's really important that we highlight just how similar these definitions look and sound. The major difference to me here is that simulation is generating entirely new data with an understanding of the essential properties of the original data, whereas mimicking is pulling from the original data and adding something for the sake of confidentiality. In this small image on the screen now (small because the definitions are so big), both techniques could realistically generate the dataset on the right from the dataset on the left, but only the simulation method would technically give us synthetic data. Both retain the structure of the original data and are not the same data, but yeah, only the simulated one is newly synthesized data. So to quickly summarize some differences between simulation and mimicking, and again, I do think they're quite fluffy and quite close to each other: simulation can make use of basic random number generation, some simple heuristics, machine learning algorithms or data generation libraries and similar, whereas mimicking is essentially a sampling exercise, perhaps with some disclosure control methods added. The outcomes of simulation and mimicking might be indistinguishable without some really high level of analysis. We definitely wouldn't be able to see it with our eyes, really. In simulation, we could add in our own logic or intentionally not process explicitly personal data, whereas mimicking will simply sample those entire rows and, again, may add noise, but doesn't have to. Simulation in general is much more technical. It requires us to really understand the problem to simulate realistic data, whereas mimicking, again, is simply sampling that data, so it keeps the data's biases and perpetuates its assumptions, where with simulation we have a nice opportunity to try and fix some of them. It would be bold to claim we can always do so, but yeah, that's a difference. Simulation can also happen at a huge scale, and we can even intentionally over-represent our outliers through methods such as oversampling. With mimicking, on the other hand, we could statistically just miss our outliers entirely. Again, in that student dataset, we could sample a thousand times and we might never get that student with 16 sisters, and miss that outlier entirely. And then finally, a simulation can exist on its own. A simulation doesn't need the original one million row database to create a new record. Mimicking, on the other hand, does. Mimicking is really coupled with that dataset itself and therefore can't really be passed around outside of an organization. So now we've covered simulation and mimicking, let's simulate some dice rolls and see what the visual differences between mimicking and simulating might be. So I've collected a dataset of 100 dice rolls.
We would expect about 17 instances of each result from one to six in 100 rolls. Real life obviously has some randomness; dice don't play as nicely as they should do statistically. We can see five here, for example, is underrepresented with only 13 rolls out of 100, and four is overrepresented with a fifth of all of the rolls. And mathematically speaking, in those 100 rolls, we expect 16.6 recurring instances of each outcome on a six-sided dice. This uniform distribution looks nice and closely represents how we think a dice should behave, but obviously it might not behave this way, depending again on our sample size. Obviously we can't get 16.7 rolls of a dice. It might also be that a real dice isn't really as uniform as we think it is. The plastic that makes it up might not be uniformly distributed. There might be numbers carved out of the dice affecting the weight of the dice itself on each side. And we expect and simulate this uniform distribution because we think we know better than what the dice's data might tell us. A model is a system which takes an input or a number of inputs and predicts an output or a number of outputs. The uniform distribution is effectively our machine learning model, in quote marks, only there's no machine learning involved in this simple case. We're basically going to randomly sample this uniform distribution to make what we're calling predictions about dice rolls, but it's not machine learning. In the same regard, if we were trying to predict the outcome of coin tosses, we could toss a physical coin and use this as an effective model for the synthetic coin tosses we're creating. We can choose all sorts of different mathematical relationships to be the foundation of these models, and this is where the machine learning algorithms that we may know of can really help us. When we build a model, we abstract a real-world situation into something a computer can represent and recognize patterns from. If we simulate those hundred dice rolls six more times, we can see that general shape of the uniform distribution is still there, but there are simulations where we deviate from it by differing amounts. We can see that in simulation two, for example, one is hugely overrepresented, taking from the rolls of the other numbers. If we were to simulate from the dataset that we actually collected, we'd be perpetuating whatever bias that dataset has and the assumptions we made when we were collecting it. And if we mimicked that dataset, the same thing would happen. Whereas if we simulate or mimic that uniform distribution on the right, we're perpetuating our own bias about our understanding of the situation instead of what the dice, or the data, actually told us. And again, mimicking would make the same assumption. We also find that the more values we synthesize or sample, the closer we get to that underlying relationship. Here we've got three graphs showing what happened when I only rolled the dice 20 times, 100 times, and 1,000 times. You can see generally we're getting closer to that uniform distribution, but there's still a way to go before the deviation becomes negligible.
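As a rough idea of just how simple this kind of simulation can be, here's a minimal Python sketch of the uniform-distribution dice model. It's an illustration rather than the webinar's actual code; the function and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng()

def roll_dice(n_rolls, n_sides=6):
    """Simulate fair dice rolls by sampling a uniform distribution over the faces."""
    return rng.integers(1, n_sides + 1, size=n_rolls)

rolls = roll_dice(100)                        # 100 simulated six-sided rolls
counts = np.bincount(rolls, minlength=7)[1:]  # how often each face came up
print(dict(zip(range(1, 7), counts)))

# Switching to a seven-sided (or million-sided) dice is just a different argument
rolls_7 = roll_dice(100, n_sides=7)
```

The n_sides argument is the "one-line change" mentioned in a moment: a seven-sided or even a million-sided dice is the same simulation with a different number.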
Also, if we rolled a six-sided dice and it somehow rolled a zero or a seven, understandably, we'd be very confused. That's arguably impossible in the real world. But in the world of simulation, these edge cases can be very possible with some bad maths. Our random number generation might just be a little bit off, and we might manage to calculate or floor some numbers incorrectly and get zeros or sevens that we don't notice. This would immediately reveal that our simulation is not real data, so we have to check carefully for these outputs and make sure they all still appear realistic. Our simulated dice is also much more adaptable. If we realize we made a mistake and actually need to simulate a seven-sided dice instead of a six-sided dice, this is a one-line code change for us, pretty trivial as long as we know how to code. In comparison, without our simulation, the process of collecting this data involves finding an online retailer to source our dice from. Again, this is much harder if we've got a non-standard dice, like a seven-sided dice or a 777-sided dice, which probably doesn't exist on major retailers. We also need to budget for the purchase of this dice; maybe if we're within an organization, we need to take time to get that budget and get approval to buy this product. We then need to wait for the delivery from that online retailer, and then we need to dedicate time to rolling and re-rolling said dice ourselves. This is arguably days of passive work and maybe a couple of hours of active work, in comparison to what in our simulation was only a one-line code change to generate that data, and then we can immediately dedicate those days of effort to analyzing data instead. It's also possible, with this simulation, to simulate dice that simply don't exist in our world. But what about those huge dice, things like 1,000 or 1,000,000-sided dice? Often these are simulated through combinations of other dice, so you can roll a 10-sided dice three times and combine the answers as digits. But for our simulation, again, this is no more complicated than upgrading from that six-sided to a seven-sided dice; it's just a one-line code change. This is where, again, the data collection of the real-world situation is infeasible. We might not even be able to read a million-sided dice with our bare eyes, and it's simple enough to abstract with that uniform distribution. Next, we'll get into some other, more visual examples of simulation that aren't dice-based. So another form of simulation is agent-based modelling. In agent-based modelling, we simulate individual agents and we restrict the things they can do within an environment. As always, it's best to create some sort of abstraction of these individuals and keep things as simple as we can. One such example of this is the famous prisoner's dilemma, where two prisoners are given the option to shorten their sentence in exchange for betraying their partner in crime. If both prisoners talk, both will get longer sentences. If one talks and not the other, the one who talks will get a shorter sentence, but their partner will be condemned for longer. And if neither talk, both prisoners might benefit, or they might have no change to their situation. There's a fantastic website that visualizes this way better than I can and talks about the strategies different agents might adopt depending on the rules of the game. It's called The Evolution of Trust, and links to this, and to all the following simulations we're about to talk about, are in the resources slide at the end of this webinar.
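For the agent-based modelling idea, here's a toy Python sketch of a prisoner's dilemma simulation. The payoff numbers and the strategies are invented purely for illustration; they aren't taken from the webinar or from The Evolution of Trust.

```python
import random

# Years added to each prisoner's sentence for (my_choice, their_choice)
PAYOFF = {
    ("stay_silent", "stay_silent"): (1, 1),
    ("stay_silent", "betray"):      (5, 0),
    ("betray",      "stay_silent"): (0, 5),
    ("betray",      "betray"):      (3, 3),
}

def always_betray():
    return "betray"

def coin_flip():
    return random.choice(["betray", "stay_silent"])

# Simulate many rounds between two very simple agent strategies
totals = [0, 0]
for _ in range(1000):
    a, b = always_betray(), coin_flip()
    years_a, years_b = PAYOFF[(a, b)]
    totals[0] += years_a
    totals[1] += years_b

print("average years per round:", totals[0] / 1000, totals[1] / 1000)
```

Real agent-based models obviously get far richer than this, with memory, reputation and evolving strategies, which is exactly what The Evolution of Trust walks through.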
A company called AI.Reverie was looking for a way to classify airplanes from image data. In particular, they were curious whether there was a way to get more confidence classifying the rare aircraft that don't show up often in datasets. One simple method of mimicking here, for example, would be cutting and pasting existing planes into new locations in that image data. So on the right, you can see I've copied the blue plane from the left, and I've made it green so it's a bit easier to see, and then I've pasted it various times around the airport and made those ones red, again just so they're easier to see. We could also automatically tag these planes, because as we paste them in we know, one, that they're synthetic, and we also know that they're a duplicate of another plane that we did have tagged. So we have a reliable way of growing that training set. There are obviously more complicated problems here than simply cutting and pasting aeroplanes into an airport. Airplanes have shadows, which you can see here in these duplicates; they all have dark shadows underneath their wings. That's going to change depending on the time of day the photo was taken and the weather when the photo was taken. Planes obviously don't always point in the same direction, but the shadows will all fall the same way in a single photo, so I can't simply copy and paste the shadows with the plane. Planes also change direction; in this case, I've kept them facing mostly horizontal. We could also end up pasting planes into locations where planes aren't allowed. The plane can't be, well, hopefully isn't, on top of the terminal building, for example. We'd hope that planes never appear on top of other planes or about to crash into other planes, but these things do happen with naive pasting. This is where simulation would be a handier approach: we could fix all of these problems in a rendering engine and actually make realistic photos. And that's what AI.Reverie did. Another example is the famous Cityscapes dataset. If you Google 'Cityscapes dataset', you'll see this. It's a large dataset of around 25,000 images taken from the perspective of cars driving around Germany, with tagged, segmented objects and obstacles in those pictures, things like bikes, poles, traffic lights, trees and more. Obviously a lot of manual effort has gone into collecting and segmenting these, but to enhance this dataset, we need a system where we can not only automatically create new scenes and grab a huge number of new photos, but also reliably tag those scenes. Obviously there are thousands of hours of dashcam footage around the world, but it's not tagged; it would require that human intervention. But this tagging is already solved in video games with impressive graphics systems. When video games render scenes, they already have the materials and entities labelled for cars, people, trees and things like that, so labelling that data can actually be quite easy. Grand Theft Auto V is frequently used in research. In particular, Intel has recently released a study where they use training data from Grand Theft Auto V to improve their model, as it was easier than the manual collection and tagging of images like those in the Cityscapes dataset. And then using that, they can enhance the images: on the left is actual footage from the game, and on the right is a supposedly more photorealistic version of that same image.
These graphics aren't necessarily photorealistic, but in almost all computational models our data is not really going to be perfect going in, so this is good enough to use now. As this technology becomes more common, it's becoming essential in the planning and development of councils, cities and countries worldwide. New buildings, for example, can be added to the digital twin of the city of London, and immediately you can assess their impact on daylight, which other buildings' daylight they're affecting, how they'll affect pedestrian foot traffic and street traffic, and whether your building would be affected if the Thames flooded by an extra two metres, things like that. And these are things that previously we would have had to guess about, or know quite a lot about London, but now we can just add them to these models and see how the city changes. So, oh, very good, that was very quick. The poll question: if we created the record 1,001 Point Avenue from 66 Point Avenue, would you consider that to be coarsening, simulation or mimicking? All right, I'll give it another 10 seconds. I'm not sure if you can all see the results, so I'll read out the results as well. So we've got zero percent of you saying coarsening, which is good, 14% of you saying simulation and 86% of you saying mimicking. So it's not really coarsening because we aren't directly reducing the precision of anything. Perhaps we could argue that by using that 1,001, which is nonsensically large, we're flagging it as not real, but the street name is the same, so I think it's a bit of a stretch to call it that. Maybe if we went from 1,001 Point Avenue down to 66 Point Avenue, we could say it's reduced precision because we have fewer characters, but again, I think we're really stretching if we try and go for that. To me, simulation is one of the correct answers, because it makes sense to me that a model could realistically generate combinations of random numbers and street-name pairs, though these simulations may learn to use street numbers that statistically don't show up. For example, how many streets do you know that actually go up to 1,001? But for large apartment buildings or office buildings, this is fine. It could equally be mimicking, I would say. We don't really know. If the record sampled our data and added some noise, then it's mimicking; if it was created from some random number generator, it's simulation. I would argue that to get from 66 to 1,001 in the mimicking sense, you're adding a large amount of noise; you're doing something quite strange to the data to get such a big number. It could be considered a combination of sampled street addresses with sampled numbers or randomly generated numbers. But the only difference between simulation and mimicking here is whether it's sampling-based or whether it's synthesized. Again, it's purely a matter of semantics. I think both simulation and mimicking are correct, so everyone's got the correct answer in my eyes. I didn't really give you enough information, but these disclosure control methods don't really call for this classification in common use; it's just about our individual methods. Nice, good stuff. And finally, it's time to put the pieces together. Next, we're gonna walk through the Office for National Statistics synthetic data report. Here we can summarize how we determine what tools and types of synthetic data are appropriate. So we looked at this diagram before, but let's go into a lot more detail on it.
So there are explicit definitions of what these different types of synthetic data are, and they don't even refer to them all as synthetic datasets; they use this term "synthetically augmented" datasets as well. Which ones we need really depends on what the purpose of that dataset is. So in the ONS definition, synthetic datasets are only suitable for code testing, with no desire to preserve any underlying relationships between columns. The more we try to preserve the underlying distributions, the higher the disclosure risk of this data becomes, but also the more analytic value it has. It is expected that synthetic data does not have any disclosure risk. This is sort of the way that most people who haven't researched synthetic data academically perceive synthetic data to be, and that's why the ONS paper has this new term, synthetically augmented datasets, tacked on. Synthesis on its own obviously has no guarantee of confidentiality, and that's why the latter part of the spectrum is referred to by this new name. Again, it's very specific to this paper, so don't worry if that doesn't make sense. So to start with, we have structural synthetic data, which is data identical only in structure to the real dataset, fairly well named. It preserves only the format and data types of columns. Variable names all need to be identical to the real data. There's no noise added, there's no magic here, so if you put personal data in, it would still be exposed. And this is supposedly constructed only from available metadata. So we might have mean ages, lists of valid occupations, lists of valid cities, lists of valid genders, things like that. We might just know that a first name column exists and have to sort of impute that from other things or make use of some sort of data generation library. Because of this, there should be no disclosure risk, because there should be no private data. Next we have what's called synthetic data (valid), which preserves the format and data types exactly as the structural data did before. The addition here is that all rows in isolation should make sense, so we shouldn't have any conflicting information in a record. There shouldn't, for example, be any employed infants in the dataset unless there actually were employed infants in the original dataset. At the structural level we didn't apply any rules or logic to protect against this kind of thing; here we do. We draw our values from realistic sources, but we make no attempt to copy the actual distributions. So in a dataset based on Manchester, a 100 year old might be no rarer than a 10 year old; we might simply sample a uniform distribution instead of the actual distributions of the dataset. In this case, I've adjusted genders to match up with first names and cleared some occupations where it doesn't really make sense for there to be one. You might also encounter missing data or spelling mistakes, as in the real data. Our 12 year old Dana, for example, probably hasn't filled in an application which explains what her occupation is. Next, we have our first synthetically augmented dataset, called synthetically augmented (plausible) data. This data, similar to the last two, preserves the format and record-level plausibility as before, and replicates some distributions where it's trivial to do so. This is our first dataset based on the actual original data: values are generated based on observations of this dataset, but we're gonna add some disclosure control methods to make it less personal.
We also keep the frequency of any missing data or mistakes here. The synthetically augmented (multivariate plausible) data is identical to the previous form, but with the addition that we're now making an effort to replicate the relationships between variables, quite loosely. In the ONS paper, this is described as only being applied to high-level geographies, so it might be grouped at, like, a Manchester level or a West Midlands level or something like that. This means this dataset should have some inherent value from an analytic or machine learning perspective, but again, it's only a very low-level pass at data synthesis. Visually, this won't really look any different to our previous dataset, but we might notice, if we did some analysis, that the average age of the people in Manchester in our synthetic dataset is similar to the one in the original dataset. Next, we have synthetically augmented (multivariate detailed) data, similar to the previous level again, but now we're trying to match the full granularity of our data. Again, visually this won't really look any different, but now, instead of seeing similar relationships at the Manchester level, we might see that all of the male research associates in Manchester have similar age distributions to those they had in the original data. And finally, we have what's called synthetically augmented replica data. This data is like the real data in basically every way. It preserves the format, the structure, the distributions, the quality and the full granularity of that dataset. And the purpose here is to use this data instead of the real data, but it's very likely in the UK that if this data exists anywhere, you would only get access to it in a secure research facility. It's effectively as dangerous as the real data. Yeah, the ONS paper is in the resources at the end of these slides, which will be distributed, but I'll get to them shortly as well. Also, if anyone wants to Google 'ONS synthetic data', you'll probably find the blog post I'm talking about, or one of the facilitators can share the link if they're free. So again, the ONS expects that the data in this framework referred to as synthetic data is only going to be useful for testing code. I posit there are far greater uses than this, for example, showcasing a new open dataset, or hiring staff with data that has a familiar context. It doesn't just have to be for testing code. In these cases, specifically the code-testing one, you usually want data that can represent the extremes of a system. You want that 100-year-old user who's going to crash your API that's expecting a two-digit number for age in the database. Further to this, what they refer to as synthetically augmented datasets do have analytic value. But beyond analysis alone, good synthetic data and methods such as oversampling can really enhance the predictive power of a machine learning model. In these cases you want realistic representation; you don't want the edge cases unless those edge cases indicate something interesting analytically or are well represented in your dataset. There is obviously a little wiggle room between these two cases, and again, I think that's okay. It's not really a system I think needs to be classified to this degree, but we do need to conform to some case-by-case data protection standards. We have to remember what our purpose is for this synthetic data, and hence what form of detail and what form of generation is actually appropriate.
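To make the lower end of that spectrum a little more concrete, here's a hypothetical Python sketch of the difference between structural and valid rows. The column names, the value lists and the "no employed children" rule are all invented for illustration; a real implementation would be driven by the actual metadata of your dataset.

```python
import random

# Hypothetical metadata: lists of valid values, not drawn from any real dataset
CITIES = ["Manchester", "Leeds", "Glasgow"]
OCCUPATIONS = ["Research Associate", "Teacher", "Nurse", ""]

def structural_row():
    """Structural: the right columns and data types, no attempt at row-level sense."""
    return {
        "age": random.randint(0, 100),
        "city": random.choice(CITIES),
        "occupation": random.choice(OCCUPATIONS),
    }

def valid_row():
    """Valid: each row makes sense in isolation (e.g. no employed infants),
    but the distributions are still just uniform guesses."""
    row = structural_row()
    if row["age"] < 16:
        row["occupation"] = ""  # children shouldn't be given an occupation
    return row

print(structural_row())
print(valid_row())
```

Everything further along the spectrum, plausible, multivariate and replica data, needs the original data itself, which is why those levels carry disclosure risk while structural and valid data shouldn't, in principle.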
So we've got another poll to wrap up this ONS recap. I want to ask you all what you think you might use synthetic data for in the future. The options are: I don't think I will use it; I think I might use it to test some code at some point; I think I'll use it to make some work more reproducible; I think I'll use it to create some training data to enhance a model; or I think I'll use it to create a dataset that I could use to hire a PhD candidate or some staff. Wow. Cool, I'll give it 10 more seconds again. This is multiple choice as well, so if you think you'll use it for lots of different things, go for it. Let me check that link as well. Yeah, I think that's the right paper that's been sent in the link. Okay, so we've got, oh, it's all still moving. We've got 13% of you saying you don't think you will use synthetic data, and that's totally fine. I think it's quite fringe, but you might find there are some cases where you're already using it and aren't aware that you're doing so. We've got about 50% of you thinking you'll use it for testing code, 50% of you thinking you'll use it to make work reproducible, 60% of you expecting to use it for training data, which is very cool, and then only about 10% of you saying you'll use it to hire staff, which I think is one of the nicest uses for it, because it's always nice to have data in the context you're about to work in, but it's not very academic really. Cool, good stuff. And finally, we'll talk about some of the different commonly used tools we have at our disposal. Again, that ONS link at the end of this presentation, which has now been shared in the chat, has an amazing comparison table between some Python and R libraries that will really outline the features you're looking for. So there are web-based tools such as Mockaroo, which have come a really long way in making a basic level of fake data creation available to users with no programming knowledge. Again, I use 'synthetic data' intentionally to refer to the creation of new data, whereas 'fake data' is used more for generating fake identifiers rather than synthesizing identifiers from a dataset. So this isn't necessarily data synthesis, but it depends where your definition lies in that regard. You don't need any programming knowledge to use these web-based tools. Standard variables already exist: you can get names, emails, genders, cities, NHS numbers, car licence plates, postcodes, all sorts of cool stuff. You can add a percentage to make sure that some of the data is corrupted or blank, and you can even represent quite complicated mathematical relationships, such as drawing from a normal distribution. For example, if we had ages, we could say the average age should be 35 and it will generate a normal distribution around 35. But as we become more reliant on custom logic, and on combining the cells with that custom logic, you might realize that a programming solution is better suited to what you want to do. We also have Faker, again for making that sort of fake data; this is a Python package. The primary purpose of this package is to generate realistic data to test software. As such, it prefers to generate the sort of structural data we just talked about rather than valid data. It can create, again, fake identifiers, names, addresses, emails. It doesn't have the depth Mockaroo has; I don't think it does postcodes, licence plates and things like that. But it does offer multiple language support. You can have names generated in different languages, which is very good for testing systems based in the UK that might not expect some Japanese or Chinese to be written.
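As a very quick illustration of the sort of thing Faker does, here's a minimal Python snippet. The locales and providers shown here are just common examples, not a list taken from the webinar.

```python
from faker import Faker

fake = Faker("en_GB")      # a UK-flavoured generator
fake_jp = Faker("ja_JP")   # and a Japanese one, for multi-language testing

# Purely structural fake identifiers: realistic-looking, but not tied to real people
for _ in range(3):
    print(fake.name(), "|", fake.email(), "|", fake.job())

# Names in another language and script, handy for stress-testing UK-centric systems
print(fake_jp.name())
```

Because each value is generated independently, nothing in the output is related to anything else, which is exactly why this sits at the structural end of the spectrum.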
On the surface, the addresses Faker generates might appear valid; they conform to address rules, so they have valid postcode formats, but they don't actually correspond to any real addresses in the UK. So if you do want to visualize that data, Faker's probably not the right choice, because you won't be able to visualize it on a UK map. There's also no way to preserve relationships between variables, which is something we need if we're gonna do any analysis or machine learning. It could, though, be a very nice way to add non-identifiable data to the outputs from a more dangerous synthesis, to mask out those identifiers but still make it look quite realistic. And finally, we have synthpop, which allows users to synthesize data from an existing dataset. Values are populated using sampling and then replacement, I guess we'd call that masking, and eventually machine learning algorithms are applied to maintain the relationships between fields. We can even control which algorithms are used. There are built-in disclosure control methods to remove unique records in the synthetic dataset that replicate real records in the original dataset. And as with most machine learning, we're prone to overfitting on this data, and as the number of variables increases, the training gets harder and the accuracy weaker. Finally, there's really limited functionality for dealing with any free text. We can't add any noise to free text because generally we don't understand the context of that free text. synthpop won't generate you new realistic names for your name column, or tweets for your tweets column, things like that. That's a bigger problem than the data synthesis that's been solved so far. If you're confident with R, I would give this one a go. This is probably the one we're gonna look at quite a lot in the code demo in two weeks. There are also some similar packages, simPop and sms, but they're all quite R-heavy. And again, if you're not sure what software is the right fit for you, that ONS paper has provided a flow chart that they use to make these tech decisions for us. So to summarize this quite quickly: if you wanna generate a dataset to use in testing, showcasing or recruitment, you should probably use Mockaroo or Faker. If you need to preserve the relationships and make use of some sensitive data, you should probably use those advanced R packages, synthpop, simPop and sms. And if you want anything more complicated than that, you'll probably have to make your own custom solution, joining up these options into some sort of larger data pipeline using the more advanced tooling. And finally, we'll break down the process of how we synthesize a dataset. So before we get started, we need a source of data, or at least an understanding of the structure and metadata around a dataset. We also need to determine, or maybe state, the purpose of that synthesis. If we know we're only going to use it to test our code, we can cut a lot of corners. Next, it's time for some exploratory data analysis. Dive into your dataset and assess the quality of your data. Get an understanding of what this data shows and how we can make use of it. And you might find that you only really need a few columns to make the most use of this data. I would say simplifying before synthesis is gonna save you the most work.
Select a small number of variables that seem to have the highest predictive power. You may find close correlation between some of the other variables. For example, you might find that age correlates very tightly with birth date, or somewhat loosely with marital status; maybe we could get away with only trying to simulate age. If there's a lot of missing data, could that missing data itself be a very powerful feature? Could we impute around that missing data and ignore that it's missing at all? And at this point, we should have a dataset we're ready to hand to a synthesis model. So again, I would suggest synthpop; it looks pretty fun. And finally, we just need to check that the results of that model are coming out looking okay. Do those synthesized rows appear to match the structure of the original data? Have the correlations between those values remained similar? Do you think it's good enough for your use case? Could we add noise or disclosure control methods to bring us closer to a de-identified dataset that maybe we could deposit with the data provider or publish? And that's it. So to conclude these webinars, we've seen that coarsening is a practical way of reducing the granularity of a dataset, especially with regard to geographic data. Mimicking is a great and easy way to increase the size of our dataset, but we're essentially creating random clusters around existing and personal data. Simulation is the method of creating genuinely new data, and it sort of solves the problems of mimicking, but it does require much more technical skill and time investment. The better our synthetic data, the more analytic value it has, but also the riskier that data becomes. If you aren't worried about personal data in your dataset, or need something to test code or showcase, use Mockaroo or Faker; if you need data that's useful from an analytic or predictive perspective, try to make use of R and its amazing data generation libraries. Again, synthpop looks very cool to me; that's probably what we'll be looking at in two weeks. There's gonna be a break between now and the code demos. There isn't a webinar next week, but come back in two weeks and we'll actually try and synthesize, I think, some sexual attitudes survey data. And after this webinar ends, you should see a prompt to leave some feedback. Please go through this short survey; it helps us make better webinars for you. And I have two further readings, as I've been saying the whole time. So the Data in Government blog is where we get all these classifications for disclosure control, though they didn't call it by that term. It's very good and entry-level. And then the ONS synthetic data pilot is sort of the level of depth I wish I'd seen from the synthetic data research I was doing earlier on. It basically inspired this whole webinar and the code demo. There are also links for all those simulation demos I discussed. So RarePlanes is the airplanes dataset, Enhancing Photorealism Enhancement is the one that makes use of video game engines to get new datasets, and The Evolution of Trust is a really lovely walk through the prisoner's dilemma with lots of cool visualizations. And then there are links for the tooling, Mockaroo, synthpop, and Faker, but these will all be on the UK Data Service website anyway. Thank you very much, everyone. Please feel free to start asking any questions. I don't know if you have been asking questions all along, but I'll have a look through the Q&A tab shortly.
If you have any thoughts that come after this talk, or if you're watching the recording on YouTube, feel free to message me on Twitter at josephallen1234 or email me at joseph.allen@manchester.ac.uk.