Hello, everyone, and welcome to this UK Data Service webinar, What is Synthetic Data? I'm Joseph Allen, a research associate with the UK Data Service at the University of Manchester, and thank you all very much for coming. In this introductory webinar, we'll be covering some of the core concepts around synthetic data. We'll ask ourselves, what is synthetic data? We'll ask why we should even bother making synthetic data. We'll cover some examples, benefits and purposes of synthetic data, and discuss the sort of features that make those different purposes and forms of synthetic data slightly different.

Beyond today's webinar, there are three more sessions in the series. Next week, we'll be running a session on masking and redaction, which are two broad categories of disclosure control. In the following week, we'll be covering coarsening and mimicking as disclosure control methods, and then we'll dive into simulation, which is sort of the meaty, fun part of synthetic data. And then two weeks after that, we'll be doing a code demo where we hopefully try and do some data synthesis ourselves.

As a definition, synthetic data is data that is created so that it no longer measures, but still represents, a real-world observation. It may retain qualities of the real-world measurements, but it doesn't really have to. This definition might sound very vague, and that's because there is a lot of variety in the methods of synthetic data and the uses of synthetic data. We could argue that making up random numbers in our head and writing them down is technically data synthesis, but it might not be as useful as a huge machine learning model which generates realistic dice rolls thousands of times a second. And particularly in the world of machine learning, the term synthetic data specifically refers to data created by these models, and that synthetic data is also used to improve the training of models. So there are lots of fluffy, different definitions of these terms, but we'll sort of cover that throughout this webinar.

Again, something very similar: disclosure control. It's related to synthetic data, but it's not synthetic data, and I think it's really useful that we define this upfront as well. Disclosure control refers to methods that allow us to protect the confidentiality of the subjects of research. Some simple examples of this might be just removing the first names from a dataset, adding noise to ages, weights, or any other numerical value, or it might even involve sampling the real data. So there doesn't actually have to be anything in the method itself that enhances confidentiality; it's all about the way we apply it. It's up to the researchers to determine which outputs of these processes count as safe or unsafe, and as such, there are no hard rules and the outcomes of this process are quite inconsistent. We could argue that synthetic data is kind of the solution to the problem disclosure control is trying to address.

Another useful definition of synthetic data is simply any data generated by a computer simulation. Through this definition, we could separate it out from what are considered disclosure control methods, such as removing a column, sampling, and more. We could also get a little bit more philosophical with definitions like this. We could argue that our brain is doing this simulation, and that imagining dice rolls and shouting out numbers does count as synthetic data. And in what way is that even different from sampling? In that regard, it would be disclosure control.
So yeah, it's all very strange, and we really are on the fringes of the data world when we talk about synthetic data. In an academic sense, the term synthesis specifically refers to the creation of some data. So again, any reference to removing, replacing, or reducing the precision of data is not, academically speaking, synthetic data. These methods of disclosure control grant very similar benefits, though, and there's such a large thematic overlap that we cover them in quite a lot of detail.

But let's move on to some examples. A common example that you might have heard of is called Lorem Ipsum text. Lorem Ipsum is a placeholder text commonly used in design to allow designers to step away from the meaning of text while focusing on visuals. It's just a repeating string of Latin text that obviously does have a meaning in Latin, but for our purposes the meaning is not important at all; it's not the primary purpose. Photoshop and other graphic design tools have plugins that will generate this text and visual elements for you. So all it is is the creation of something, and we're kind of misusing it, so by our account it is synthetic data.

You may have heard of generative adversarial networks, just referred to as GANs as well. They're very popular in the media at the moment, and they're famous for their ability to make realistic photos of synthetic celebrities, baseball players, all sorts of weird stuff. But they're a really nice visual example of data synthesis. I'm sure if you look at these pictures you might think you recognize some of the celebrities, but they should only be inspired by them. I can definitely see Nick Cave, for example.

Another example of synthetic data would just be any random number generation. We can ask questions about just how random this generation is, and whether that is fit for our purpose or whether it's adding some strangeness. For example, imagine using one of those random number generators to simulate the rolling of a six-sided dice. Well, it's difficult to exactly represent the true distribution of a dice roll, but a uniform approximation might be fair. There may be a small statistical impact from the slight weight differences that come from the carving of numbers on the different sides of the dice, but really we can abstract this away; we don't really care about that complexity for our simulation. Reasonably, though, we'd be concerned if I rolled the synthetic dice and it came up with a zero or a seven when we're expecting it to be six-sided. We would quickly lose trust in that synthesis, and we would also be able to tell that it was synthetic data. And if we can tell, machine learning can often tell too. So even in this example, as simple as it is, just rolling dice, we could make mistakes, and it could be inadequate for the needs of our synthetic dice as well. There's a tiny code sketch of this dice example just below.

Next we ask, why might a researcher even want to bother producing synthetic data? We have a problem: it's difficult to make data available for research and simultaneously protect the privacy of the people that data represents. For example, we might have records on crimes committed by individuals, or medical records over a 10-year period. Researchers need to create and use data that does not represent real observations as a legitimate part of their research, increasingly so now that some journals are requiring open data sets for papers to even be published.
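As a tiny illustration of that dice example, here is a minimal sketch using Python's standard library. This is just for illustration, not code from the later demo in this series:

```python
import random
from collections import Counter

def roll_synthetic_die(n_rolls: int) -> list:
    """Simulate rolls of a fair six-sided die with a uniform distribution."""
    return [random.randint(1, 6) for _ in range(n_rolls)]

rolls = roll_synthetic_die(10_000)

# A quick sanity check: every value should sit between 1 and 6, and the
# counts should be roughly uniform. A 0 or a 7 here would immediately
# give the synthesis away, as described above.
assert all(1 <= r <= 6 for r in rolls)
print(Counter(rolls))
```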
This synthetic or fake data can be entirely artificially generated, or it can be based on real observations that have been manipulated or processed in various ways. This means that synthetic data can mimic sensitive data but without holding those sensitivities, I suppose, when it's done right. Some purposes we might have for synthetic data include improving a machine learning model, creating data to be analyzed by a third party, creating data to be used in a recruitment tech test, creating data to showcase and market a data set, or creating data to test an existing system. But we'll cover these in a lot more detail later on.

Data synthesis is not yet a trivial problem, for a number of reasons. A data set may have non-linear relationships that connect fields. For example, if we stored the age, weight and height of some children, it's not realistic to simply sample these values individually; there is a correlation between those fields, and if we sampled individually, we may end up with very tall, unrealistically underweight synthetic children. We would recognize that, and so would a model. So as a data set becomes more complicated, the process of synthesis becomes more complicated in tandem. It's much easier to simulate that dice roll than it is to create these realistic synthetic individuals.

You also need to understand the confidentiality concerns associated with your data. GDPR protects some fields. Personal data refers to any data which relates to a living individual who can be identified. A non-exhaustive list of data that you shouldn't be sharing includes names, identification numbers, location data and online identifiers. But there are always exceptions, and it's your responsibility as the researcher to understand and showcase those. Further to this, we have special category data, defined as any data consisting of race or ethnic origin, politics, religion or philosophy, trade union membership, genetics or biometric data, health, and sex life or sexual orientation. Finally, we also need to be aware of indirect identifiers. For example, in many regions a postcode can very accurately predict race and religion, so depending on how we open up this data, these individuals may still need protection. Again, as synthetic data does not represent real individuals, we could benefit from simpler confidentiality concerns here.

It's important to be aware of the legal implications of synthetic data as well. Depending on the data you make use of, you may not have the right to distribute that data, synthetic or not. Consider whether you can legally share any data you have created, and the impact of that data. If in doubt, communicate with your data provider about the data you've created. The responsibility may fall on them to verify and distribute your methods.

And finally, we need to watch out for any domain-based complexity. Different organizations have very different views on representing gender in databases, as one example. They might use a column, is male, with a true or false value. They might store an M for male and an F for female. They might have an O for other, or they might have all sorts of other things. As we get deeper into the complexities of an individual dataset, we see a lot of assumptions are made that are often unique to an organization, and maybe even unique to one dataset within that organization. Data in one dataset may become dangerous when combined with another.
We might have an ID value that is seemingly harmless, but when joined with transaction data, might expose an address or payment details. And without understanding the organization and the quirks of these datasets, we're likely to make mistakes. We need to understand this domain well before we can ethically process and synthesize any of its data. A solution here is communicating well with those domain experts and seeking out documentation if it exists. But with the above difficulties in consideration, there is not yet a one-size-fits-all solution to data synthesis. We need to ask ourselves if the benefits are even worth the time and cost.

There are some questions to ask ourselves before we even engage in data synthesis. First, are there any relationships between our variables? For example, is it realistic that a user called Joe Allen might have the email AndyDeFriends@gmail.com? If we have data on individual weights and heights, should they correlate? This all depends on the purpose of our synthesis, and for machine learning purposes the answer to those questions is almost always yes, we need correlation between variables. Next, should we mimic the real-world problems or try to fix them? There are countless news stories about models giving harsher criminal punishments based on race or postcode (again, indirect features), and even the recent GCSE and A-level results used location as a factor in those grade predictions. So should your synthetic data try to perpetuate those historic biases or try to fix them? And I guess it's bold to think that we understand all the biases we have now as well. And finally, should we apply some methods of disclosure control? For machine learning purposes, using personal data can sometimes be justified, but if we're hoping to make that data open or verifiable, the data needs to become non-identifiable. There needs to be an almost negligible risk of re-identification from that data.

We should also ask ourselves, is there any benefit to this data synthesis? One benefit is that synthetic data can be made open, making your work verifiable and extendable with ease, and I suppose criticizable as well. The process of generating synthetic data itself could be an excellent exercise in understanding the problem. Generally, with more data, models improve in accuracy; this is only sometimes true, but generally it is. This means well-utilized synthetic data could improve your model. Many online competitions have been won using synthetic data techniques, because they give entrants access to a larger pool of data. Analysis of synthetic data might be a useful trial task before committing to hiring an employee or a PhD candidate to work with the full data set, and this ensures that they actually have the needed skill set and that they're actually interested in the data itself. Synthetic data might be the most cost-effective way of acquiring more data. And finally, safety: if you're investigating airplane crashes or cancer cases, it's not really feasible for us to wait for more to occur.

To protect individuals, organizations protect their data. If there is no data on real individuals in our data set, we may not need to be concerned about privacy. There has been an increase in black box attacks on machine learning models. This is where attackers try to pass in data to reveal information about the individuals the model was trained on. This means that if a model was trained on synthetic data, we would only be exposing these synthetic individuals.
And since it doesn't represent a real individual, these attacks would hopefully be meaningless.

Anonymization is, again, heavily context dependent, as this entire webinar suggests. Changing the name Joseph to something like Bosif, or to the uppercase version of Joseph, isn't really going to do anything to protect my data. We can still understand it, and we still know who it relates to, especially if everything else in that same record highlights who I am; for example, my email still says my name. So we've not really saved me there. Instead, a better technique might be to replace or remove these names altogether. So we might mask the first name Joseph with Andy, and we might mask the last name Allen with Shiki. But as I said, even though my name, Joseph Allen, has now been completely wiped from the two columns that you would think hold my personal data, my email still exposes my record; it still identifies me uniquely. So what we should do here is, again, make use of those newly synthesized or masked first and last names and generate a new, realistic email address. Saying so is trivial, and we could do this quite easily in Python, but that's an expectation I'm placing on you all that might not be true, so it does become slightly more complicated already. There's a small sketch of this kind of masking below. And then finally, we coarsen our address. This involves reducing the granularity to the first two characters of my postcode. Normally that's enough to get a decent-sized area, and especially for a Manchester postcode, that's enough.

So we can change this data in many ways, as you've seen here, to protect the anonymity of individuals in a dataset. One method is replacing those common details like emails, names, and addresses. But depending on your work, your data may become less useful without accurate addresses. In that case, you have a case to make that you need those addresses and that your research won't make sense if those addresses are coarsened to such a high degree. If we generate a new name, then we are creating synthetic data, but in this case it's likely that we sampled that name from our minds, or from a data generation library, or from a list of names online. So this is a method of disclosure control and not data synthesis, though I think this is where the lines get quite blurred between them. When we coarsen the address, that specifically is disclosure control, because we're reducing the precision of that data to protect the anonymity of that individual.

There are fears that synthetic data is not GDPR compliant. According to Recital 26, if data does not contain personal data, then it complies with GDPR, even for statistical and research purposes. It is further stated that pseudonymization is appropriate protection for data; this is a process of replacing all identifiable information with artificial identifiers. It is of course possible that our data synthesis is simply a model which happens to replicate real data. In that case, obviously, we're not in compliance with GDPR. So just because we're using synthetic data doesn't give us a golden ticket to use that data. On the other hand, our model could take that good data and just output random noise. That would be compliant with GDPR, but it wouldn't be useful to us. So it's these shades in between each method where we need risk assessments. It's not enough for us as the researchers to say, my data is anonymous, please trust me. We must document our approach and verify our results to protect ourselves.
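Going back to that masking example, here is a minimal sketch of the idea in Python. The record, the replacement names, and the email domain are all made up for illustration; this isn't code from the webinar itself:

```python
record = {
    "first_name": "Joseph",
    "last_name": "Allen",
    "email": "joseph.allen@example.com",
    "postcode": "M13 9PL",
}

def mask_record(record: dict, new_first: str, new_last: str) -> dict:
    """Mask the names, rebuild the email from the masked names,
    and coarsen the postcode to its first two characters."""
    masked = dict(record)
    masked["first_name"] = new_first
    masked["last_name"] = new_last
    # Regenerate the email so it no longer leaks the original name.
    masked["email"] = f"{new_first.lower()}.{new_last.lower()}@example.com"
    # Coarsening: keep only the start of the postcode, e.g. "M1".
    masked["postcode"] = record["postcode"][:2]
    return masked

print(mask_record(record, "Andy", "Shiki"))
```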
As long as we can show that, to the best of our knowledge, this data is anonymous, then we should be okay.

Generating more data could be an effective way to improve the results of your model, and data augmentation is one successful method. It's very easy to see when applied to image recognition. Data augmentation is defined as techniques used to increase the amount of data by adding slightly modified copies of already existing data, or newly created synthetic data derived from existing data. So for example, instead of having 100 photos of handwritten digits, we could shift their position, rotate them, shear them slightly, change their colors, and combine all of these different methods to create thousands of realistic images. Again, that's not quite synthesis there, we're taking what we have and moving it; synthesis would be creating new characters. But these techniques are mature enough now that they're often built into popular machine learning libraries. I know Keras definitely has data augmentation functionality just for images.

Again, in cases such as car crashes, medical scans or fraudulent transactions, the obvious application of machine learning is to try and predict those negative outcomes before they happen. Also in these cases, the majority of the data is hopefully on normal driving, healthy scans and real transactions. So generally, we won't be able to quickly discover more of these genuine cases. We might be able to buy large data sets on these, but it might be expensive, and the format might be incompatible with what we've done so far, so we would really have to invest time in correcting them. In many dangerous cases, as I said before, we don't have the luxury of waiting for more of these events to occur. And this is where synthetic data can help, enabling us to begin to solve these problems.

Whilst data synthesis can be very expensive, sometimes it's the most cost-effective way of collecting more data with limited resources. So for example, imagine you wanted to build your own realistic backgammon bot. How would you collect data to train this bot? You could hire players, record the games using a camera, and personally convert the moves into backgammon notation; you might make mistakes in that transcription, but then again you might not. You could play the games yourself, inputting only your own play style, and again dedicate many hours to transcribing games; again, you might make some mistakes here. You could pay for a huge data set of backgammon games and just hope that, because you paid for it, it's of good quality. Alternatively, the rules, the notation and even backgammon engines already exist, and we can make use of these to create synthetic games. This might be the most cost- and time-effective option, and you could now generate as much data as you can realistically afford to store. Would these synthetic games be any less valid than the real games for your purpose? I mean, we didn't really say what the purpose was, but if we needed to mimic human delays, if we were trying to make something that seemed like a human player, maybe we do need human delays and human mistakes instead of these backgammon engines. And this again is where the context of your problem is a key part of the synthesis techniques and tooling that you'll have to use.

When organizations require a sign-up process, paperwork or permission, any sort of larger commitment, they can often scare users away.
When we make data open, it should be as easy to access as we can allow, whilst also protecting the individuals within. If your data is safe enough to be shared publicly, you can open your work up to other academics, industry leaders, third parties, all sorts of people, and together we can crowdsource better answers to the same problems. Thanks to synthetic data, even your sensitive data sets and corresponding analysis and models could be made reproducible, criticizable and, more importantly, extendable. And again, I must reiterate the warning from before: it's important that you always follow the rules of your data provider. If you aren't sure what those rules are, you can contact support@ukdataservice.ac.uk.

As an example, let's say you're all my students. Over this school year, you're going to sit four tests, they're all out of 100, and your final grade will just be determined from the sum of these results. I could do this by hand, but I teach a lot of students (there are 46 people here), so that might be too much work for me. This repetitive task, to me, calls for a programming approach. We'll look at code only for two slides here, so don't worry if you don't know any Python or any programming; but if you do know Python, or you'd like to know more Python, make sure you sign up for the code demo that comes with this series. Personally, my language of choice is Python, especially for anything data related.

I could start by writing a function to sum these values and assume it's correct. But there is this newer notion called test-driven development, where testing is the priority and you write tests before you even write the code. I have a function here that expects a data frame of scores from zero to 100 for these test results, and it creates a new column, sum. This column finds its value by summing up the values in the other columns of the same record. An axis=1, while not intuitive, basically means sum along the rows instead of summing down the columns; if it was axis=0, that would mean sum the columns, but that's not what we want here. Then finally, I'll just run df.head. df stands for data frame, and head refers to getting the top of the data frame; by default it will return five rows. So these sums look good, but they require me to manually make up valid data and put it in the correct format, and I probably won't test it again until I get to the end of the school year and put everyone's data into a spreadsheet or something, unless I have synthetic data.

So instead of just assuming that that function works, and hoping that it's going to work when I get to the end of the school year, let's write something that generates data in the format that we expect, to test the function. Again, we're writing in Python, so I'm importing two libraries here: numpy, which adds support for large arrays and matrices and a bunch of more complicated maths functions, and pandas, which adds support for data manipulation. These two libraries encompass most of the Python data world. Next I'll create a variable df, as we used on the page before. It's short for data frame, which might be familiar to you if you use R or Python or anything like that; basically it's a glorified table with some extra functionality. So we're creating that data frame, we're filling it with random integers from zero to 100, defining a size, and then I'm creating the column headings a, b, c, d.
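The slides themselves aren't reproduced in this transcript, but a minimal reconstruction of the kind of code being described might look like the following. The function name add_total is just an illustrative choice; everything else follows the description above:

```python
import numpy as np
import pandas as pd

# Synthetic test scores: 46 students, four tests, each marked out of 100.
df = pd.DataFrame(
    np.random.randint(0, 101, size=(46, 4)),
    columns=["a", "b", "c", "d"],
)

def add_total(scores: pd.DataFrame) -> pd.DataFrame:
    """Add a 'sum' column holding each student's final grade.

    axis=1 means sum along the row (one student's four tests);
    axis=0 would instead sum down each column.
    """
    scores["sum"] = scores.sum(axis=1)
    return scores

df = add_total(df)
print(df.head())  # head() returns the first five rows by default
```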
In the real world, you might expect that some exams are slightly more difficult than others and the average will be brought down. We would probably expect that some students, on average, are better or worse than other students, and we would expect to see that consistency; but in the synthesis, we've lost that notion. We're happy with the uniform distribution that kind of looks random. We could dedicate more resources to generating meaningful synthetic students: good students, bad students, or maybe I should say bad-at-exams students instead of bad students. But my only purpose here is testing that sum function, and for that this is more than adequate.

Right, we've seen loads of benefits there. I'm obviously trying to sell synthetic data to you, but there are reasons that we might not want to make synthetic data, and they're often equally valid. First of all, there's a lack of understanding about what synthetic data is: there is worry that synthetic data still contains personal information, and it's possible that it could, depending on how you do it. So the same safeguards still apply regardless of the source. But if you do it correctly, that's not the case.

Secondly, there's a concern about privacy breaches. If the synthetic data was stolen, for example if it's internal to an organization or to a particular university's research, you would now be responsible for any identifiable individuals in that data if your synthesis wasn't done in a way that protected them. Good documentation of your methods, and hopefully the fact that it's quite difficult for outsiders to understand what that data is even referring to, should help us, but it's something to be aware of still.

Thirdly, there's a concern about controlling narratives. If, for example, we work for an organization where we are aware that our data may make it look like particular employees are sexist, or that there are known issues within our data that the media may spin or even truthfully represent, it may be in an organization's interest to claim privacy concerns to protect that organization. But again, as I've said, when it's done correctly, there shouldn't really be privacy concerns.

Next, there's a concern about synthetic data being misinterpreted as real. If we have a well-documented dataset with the same name, the same distributions and very realistic-looking individuals represented within it, it could be a problem. Even internally, we could mix up datasets; we could mix up our synthetic customers with our real customers. A solution here is a very clear separation between that real and synthetic data: go as far as renaming all the files to synthetic_transactions.csv, prepending all the columns with synthetic or test, or something like that (there's a tiny sketch of this below). And if we do generate individuals, you can give them clearly fake names like Mr. or Mrs. Fake Person, or whatever helps. If you release it publicly with these clear flags, it can be a bit easier; but without them, somebody could claim to have real data and we wouldn't really know without someone internally looking at it. In the first synthetic dataset produced by SYLLS, all columns were prepended with SYN, short for synthetic, and the data itself is actually watermarked with a synthetic data message, which makes it very clear in Excel, although that wouldn't really apply if we were opening it in Python or R or anything like that. So it's a good way to make it very obvious what you've got there.
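As a small, hypothetical sketch of that separation idea (the column names and file name below are made up; pandas' add_prefix just happens to make the column part a one-liner):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2], "amount": [9.99, 20.00]})

# Flag every column and the file itself as synthetic, so this data
# can never be mistaken for the real thing.
synthetic_df = df.add_prefix("synthetic_")
synthetic_df.to_csv("synthetic_transactions.csv", index=False)
```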
And then finally, it's okay to admit that we might lack resources. We don't have to be good enough to create synthetic data, and we don't have to have a purpose for it; it might just be out of our remit. Data synthesis, as we've shown, grows in complexity with the complexity of our data. And as we've seen, simulating a dice roll might appear trivial, but generating a cohesive city with lifelike individuals is going to be a huge, year-long technical undertaking.

Next, let's look at the features we might want from synthetic data. Anonymity: this refers to whether there are still methods by which we could discover information on individuals; the use of real names, addresses, or other protected features here is problematic. We have to be careful when training a model that we aren't exposing our outliers in particular. Next, we have structure. We ask, is that data in the same shape as the real data? Would it be useful to redact certain columns? For example, our models probably don't actually need the first name field. Do all columns still hold valid data? For example, do we have ages of minus one, or missing values? And do our columns hold similar distributions to the real data? Next, indistinguishability: to a trained eye, could we separate the synthetic individuals from the real data? If we can, it's likely that a model could as well, but this is where we can have those sort of fake names and other interesting touches, done in a way that won't affect models or analysis. Relationships: are the relationships between columns maintained as well as valid? Do weight and age correlate correctly, as we saw before? Size: is our data set the same size as, or larger than, the original data? In some cases, we might only need five or ten rows to showcase data. Documentation: are the quirks and metadata well documented? Could we safely expose some averages or standard deviations of our columns? And do we understand the quirks of the data set, for example, how that gender column is used internally?

And then these are five common purposes for synthetic data. It's not an exhaustive list, I'm sure there are more, but this is what I could think of. Machine learning: this synthetic data is used to improve the training of a model. Analysis: this is data that would be shared internally in an organization, or with third parties, for analysis, research and peer review. Recruitment: this data could be used to assess a candidate's aptitude before applying for a PhD topic, or when hiring a professional to work at your company. Showcase data: to me this is sort of like marketing data. This is display data that you would use to encourage users to engage with a particular open data set, where they might have to apply to get full access to the unrestricted data set. It could just be a few simple rows, or it could be a large data set constructed for use at a hackathon or something cool like that. It's purely to demonstrate the layout of the data and convince people that it's easier to engage with than they might expect. So you could alternatively say that this showcase data is primarily a marketing tool. And then finally, testing: this is data used to test existing systems. Invalid data here could cause an API to break or a website to crash, so the purpose might be to sort of break those systems, but only within an understanding of what those systems are actually expected to receive.
So we need an understanding of the requirements of those systems if we're going to test realistically. Now, these purposes all require different features, and with each of these features potentially comes its own complexity.

To start with, we'll look at the machine learning purpose. Must it be anonymous? It doesn't have to be, but we're risking those black box attacks we talked about, so I would suggest trying to make it anonymous if you can. If our synthetic data is showing the same weaknesses as the real data and we're just exposing individuals in our predictions, we're only getting half the benefit we could be getting, really. The structure must be similar to, or a subset of, the real data, whatever your model expects really. Free text fields could be ignored, but most categorical or numerical data should be kept until it's shown to be understood or irrelevant. While we could get away with replacing free text with something less personal, it depends heavily on the context. Using nonsensical names might be a good way for humans to recognize this as synthetic data without affecting the model's training. The relationship between variables must be maintained; this is the crutch that machine learning is relying on, so if we deviate from this, our accuracy will suffer very quickly. The size of this data set should be at least as big as the real data set, and as synthetic data is, hopefully at this point, a cost-effective way of generating more data, we're limited only really by the storage costs of that data, or by the point where the training sort of tapers off in accuracy, I suppose. And finally, any quirks of this data need to be documented. Whoever's performing the machine learning needs to understand exactly how this data relates to the problem context, otherwise they'll make assumptions and we'll end up with something that's just not correct at the end. They may also need to pre-process that data based on the context you've given them.

For the analysis purpose, we may not need to anonymize the data if we're only sharing this internally or it's staying within a particular team. That being said, any time we are using personal data, whether it's in research or industry, wherever, we need clear justification for the use of that data within GDPR constraints. Also, you're limited in who you can ask for help; for example, if you did employ a third party to perform your analysis, there might be more hoops to jump through. The structure could be identical to the real data, but a subset of the data might be acceptable here. Again, data for analysis, when synthesized, needs to be indistinguishable from the real data and maintain the relationships between those fields. We could get away with approximating our full dataset so that the analysis can be performed in less time to understand those trends, and documentation of quirks is still definitely needed, otherwise the analysis is going to be making a lot of assumptions.

For the recruitment case, I think it's important to preface this, because not everyone watching this is in industry or will be aware of it, but it's very common, if you apply for any sort of data analyst or data science role, to be given an interview based on a take-home task. That task will normally consist of some famous dataset, or an internal dataset that's been cleaned up, and a prompt like explore this dataset and find three insights, or assess the quality of this data, or something like that.
But what could be even better here is generating a synthetic version of the internal data those candidates would actually be working with. This way, your discussion can exist in that domain, very close to the work in the real role, instead of using the Titanic dataset of people who lived or died on the Titanic. There are some cool stories in there, but it's easy to be more curious about a cool story than about what the job would actually involve. And when you're synthesizing the data yourself, you can introduce errors, you can force distributions between columns, and hope that candidates will find them and bring them up. So it can be a nice way to improve the recruitment process.

So again, for recruitment we must guarantee that anonymity. It's super important here, because when people don't get jobs they can be quite spiteful, they can be malicious, and they still have that data. The structure could be identical to, or a sort of simplified version of, the real data; we might want to simplify it just out of respect for those candidates' time. In this case, we could intentionally make it obvious that the data is synthetic, using those intentionally fake-sounding values for names, emails, things like that. Depending on what we want to assess in the individuals, we might want to maintain or tweak some of the existing relationships. We might limit the size so we can send it over email instead of having to upload it to Dropbox, just keeping things easy for us as the employer, basically. And documentation of any quirks is needed for the candidate to understand the initial meaning. But we could hold back here. I personally think it's quite interesting to see what people do when they don't understand the dataset, and to see if they sort of silently assume things or if they get upset (and maybe I shouldn't be saying this), but it can be quite interesting to see how people respond to that.

Next, for showcase data, we can probably cut the most corners. As I've said, it could be as simple as five handcrafted rows that have the general shape of the data. We might just list this on a GitHub page that describes how to access the full dataset, or on the UK Data Service. It could be a nice sort of marketing tool for a new dataset. It could also be a large dataset that lets you practice your analysis before you get access to the real data, for example, in a secure lab. But the purpose here is to showcase what a user would get if they had the real data, without requiring them to fill in the paperwork, go through the training, or any of that. So if you maintain any data that's behind closed doors that could benefit from being open, this would be my suggested first step: get a showcase of five rows out there somewhere. Of course, we need to maintain anonymity in any dataset that will be freely seen. We can cut some corners on structure; maybe we only show a few rows and a few valid columns. It doesn't matter if these rows use obviously fake data, as users should hopefully know it's an example. We could, but we don't need to, maintain any relationships here. As long as our data is valid, these showcase rows are probably not going to be used for any final analysis or model training. Again, for the size, we probably only need a few rows here, but we could make it as large as we want. And as this is showcase data, I would suggest that we have the clearest documentation we can.
This is data that we're hoping to entice people with; people outside the context of that particular dataset might find it interesting, might build an event around it, might ask for more access to the unrestricted data, things like that.

And then finally we have testing. For testing data, we can use clearly synthetic data, and it might actually help us a lot to do so. For example, I used to work at a travel company and we would place hotels in the middle of the ocean, or we'd make them look like they were on the moon, so that nobody would try to book them or think they were real, because of the absurdity. This data can be nonsensical, but it still has to be valid; it still has to conform to the rules of the existing data. We don't have to actually try and support a hotel on the moon with a moon currency; we know that our system is probably only going to exist on Earth. So we're not trying to break those live systems; we're breaking them within a sort of safe environment rather than taking down live services. Even though this data is used internally, it's not ethical or safe to use personal data. With the right mistake, we could accidentally make hundreds of bookings on a personal card if we were using real data, so it's best to create artificial users, test cards and things like that. The structure of the data must be identical to what these systems are used to receiving. So if there's functionality that should exist but doesn't, for example, maybe a first name field should accept Hanzi Chinese characters, but we're just a small UK-based company right now and we just happen not to have encountered that yet, that's useful testing. That's something that could break the system, and we could already have encountered it in the real world; it just so happens we don't have users that want to write their name in Hanzi. Our data doesn't have to blend in with real data, though. We could use intentionally nonsensical names like Test McTest Person, so if anything does go wrong, our reporting will make it very clear that it's only an internal problem and we're not actually mistakenly making bookings for a real person. We may want to maintain relationships between columns; this will depend entirely on the business logic we're testing, and in most cases, as long as it's valid in structure, I don't really think we need to worry about this. In terms of the size, we probably want to generate a handful of cases, or one of each sort of realistic edge case. We might want to test our APIs every hour, for example, so we're probably talking about less than a thousand rows of data a day, though it might need to be generated in real time instead of picked from our heads and consistently run every hour. But hopefully, within an organization, if you care about testing, there should be enough access to this kind of knowledge that documentation should be the easiest out of all of them.

Yeah, cool. Thanks for being with me on those very information-dense slides. As we assess these different purposes and features, we find that the context-specific solutions to the problems of data synthesis can vary quite a lot. So to make data fit for a machine learning purpose, for example, we should probably run some data processing and apply some machine learning, or alternatively, there are some really nice modern data synthesis packages, such as Synthpop, Faker, and Mockaroo. Again, we'll cover these a lot more in the third webinar and the code demo.
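As a small taste of what those packages look like, here is a hypothetical Faker sketch for generating obviously artificial users, for example for testing or recruitment tasks. The field names are just an example schema, not anything from the webinar:

```python
from faker import Faker

fake = Faker("en_GB")  # UK locale so postcodes look plausible

def make_test_user() -> dict:
    """Generate one clearly artificial user record."""
    return {
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": fake.email(),
        "postcode": fake.postcode(),
    }

test_users = [make_test_user() for _ in range(5)]
for user in test_users:
    print(user)
```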
For the analysis purpose, we could probably ask, do we actually need synthetic data, or is it enough that we apply statistical disclosure control? If we redact the private data, re-sample some other columns and add some random noise, that could be enough, and we might not even really need to dive into a programming language to do any of that. For recruitment, we can cut a lot of corners. The context should be realistic and the documentation should be okay, but the data doesn't really have to be. We could use a cool web-based tool like Mockaroo that allows you to generate 1,000 rows of data, modify it yourself, add outliers, add invalid data, and force obvious relationships if you want to. Showcase data, I think, is the easiest one. You could just get away with five rows of data really; it could be done in Excel or Google Sheets, but if you need a larger data set, you can redact, re-sample and add noise from a real data set. And for testing data, we need realistic data that can be generated, hopefully at the time a test needs to run. It might be good enough to have a handful of hand-picked examples that we randomly select from, but we'll probably benefit from real-time generation of data if the purpose there is to push the limits of your system rather than break it. Faker in Python is really good for that.

And to conclude: synthetic data can mimic sensitive data, minus the sensitivities. I guess all of these come with the caveat that they might not as well; if you don't do it well, it won't help at all. Synthetic data can be a cost-effective way of improving models, but again, it might not improve the model. Synthetic data can make your work more reproducible, extendable and verifiable, but if you're not doing anything for privacy, you'll probably just get in more trouble than it's worth. And finally, we could use synthetic data for a variety of purposes, including improving models, making data open and testing code.

After this webinar ends, you should see a prompt to leave some feedback, so please go through this short survey; it helps us make better webinars. And then there's some further reading if you are interested in any of what you just heard. There's the Data in Government blog post on synthetic data, which was really good; most of this talk is based on that. There's a nice episode of the Gradient Dissent podcast with the founder of AI Reverie about how synthetic data is used in industry. And then I wrote a blog post, which is pretty much this talk with some extra technical details. There are also some more practical technical frameworks, and there is the synthetic longitudinal study (SYLLS) that I mentioned before. And finally, if you want it, there's a book recommendation: Synthetic Datasets for Statistical Disclosure Control. But otherwise, I'll see you in the next webinar, and I'll take some questions now. If you do want to contact me outside of this, I'm very active on Twitter, or you can email me at joseph.allen@manchester.ac.uk. Yeah, obviously some people might watch this on YouTube, so that'll be useful for them. But I think that's everything.