is Dr. Julia Kasmire. I work with the Cathie Marsh Institute and the UK Data Service, located at the University of Manchester. And today, in partnership with the N8 Research Partnership, I am presenting an Introduction to Synthetic Data. This is a workshop, so we'll start out with a presentation, there'll be a little bit of a break, and then there'll be some hands-on coding, which you're welcome to join me for, or you can watch me and then do it at your own pace later, because this is being recorded and will be posted on YouTube.

Let me introduce my fellow panellists and facilitators in this group. Grania is manning the chat, and that is the best place to ask questions about technical problems, things like "can I get that link?" or "why is my audio not working?". Louise and Nadia are manning the Q&A, so you can ask questions there. Actually, I have slides about this, so let's see if I can make this work.

So, interaction in this workshop. You can use the chat here; you're all familiar with Zoom chat, I'm sure this is old hat to you by now, and that is for technical questions and comments, for the workshop facilitator, Grania. Then we also have the Q&A, where you can ask or upvote questions that are directed at me, so that's more about the content of the presentation. Content questions go in the Q&A because this is a Zoom webinar and not a Zoom meeting, so there is a separate Q&A function. And finally, there's also interaction in this workshop through menti.com. You can go to menti.com on another device if you have a phone to hand, or on the same device you're watching this on, although it may be a little bit tricky to switch back and forth. You use the code at the top to get into the interactions in this workshop, so let's give it a try. Can you hear me now? Oh, we got some yeses. Fantastic. Again, the website and code that you need to participate are at the top of the screen, but you can also find them in the chat if you lose them for any reason; ask in the chat and Grania will copy them back for you.

All right. It looks like pretty good participation: we've got at least eight people who all agree that they can hear me. Nine. This is going up fast. This is also a demonstration of how menti.com works. Your answers to these questions are anonymous and they will appear live on the screen in roughly real time. However, if you are watching this on the live stream or watching the video recording back, they will not be live; if you're watching on the live stream there's more of a time delay, so it might be a little harder to participate.

I'd say nearly half of the participants in this workshop have voted on this, so I'll move ahead pretty quickly now to tell you what to do if you're having trouble hearing me, which so far no one who's responded to that poll has had. You can do the basic things like checking that your speaker or headset is plugged in and the volume is on. You can switch to listening via phone if that's of interest to you. You can give it up as a bad job and watch the recording back on YouTube later. Or you can go directly to our UKDS live stream. Sorry, I've just noticed that apparently the chat is still disabled for participants.
In that case, everything will have to go through the Q&A, which will be a little bit tricky because that will mix the technical, facilitator-type questions with the content questions, but hopefully with three people manning the Q&A they can sort through all of that quickly. But yes, this session is also available via live stream; you can watch it there if for any reason the Zoom webinar is giving you trouble.

Here are some other UKDS events that you might be interested in. We've got Developing Area Profiles Using Census 2021 Data, on the new data, which happens next week. And the following week, or maybe just after the following week, Mapping Crime Data in R, an intro to GIS. Sorry, hopefully everyone can still hear me; I've just been notified that my microphone has changed and my video seems to have turned off. Let's see if I can start it. Yes, interesting, getting some trouble there. Okay, but you can still hear me. Great. There's a live code demo for the mapping crime data session the following two days after the introduction, and then CSS drop-ins every second Tuesday of the month. You can sign up for those. Those are really valuable if you just have questions or kind of want to find out what other people are doing in computational methods. So that's quite a useful one if, say, you're working on a machine learning model and you think, oh, I don't know what package I ought to be using for this; maybe pop into our CSS drop-in and see what everyone else in that group uses for their machine learning models.

Okay, so, great. Let's move ahead if I can. Yes, the table of contents for this workshop. Now we're into the meat of it. We start off by discussing what is and is not synthetic data. This is trickier than you might think; a lot of people get it wrong. Don't worry, if you've gotten it wrong so far, pretend it never happened and we'll get the useful definitions to you. Following that is a section on fidelity, which is a really important concept for synthetic data. And then we go into the uses of synthetic data, some methods for generating synthetic data, and a break. The break, depending on what time we get there, will be at least five minutes, maybe 10, maybe 12, depending on the timing, to allow you to stretch your legs, get a drink, pat the dog on the head and all that kind of stuff. Following that, we'll have the hands-on session with the code demo. For the hands-on session, you can get the code in advance from our GitHub page. You can also follow along with Binder, which will launch in your browser if you want to do that. I think most people would prefer to download it and have it run on their computer; you can do that via GitHub. We'll get there, don't worry.

So to start us off, synthetic data. What is it? Synthetic data is any data that is generated rather than observed. Usually when we say that data is generated, we mean generated by a computer, but this is not strictly necessary. So, for example, a string of numbers made up on the spot is still generated, but I'm not a computer, I'm just a computational social scientist. Or let's say we want to generate a commute by starting at a point identified on a map. You throw a dart at a map and say that's where the commute starts. You then go to that point in the real world, you turn left every time you pass a red car, you turn right every time you pass a green car, and you end the commute at the nearest business after a chicken shop.
These kinds of bizarre, made-up rules that depend on lots of external input still produce a generated commute. That would be a synthetic commute. Even within computers, there are many ways to generate data, from creating lists of random numbers to building elaborate simulations and recording the actions and states of the simulated actors. Machine learning models can generate synthetic data to fit the classes that they've learned, for example. All of this we'll cover in more detail in the part of the discussion on how synthetic data is generated. But it's important to start off by knowing that synthetic data is generated and it is not observed. Observed data, in contrast, would be things like responses to surveys, temperature sensors, essays written by students, ticket sales, people making hash marks every time they see a target action happen, things like that. That's observed. Synthetic data is not.

So let's look at some examples that you are probably familiar with, but may not have known are synthetic data. Many of you will have seen, maybe even used, lorem ipsum placeholder text. This is very useful because it looks a lot like English, even though it's generated rather than real text. No one ever sat down and wrote this out from their mind as real text. It's useful as something behaving like English text without anyone having to actually write it. Another example of synthetic data that is generated rather than observed is fake celebrities. These are all pictures of people that do not exist, but they were created by an AI that was trained on photos of real celebrities and told to generate some new celebrities that matched what it had learned about real celebrities, which is genuinely creepy. Lots of you will have used random number generators, if you're trying to pick a raffle winner from online participation or something like that. This one is from Google, but of course there are other random number generators. And slightly more complicated than a random number generator is a dice simulator. You say how many dice, how many sides each of those dice has, and then you can throw a virtual handful of dice and get the results. This is very useful if you're playing some kind of online role-playing game with your friends and not everybody has all the right dice, and you don't want to let people cheat about whether it really landed on a 20. Just get a dice simulator that everyone can see the results of, and it's all fair and good and synthetic.

So synthetic data is not the same as real, observationally sourced data, even if the real observationally sourced data has been de-personalised, or anonymised, or noise-ified, or otherwise treated to remove identification. This sort of de-personalised data is very important, but it should be accurately described in terms like anonymised or de-personalised. It is not synthetic. Some people say synthetic data when they mean anonymised or de-personalised. They are wrong. Not only are they wrong, but they're making it hard for people who work with synthetic data to work with synthetic data.

So now let's play some games. If you go to Menti and enter the code — and I've got a few of these — tell me, is this synthetic or not? Is data from a cycle lane sensor synthetic? We've got a very strong indication so far that it is no.
And of course, you will all know that that is because cycle lane sensors make observations of the real world. Oh, we've got a couple of "I have doubts". That's great. It's great to have doubts, because obviously this is an introduction to synthetic data, and things may not be clear yet. Certainly, cycle lane sensor data is anonymised: it doesn't ask who's going past the sensor, it just counts how many people. But it is observing the real world based on some kind of trigger or real-world observation. So this is not synthetic data. Okay, thank you very much for participating.

I've got another fun one. Are the predictions from weather forecasts synthetic data? Let's say we have the average high temperatures for next week, according to Met Office forecasts. Oh, we've got more division on this one; this one's much more evenly split. No, it looks like yes is winning. And we've got some people who don't want to commit: a very scientific answer, to say "I don't have enough information". Still quite evenly split, though. I think this is interesting. I will tell you that weather forecasts are made by very complex simulations, so the output of a weather forecast is going to be synthetic data. Those simulations are built on real-world observations, but the predictions come from simulations and therefore are synthetic. Thank you very much for participating, and I'm really pleased that that one was actually split, because it shows that we're still learning how synthetic data and observations work. Very good, everyone.

Is census microdata synthetic? These are the small samples of data that include actual responses, but are treated to prevent identification of the household or individual. Now, I did prime you quite well with my rant against calling anonymised data synthetic, so thank you very much for getting this right. I really appreciate that you're paying attention. Okay, wonderful.

Is the output from ChatGPT synthetic? I assume many of you will have seen this in the news. These are the answers to questions that real people pose to this AI chat bot, and it gives you answers, which it has learned via supervised and unsupervised learning. Oh, we've got one "no", and we've got a "can you rephrase the question?". I can rephrase the question: is the output of ChatGPT synthetic? Most of you have said yes, and I believe so. I did not create ChatGPT, so there may be some tiny man in a box who's typing away the answers, but I'm much more inclined to believe that it is synthetic output created by an AI or a machine learning model, and therefore it is not based on real observations of how real people have answered those questions. Thank you very much.

Got another one: is stuff you just made up in your head synthetic? A completely fabricated anecdote, for example, about that time a co-worker stole your lunch but then almost choked because you inexplicably put peanut butter on a coronation chicken sandwich. Is that made-up anecdote synthetic, not synthetic, or "look over there"? I love how many people are replying "look over there"; I can only imagine it's accompanied by the sound of running away. Most of you have agreed that yes, this is synthetic, and I believe so. Yes, this is generated — generated by a person, through storytelling and fiction. But it is still generated, at least in this kind of context, unless someone is writing down observations in response to a question.
So I can see why there is some yes and some no. But yes, a completely fictitious story would be generated. Right, very well, everyone. Thank you for participating in my little game of "is it synthetic?". I need a sting, a little bit of music that plays for my synthetic game.

Now we're going to talk about fidelity. Fidelity is important in a synthetic data context. Fidelity means faithfulness, but data faithfulness is not a binary thing. Things are not simply faithful or unfaithful in a data sense, or at least not nearly as often as people might think. Synthetic data can be faithful on many different features and unfaithful on other features. So it can be faithful on the number or type of variables, the mean, the standard deviation, the distribution; it can be faithful to the documentation; it can be a faithful representation of the volume of data in the original; it can be faithful to the relationships between variables; lots of other features. Does it have the same number of categories for a categorical variable, and do the category counts match, or have the same ratios, for example? Those are different ways to be faithful. If the category counts match exactly, then the volume of data must match exactly as well, but if it's only the same ratio, the volume of data could be different. So you can see how it can be faithful on some features and not others. The documentation one might need some more explaining. For example, if the original data has male coded as zero and female coded as one, is that true for the synthetic version, or have they been switched, so that female is zero and male is one? It can be faithful on the fact that there are these categories, but unfaithful on the documentation of those categories. So just something to think about: faithfulness is not a strictly binary feature.

So let's look at this low fidelity example. This whole set is faithful in that the real and synthetic versions have the same number of columns and rows, and the labels on the columns roughly match. But let's look at column Q. The synthetic version is faithful in that both the real and synthetic have an even distribution over two options, but the synthetic version has three and four instead of zero and one. So it's faithful on the distribution and the number of categories, but it's not faithful on the contents of those categories. Then column Z is pretty close in a lot of ways: same mean, same basic distribution, et cetera. Even the values, 100 and 100.63 for example, are quite close. But the synthetic version has a different number of post-decimal digits, so it's not faithful in terms of precision; it has a different level of precision. So do these differences, these ways that synthetic data is or is not faithful to the original, matter? Well, it depends on what you're using the synthetic data for, and we'll come back to that in the uses of synthetic data. For now, just know that differences can be there, they can be big or small, and they might matter or not matter.

So now we've got relationships between variables. This is one of the features I highlighted that could be faithful or not faithful. It is often important, but not necessarily straightforward, to maintain relatively simple relationships between two variables, and it gets much harder when there are more variables and thus more relationships, or when the relationships are more complex. So here, for example, we've got weight and height. Presumably of people, I don't know; this is just an image I nicked off the interwebs. But let's also note that there are some clear outliers in this.
And we can't be entirely faithful to the pattern and to the original data. We can't have the maximum weight be exactly the same in the real and the synthetic, because that could potentially recreate this outlier, which would be a disclosure risk. That would mean there's a problem: the synthetic data isn't properly synthetic if it matches so much that it recreates exact outliers. So that's something you have to be careful of. If you're trying to preserve a relationship and the minimum and maximum, for example, then there's a real chance that you might accidentally recreate the real data if you're trying too hard to be faithful.

So let's think about this a bit more. High fidelity synthetic data is very hard to make, as essentially it must be custom built to suit the particular data set, the particular research question, the use case and how it was generated. You have to clarify exactly which features of the original need to be faithfully reproduced and which can be left to vary without being faithful. Nothing can be 100% faithful, especially if disclosure risk is a problem, because synthetic data cannot match any real observations if it's properly done. And as the complexity of the data and the relationships between variables grows, it becomes increasingly hard to make everything faithful. This makes sense, of course. But we have to carefully check to ensure that no case of the generated data matches a case in the original. You don't want to generate a bunch of fake people and accidentally create a name, city of residence and income that matches a real person, because that's potentially really problematic. So that's just something else to think about with high fidelity.

But fortunately, greater fidelity is not always better. There are many reasons to make synthetic data, and some of those reasons may mean that we actually want synthetic versions that are not faithful to the original in some important ways. High fidelity data is absolutely not required for most of the purposes that people currently use synthetic data for, and medium, or medium-high at best, is often good enough. The real skill is matching the fidelity that you need, and specifically what is and is not faithful, to the purpose for use.

So let's look at an example. Say we're trying to build a machine learning or AI tool to help identify potentially problematic skin lesions, but the data set we have comes from a northern European country that has a relatively non-diverse population. As a result, our real data set has very few photos of skin lesions on people of colour. We want the tool to work on all kinds of people. So what do we do? Ideally, we would go out and get more representative photos, which is a valid research project. But the reason we have few photos in the first place is that the clinics taking good photos are well-funded ones in countries that are not representative of the world's population. So while a valid research project on photographing skin lesions on all kinds of people all around the world is a good project, we might need to justify why we want to collect that data. And to do that, we might build a synthetic data set that mimics some of those features, develop an AI that shows it has a good use, and then use that to justify why we need to pay to go and collect the real data. So this brings us nicely to our next topic, which is purposes for synthetic data. Hey, we're doing great, everyone. I really appreciate how much you're paying attention.
So one purpose for synthetic data is simply a preview. This is a purpose you will probably be aware of, if not used yourself: to preview your data. This might also just be called a sample. So, for example, someone says, oh, what does your data look like? Can you send me three rows just so I can see the basic shape of it? Have you asked the questions that I'm hoping to ask? That's a perfectly valid use for synthetic data if sending three rows of real data is not allowed under the licence agreement, or you're worried that sending three rows of the real stuff is potentially an issue. So, things about preview data: they tend to be very small, maybe one or five or ten rows, something like that. They demonstrate the structure and the style of the data set that you have. They should be faithful to the number and types of fields and the format, and maybe contain enough data to highlight features that make the data unique. There may or may not be clear indications of features or relationships of interest in the data, although that should be covered in the accompanying documentation.

Here's another one that you may not be aware of, because it's often used in computer science and computational methods exploration: the proof of concept or toy data set. They're called toy data sets especially in computer science, so you may have heard of that. These are sufficiently large to usefully demonstrate theoretical concepts that might work on the real data. So let's say you want to show that a heat map is a good way of exploring a particular data set, but for whatever reason you can't get that data set right now. So you make a fake one, create a heat map on the fake one, and say, look, you can see these relationships very usefully in the heat map. That would be a proof of concept use of a toy synthetic data set. This is often a preliminary step. Going back to the skin lesions AI example, you might create a toy data set with absolute black and white fields, where the black fields have varying size and roundness, and show that your AI can learn to distinguish the size and the roundness of your mocked-up synthetic lesions. You might then develop it on additional toy data sets that have more skin tones rather than pure black and white, and show that it still identifies the lesion and categorises it by size and shape, or something like that. Those are toy data sets.

So here's another interaction. I'm interested to know if you have ever used synthetic data for preview or proof of concept purposes. So, lots of people have used neither of those; that's totally fine. Several people have used proof of concept; that's quite good too. One person so far has used both. Oh no, we've got two votes for both. Great. This is a really good response; I'm pleased to see that we have several people who've used one or more. We've also got a "not totally sure", which is valid, because you might think: did that time I made up five numbers and used them just to make sure I understood this R code count as synthetic data? Maybe. Maybe. Okay, very good. That's fantastic. I'm really pleased that we've got quite a few people using proof of concept and both, because I thought preview would be more common than proof of concept, but you have shown me that I am wrong, which is always good to know. Okay, availability.
So you can use synthetic data when the real stuff is not available for various reasons, or even when the real stuff is impossible, which is why I chose this image. So, for availability, your synthetic data would be sufficiently large to do whatever it is you're trying to do. For example, maybe the real stuff is not currently available in the right format. Maybe it relates to rare or very negative events; we may never get sufficiently large data on very rare events, you know, plane crashes or earthquakes of Richter 9 and above or something like that. Maybe the data relates to unethical situations: certainly we can't go deliberately creating unethical situations in order to get good data. That's not a good approach; we should create synthetic ones instead. Or maybe it's just unfeasible for the scope of the work. I'll go through some of these.

To some extent this overlaps a bit with the proof of concept slide we saw just a couple of slides ago, but it extends beyond things that are currently unavailable into things that may never be available. As I mentioned, for rare or very negative events, we do not have excellent data for thousands of large earthquakes, for example. So testing products for earthquake safety requires that we use simulated earthquakes. Similarly, we don't want to create unethical situations in real life to study how people might react. We can simulate those situations and see if the outcomes surprise us. I used a lot of this actually in my master's degree on language evolution, because it is of course unethical to raise a bunch of children without language just to see what happens when you put them together. That is unethical; we cannot get real data for that situation. We can set up computers so that they simulate that situation, though. Also in this group are the times when we want really detailed records of things, maybe as they're developing, but it's impossible or burdensome to obtain them in real time. Imagine you want to study the changes in chess strategy as a person goes from a total beginner to an advanced player. No total beginner is going to agree to have every single game they play recorded in excruciating detail, especially if they're not sure that they want to be serious about chess. I actually used this in a paper I'm presenting in summer, in which I argue that innovations that have a radical effect are really hard to predict or study closely in real time at early stages; instead they're almost always recognised in hindsight, but the way we look back at those early stages in hindsight is often not very accurate. So in this context, the context of availability, faithfulness is hard to judge because we don't have the real data to compare against. So instead of thinking about this sort of synthetic data as faithful or not, we need to ask whether it is useful or not, which frees us from worrying about faithfulness and instead makes us worry about usefulness.

Presentation. Here's another purpose for synthetic data you're probably familiar with, if you haven't even used it yourself. This means any data that is intended to be made freely or publicly available, such as when you put it in presentations, share it in a tweet, put it in the GitHub repository with your code, all of that stuff. This includes teaching, by the way; I'm counting teaching as a way of presenting something. Lots of things that are called teaching data sets may actually be anonymised data rather than synthetic, but synthetic data for teaching purposes is totally valid.
So these, again, are sufficiently large to do whatever it is you're trying to do: to present data on a slide, to let people have something to run your code on, things like that. They must be representative in relevant ways, and this again comes down to how do I make it faithful for my purposes. But you have to take care if it's possible that it could be mistaken as real, and you have to really thoroughly test it. Because if you're making it freely available, then lots and lots of people will try to use it, and they may use it in ways you're not expecting. So try to be thorough when you're testing it. So yes, you've probably used, or at least seen other people use, synthetic data for presentation purposes. I do also want to point out that lots of people who feel they cannot make their work reproducible may be able to do so with synthetic data. Someone I know, for example, said, oh, I can't make my work reproducible, I work with secure data. But you can make synthetic versions so that people understand your steps. And then, if they're really interested in exactly reproducing your outcomes, they can get access to the secure data; but they can at least understand your steps very thoroughly, because you've made a synthetic data set for them to work through your steps. So consider it a very valuable part of reproducibility.

So we come to another interaction. Have you used synthetic data for accessibility or presentation purposes? Some of you who answered yes to proof of concept can answer yes here; some people who did proof of concept stuff may have used it for accessibility purposes. Yes, I'm getting a couple of responses there. This is great. Lots here for presentation as well. I think that's a really valuable indication of how people are being careful about, you know, personal data, identifiers, things like that. I did expect presentation to be a very popular response. And if you're interested in learning how you might present data with a synthetic data set, but you haven't done so yet, don't worry; we'll come to that in the second half of the workshop, where we go through the code demo, work through some examples, and talk about how it might be useful.

Okay. We've got another example, and that is code development. This is a very useful purpose for synthetic data that many of you may not have considered: efficient and thorough code development. So again, sufficiently large — and that's quite a hand-wavy term, because it depends on your purposes — and in many ways deliberately unfaithful. You have to test whether the code you expect to run well on the real data does actually run under all assumptions that might apply. Let me go through these points. You want to test that the code outputs and documentation are clear and useful, and you want to ensure your code could be run by others. So you might use this sort of synthetic data in slides at conferences and GitHub repositories to showcase your methods or to teach students... sorry, wait, that's still presentation. I missed something in my notes. What the code development slide should be saying is that it has to be deliberately unfaithful. For example, if your data has never had missing values in a particular column in the past, and you want to use synthetic data to test your code, you should make sure that there is a missing value in that column.
Because you want to make sure that your code doesn't hit something unexpected and crash. You want to make sure that if it hits something unexpected it does something reasonable: at least gives a useful error message, or ignores it, or replaces it with a zero, or whatever it is you think should happen. But you need to make sure all possible assumptions are tested.

So, let's see. Remote work. This is another one, and this obviously is much more timely since we're all working remotely. You might tell that I, for example, am not in the office; I am in my upstairs craft room slash library slash home office. Lots of us are working from home. But more importantly, lots of us are also working with people from other institutions, so we may not all have access to log on to the same secure servers. Or we may want to push code to a remote high-powered computing facility like Bede, which is very popular, and you cannot do that with some kinds of secure data. But if you still want to collaborate with people from other institutions or push code to a remote machine, and you cannot provide the real data, you need to create a synthetic data set that you can push to those remote servers so that you can work together or work remotely. Now, this one's tricky: it should probably be quite large, and medium, maybe even high, fidelity. So this kind of data set is probably going to be the hardest one to make. You should take care to make sure that it's portable and workable in diverse computing environments, that it's faithful to the original as necessary for your analysis, and that it's clearly useful for reproducing results. We want to make sure that it runs when you run your code, but also when everyone else in your team runs the code, or when they run code that is comparable, maybe in a different language. We want to make sure that it's really useful for reproducing results properly, and also that it is communicated accurately: what data set is it based on, on which features is it faithful or not faithful, how big is it, what has it been tested on, all of this kind of stuff.

So now I've got another interaction. Have you ever used synthetic data for code development or remote work purposes? These are slightly more advanced, so I'm expecting a fair few "no, neither". Ah, but good, we're getting some of these other ones. This is fantastic. I don't know if the exact same people are answering yes on each of these "have you used" slides, but potentially it shows how diverse the purposes for synthetic data are, even among our audience, which is 60 participants today. Great. Yes, I did expect more people to respond to this one saying neither, because these are arguably slightly more advanced purposes for synthetic data, but very useful ones.

Now, how to generate synthetic data. This is what you're here for. This is the big deal. Handmade. This is the most obvious example of synthetic data. If you only need a few examples, you can just make it up from your imagination. So maybe sample data, for example: someone says, oh, can you send me a couple of rows from your data so that I can see what it looks like, what order the variables are in? And you say, no, my data is secure, but I will fake up a row that shows Joe Bloggs from Nowheresville or Phoney McFake from the Mariana Trench. These kinds of things are obviously, obviously synthetic, but very useful for showing what the data is shaped like. They might be representative or not; it depends.
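A hand-made preview like that can literally be a couple of obviously fake rows typed straight into a data frame. A minimal sketch in Python; the column names and values here are made up purely for illustration:

```python
# A couple of obviously fake, hand-typed rows to show the shape of a dataset.
# Column names and values are purely illustrative, not from any real data.
import pandas as pd

preview = pd.DataFrame(
    {
        "name": ["Joe Bloggs", "Phoney McFake"],
        "city": ["Nowheresville", "Mariana Trench"],
        "height_cm": [172.0, 181.5],
        "weight_kg": [70.2, 85.0],
    }
)
print(preview)
```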
If you're trying to hand-test edge cases to make sure your code doesn't crash, then you might hand-make ten lines with really extreme values: synthetic data with missing values, with nonsense values, like a height of 1000 centimetres and a weight of negative 20. So back on our height and weight synthetic data, we might say, what happens if we have a negative weight? This doesn't happen in the real world, but in synthetic data it's possible, because we want to make sure the code doesn't crash or throw up a weird wobbly error message or something. Because data can be entered incorrectly, so you can have bad data in the real stuff too. So with small-scale handmade synthetic data, you can have exactly the data that you want, to demonstrate or test exactly what you want to test. But it's a pain to make lots of data by hand.

So I recommend you get good at making random or nonsense data. This is output essentially generated by random generators. Numbers are really easy; most programming languages will have multiple ways to generate numbers or strings of numbers in random ways. You can also generate some strings: some basic packages and loops can give you a first-name sort of thing and a surname sort of thing, and you can populate some fields with this much more easily than trying to hand-write a bunch of fake names. And you can do combinations of more or less structured things. You might need to import a few packages and write mini programs, loops or scripts to, for example, create a bunch of fake email addresses that have text, an at symbol, more text, a dot, and two or three letters following the dot. So you can write some basic rules here. There are some very common types of data that people want to generate with this sort of random or nonsense generation, so you can get some ready-made packages and libraries to generate synthetic versions of common things like names, email addresses, postcodes. These include synthpop for R, Faker for Python, or Mockaroo, which is available via the web or through API access. So you can do it yourself by writing little mini programs, or you can just get some of these packages if the thing you need is pretty common.

Now, moving on to medium fidelity: you want something that not only looks reasonable, but that matches your real data set properly. To do this, you're going to want to learn some machine learning methods for generating synthetic data. These include the basic stuff like supervised methods, anything from linear regressions through to decision trees and neural networks, and classification methods, so naive Bayes sort of stuff, decision trees, neural networks again. With unsupervised methods you can get clustering and dimensionality reduction, principal component analysis. These are easy to find if you want to learn more about them. We'll discuss the linear regression approach to machine learning for synthetic data generation in our code demo.

And my personal favourite is simulation. This is output generated by simulation methods. Now, I include artificial environments like wind tunnels or wave pools or vacuum tanks, in which inputs and outputs are measured in a controlled environment that's meant to mimic the real world. This is sort of straddling the line between observation and simulation and generation, so some people might say this is observation, because you're observing a real environment. Oh, my video camera has just changed.
Let me move back to here. There we go. Can everyone hear me? I assume I possibly lost a bit of sound there; apologies for that. Beyond the artificial environments like wind tunnels and wave pools, there's fully computer-simulated data, in which real and/or simulated actors, forces and situations are applied within simulated environments, and inputs and outputs are measured. People use one or both of these kinds of simulations for predicting the outcomes of complex interactions. So the weather forecast that we talked about earlier: weather forecasts are very high-level computer simulations that take loads of real historical data and our best understanding of how the physics of the real world works, put that into rules, and crank it through the computer to get the simulated outputs. Those are the weather predictions. So that's a simulation. But then there are also things like the epidemiology models that have come out recently, in which people predict that if people do or don't wear masks, or do or don't social distance, the infections will spread this way or that way, and these will be the predicted infection rates. That's all done with computer simulations: they take real-world data as much as possible, put it into a computer simulation, crank it through, and get some predicted outputs. That's therefore synthetic data.

So let's talk about some synthetic data conclusions. Synthetic data, again, is generated, not anonymised. Fidelity matters, but truly, truly high fidelity is not really feasible. Let me take a sip of tea. Thank you. Higher fidelity is not always better. There are many purposes for synthetic data. There are many ways to generate synthetic data. And synthetic data is very important for making your work reproducible.

Okay, here's another interaction via Menti. Please tell me — I think you can enter up to three words or short phrases — what do you think about synthetic data? Maybe what you've learned, maybe what you want to use it for, maybe why you hate it and everything about it. Feel free to enter some words here and you will see them come up on the screen, and repeated words will get bigger, because that's how word clouds work. Oh, I'm liking this. Yes, got some good words here. Expand small datasets: absolutely. There are times when the real-world stuff just isn't available, and you can use synthetic data as a sort of copy-with-variations of that small dataset; that's technically synthetic. Methodology development: absolutely. Data augmentation: that's like expanding small datasets. Generalisability: excellent, because your code might fit your own data exactly, but you want to make sure it works on other, very similar kinds of data. Security, useful, real, fake, solutions, ethical. I love that people have picked up on the ethics, because there are times when you just can't run experiments in the real world the way you might want to, but you can run them in computers. Okay, excellent. Visualisation, random strings, efficient solution. Liverpool: someone just wanted to enter Liverpool. Great. Future scenario analysis: absolutely, because we cannot get data from the future — that is an unavailable dataset — but we can make our best guess and see what we can do. Great stuff. I think this is fantastic work, everyone. Thank you very much. So we're going to take a break.
It is 10.50 now, so I'd say at 11 we will start the second part. I will put a couple of screens up and leave you with some links to some other resources if you want them. I will be reading through the Q&A during this break, though I may take a short break to stretch my legs, and then we can work through some of the Q&As as well. Okay, so if you want to take a break, get something to eat or drink, stretch your legs, walk the dog, you have until 11 o'clock. I will see you then.

Hello again, everyone. Just wanted to go through a few questions that have come up in the Q&A. One interesting question is: would proper data documentation eliminate the need for synthetic preview data? It depends on why people need to see the preview. Really good documentation would certainly prevent some people from needing a preview, because they would say, oh, what's in this? I'll look at the data documentation, it tells me what's in it. But it wouldn't eliminate the need for people to ask, okay, does this format work with my code? You would have to be really exhaustive with your data documentation, and it's much easier to create good data documentation and also provide a small preview, so that people can download it and try their code on it.

Let's see, another one: while generating the data, could we get a row of synthetic data that is exactly the same as real data? It is possible. It's unlikely, depending on your generation process, but it is possible. And so if you're working with very secure or very personal data, you should definitely run an extra check that no row of your synthetic data exactly matches any row of your real data. That's a very good check if you're working with something sensitive.

Is it possible to imprint specific statistical characteristics on synthetic data? Yes. That depends on your generation method, but you can also generate lots and then eliminate the ones that prevent your statistical characteristics from appearing.

"I get the idea of showing a few rows of synthetic data as a preview, but it only applies to relational databases. What if we have tables, grids of data?" I mean, arguably, for any kind of data that you might use, you can create a preview of it. Whatever file format you're working with, can you do a small version? Or maybe your preview would have to be big. It depends on your data type, but you can absolutely create a preview of your data type, even if it isn't small. So yes, some questions there.

Let's move on. Let me stop sharing and I will move on to a new share. Let me close this window. Let's see. Here we go. All right, I don't know if you can see it; I hope so. On our UK Data Service synthetic data page, can everyone see our UK Data Service open synthetic data GitHub repository? Zoom in a tiny bit? Yep, I can do that. Let's see. How's that? Hopefully you can see this; I agree, GitHub does come across as really tiny in the background and I often need to zoom in. This is our GitHub page. You can find the slides that I've used so far, including clickable links for anyone who wanted the further reading links from this 2023 workshop. You can get the code demo — this is where the code that we'll be working with is — and the webinar materials that include the slide decks. I'm going to go back and show you. If you want to work through the code, you can absolutely copy and clone this, or you can download a zip and work through it, or you can clone the GitHub repo and work with the data.
Or if you don't want to work with that, you can hit "launch binder". Now, don't all do this at once, because it will crash Binder, but "launch binder" will allow you to work through the code in a virtual environment within your web browser. You are welcome to do that. If we all do it now, it might crash, but you can just watch me, and then at some time in the future, because this is being recorded and will be shared, you can pause it, work through something, start the video again, or rewind it if you want to catch something again. I will work through the code. Again, let's see if we can zoom in a bit, if I can get this to zoom. Come on, you can do it.

Let's think about it. I launched this Binder about an hour ago; in theory it should work, but who knows. This does not seem to want to work. Let me try refreshing it. Binder not found. Okay, so I will launch it again. Binder can be a bit slow, which is why I launched it in advance. Hopefully it will actually build it this time; it did build it for me properly earlier today, and maybe it's just throwing a bit of a hissy fit. If it's taking a long time to launch, it's usually because Binder needs to recreate the environment. Hopefully it will work, but if not, I will open up my Jupyter notebook on my machine and we will do it that way. Which is not entirely ideal, but yeah. In the meantime, I will answer some of the questions from the Q&A.

Someone asked: would it be correct to remove generated rows that happen to reflect the real data? And is it possible for the absence of these results to reveal more about the real data than leaving the results in? It is absolutely correct to remove generated rows that match the real data. Is it possible for the absence of these results to reveal more about the real data than leaving them in? I don't think so, but I cannot guarantee that it would never be relevant; it's not impossible, I guess.

Anyway, I have now got my code demo up. Okay. Jupyter notebooks, whether they're through Binder or through your own Jupyter notebook installation, can be run cell by cell. To do that, I hold down Control and hit Enter. There are other ways: you can click this little run button in the top panel, for example. It shows it's running by putting an asterisk here. So it's already got all this stuff installed, because I've run this a bunch, but all of this stuff is the different packages we will be using in this code demo. So, blah, blah, blah, lots of text, still running, because Binder does slow it down a bit; if I were running this on my own machine directly through Jupyter notebook, it might be a little bit faster. Okay, let's see. Just so you know, the Jupyter notebook does contain some cells that you can run directly as they are, and it also contains, pardon me, some cells where you will have to edit the contents of the cell to get them to run properly. So let's see. Now this has moved to a one; it was an asterisk, it's now a one. That means it's the first cell that has run, which means it's finished running. Great, we've passed all of that. So that's all of our importing of packages; that's important for the code to run. Here, we're going to check the things that we can get. This is essentially my folder; you see here the input folder of stuff that I can import. So it's going to tell me I can import the weight-height CSV, and I am importing it as h underscore w underscore original. So now we're going to check the data.
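In outline, those first cells boil down to something like this. This is only a minimal sketch: the exact file name and the full package list are assumptions, but the variable name h_w_original follows the demo.

```python
# Load the packages and read in the weight-height data.
# The file name "weight_height.csv" is an assumption about the demo's input folder.
import pandas as pd

h_w_original = pd.read_csv("weight_height.csv")
print(h_w_original.shape)  # the demo's file has 10,000 rows and 3 columns
```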
This is a good process to follow in general. Here's a bit of what it looks like. This, for example, is why a preview of your data file is so useful: sometimes just seeing a couple of rows of the data gives you a much better sense of it than the description does. In this case, this is the real contents of the data, but if your real data is secure or personal in some way, just seeing some rows of synthetic data gives people a much better understanding of your data. In this case, we can see that it is 10,000 rows and three columns: gender, height, weight. Personally, I think that's probably sex rather than gender. There's another way we can check the contents, by calling just the head, and the number inside the brackets tells us how many rows from the start to give us. So instead of giving us five at the top and five at the bottom, this gives us ten at the top. Extra credit: call up some other number for head. And also, what do you think tail does? I'll move ahead on that and let you do it in your own time.

So we're going to explore the data a little bit. This command gives us the column names and also what data type each is, whereas info gives us a bit more: it tells us the name of the column, how many non-null values it has, and what kind of thing it is. So gender is an object, and height and weight are both floats. Great, that tells us a lot about the data, and that is important if you have never opened a data set before and worked through it; it's probably less important if this is the 18th time you're working with the same data set. But now we're getting into what the data really is. And for that, we want things like describe, which gives us the mean, standard deviation, minimum, maximum, interquartile ranges, all of this good stuff. So that's just some basic descriptive statistics; you get that with the describe command. You can also get describe for one particular group: you can group by gender. So, for example, you want the mean for females, for height. That's pretty useful. This way you can get different kinds of descriptive statistics: give me just for males, just for females, just height, just weight, things like that. You can get value counts; this will tell you exactly how many of each there are. And for extra credit, you can get value counts on something else, so height or weight: you can see if there's more than one row with exactly the same height or exactly the same weight.

But again, we people are very visual creatures, so instead of just looking at the descriptive statistics, maybe we want a visualisation. Oh, look, there it is: height and weight as a scatter plot. Now, there's a clear relationship here. It's clear that we don't get very, very tall people with very, very low weights, and we don't get very, very heavy people who are also very, very short. So there's a clear, linear relationship here. That's not surprising. So let's look at a scatter plot with gender mapped onto colour. Here we can see the different genders. They have the same basic relationship — they're both on this line quite clearly — but they are not interchangeable; there are clear differences there. There's also lots more you can do with visualisations; these are very basic ones just to introduce it. There's extra credit work here.
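Pulled together, the checks and the quick plot just described look roughly like this. A sketch only: the column capitalisation ("Gender", "Height", "Weight") is an assumption, and it assumes matplotlib is among the imported packages.

```python
# Quick structural checks, descriptive statistics, and a scatter plot
# coloured by gender. Column names are assumed, not confirmed from the demo.
import matplotlib.pyplot as plt

print(h_w_original.head(10))                              # first 10 rows
h_w_original.info()                                       # columns, non-null counts, dtypes
print(h_w_original.describe())                            # mean, std, min, max, quartiles
print(h_w_original.groupby("Gender")["Height"].mean())    # mean height per gender
print(h_w_original["Gender"].value_counts())              # how many of each gender

for gender in ["Male", "Female"]:
    subset = h_w_original[h_w_original["Gender"] == gender]
    plt.scatter(subset["Height"], subset["Weight"], s=5, alpha=0.4, label=gender)
plt.xlabel("Height")
plt.ylabel("Weight")
plt.legend()
plt.show()
```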
If you want to use a different package, there's one here using sns (seaborn) instead of matplotlib to create the scatter plots. You may find that you have a favourite package and you always use that for visualisations, but some packages have different features than others, and you may find sometimes you need to know more than one. So it's good to have a play around with something you understand well, like a basic scatter plot like this, in more than one package to see how they play out.

Now, do we have any questions so far on importing the data, getting descriptive statistics, quick visualisations? I'm kind of not expecting any, because if you've ever worked with data, this should be quite basic, but I just want to make sure. Yeah, everyone's saying it's hard to get into the Binder; there are too many of us. We don't have big money put down to make our Binders work for hundreds of people at the same time. So you can, pardon me, download or clone the repository and get the code that way, and just run it on your own machine, assuming you have Python and Jupyter notebooks installed. Otherwise, you can watch me work through the code demo and try it again at your own pace afterwards. I assume we will go on to have lunch, so some people might be able to get in right after and other people might tomorrow morning, I don't know.

But if there are no questions on that so far, let's talk about distributions. We can see from the linear relationships in the scatter plots that there's something going on here with distributions, so let's look at some histograms. Well, look, that looks quite nicely bell shaped, doesn't it? That's the distribution of height in 30 bins. Good, straightforward. Now, here you can try switching to a different number of bins; I won't bother, I'll let you do that on your own. And let's look at weight. Now, weight is much less bell shaped; this is clearly bimodal. That's quite interesting. Height seemed to be unimodal, but weight seems to be bimodal, yet we saw such a clear relationship. What's going on? Well, I suspect it's because of the difference between male and female. I could be wrong. So here we're going to create a subset of the rows where gender equals male and where gender equals female, and we'll create histograms of the male-only subset at 50% transparency and the female-only subset also at 50%, so that you can see both subsets laid over the same graph. This is much more code than the first one, but look at how nice the outcome is. We can see that height, as well, is no longer unimodal: height for both, when you plot it properly, also comes out as bimodal. So this explains why we also get a bimodal weight distribution. There is obviously some overlap — especially for height there's quite a lot of overlap — which is why it made sense that there appeared to be a unimodal distribution for height when we first looked at it. And this also highlights why you should really understand your data very well before you try to create a synthetic version of it, because a first glance at our first histogram gave us the wrong idea. You do have to understand your data very, very well before you can create a good synthetic version.

So now let's talk about distributions. I mean, these look relatively bell shaped, but are they?
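That overlay step comes down to something like the following; again a sketch with assumed column names, where alpha=0.5 gives the 50% transparency just mentioned.

```python
# Overlaid histograms: split the data by gender, then draw both subsets
# on the same axes at 50% transparency so the two distributions are visible.
import matplotlib.pyplot as plt

male_only = h_w_original[h_w_original["Gender"] == "Male"]
female_only = h_w_original[h_w_original["Gender"] == "Female"]

plt.hist(male_only["Height"], bins=30, alpha=0.5, label="Male")
plt.hist(female_only["Height"], bins=30, alpha=0.5, label="Female")
plt.xlabel("Height")
plt.legend()
plt.show()
```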
Plus they're bimodal, so there's something going on. But is each subset normal? It's clearly not a uniform distribution, but there could be other distributions. In fact, here are the 10 most common distributions: power law, normal, log-normal, gamma, exponential, uniform, all this kind of stuff. We want to know, for our subsets of male and female, which distribution matches. So here, for height for males, we take the distribution of that subset and we match it against those common distributions, and it'll tell us which one is the closest match. And we do that again for weight for males, height for females, and weight for females. Then we ask it to print out the best distribution for each of those subsets. And that actually runs pretty fast, did it not? All right. Well, sorry, again: this step does not find the best distribution, it finds the fit between each of these 10 distributions and the subset. So for height for males, how closely does it match each of these 10? The next bit of code creates an empty list for the best fits, and for each of those subsets it finds the best-fitting of the 10 distributions and appends it to our list, and then it prints the list. So it's running; it takes a little bit of time, you can see it moving forward.

I realise I'm going quite fast with the code here. Hopefully you will be familiar with writing Python code, or with using Jupyter notebooks. If not, I would encourage you to read all of the in-between text that I've got, where I explain what these code blocks do, why we're doing them, and why we're structuring them the way they are. But we do have our answer: each of these comes out as normal. Normal is the best fit for all four of these subsets. So male height, male weight, female height, female weight, they're all normal distributions. That makes it easy, because we don't have to worry about making some distributions exponential or power law or something like that; we can just treat them all as normal.

So we're finally getting into creating synthetic data. And I would say it is important, even if you're familiar with your data set, that you make sure you have good documentation of the features of your data set before you start making synthetic versions. So, what actually is the mean for this subset? What is the distribution for this subset? All the features that we think might be important to recreate in our synthetic versions — we need to know the real ones. I also want to say that low fidelity synthetic options are quite random. I kind of think of them as: you create a box and you just chuck stuff into it, and you'll see what I mean. Whereas medium fidelity options are about understanding the structure of the real data and then creating that again in a synthetic version. So we'll go into this. We create three different low fidelity synthetic versions: we have lowest, lowish and low. Lowest — quickly run this — just gives us three columns of random numbers. That is the lowest fidelity data I could think of to make. For gender, it's not divided into male or female, it's just numbers. For height, obviously we don't really get negative heights or weights, and I haven't told this not to do that. So it's just creating some random stuff. This is obviously very inaccurate, but it would allow us to test whether we can download and upload this file type, for example. So it is not without value.
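In code, that sort of very-low-fidelity generation is only a few lines. Here's a sketch using numpy, where the sizes, the bounds and the output file name are illustrative rather than the demo's exact values.

```python
# "Lowest" fidelity: columns of unconstrained random numbers with the right
# column names. Mainly useful for testing that a file of this shape can be
# written, uploaded, and read back.
import numpy as np
import pandas as pd

rng = np.random.default_rng()
lowest = pd.DataFrame(
    rng.standard_normal((100, 3)), columns=["Gender", "Height", "Weight"]
)
lowest.to_csv("lowest_fidelity.csv", index=False)

# The "lowish" step adds plausible bounds, e.g. uniform draws between a
# minimum and maximum instead of unconstrained values (bounds illustrative).
lowish_height = rng.uniform(low=140, high=200, size=100)
```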
It's just without much practical value for the kinds of things we're probably going to want to do. Let's create a scatter plot just to see what it looks like: a big cloud of multicoloured dots, so not much like our real scatter plots. There's an extra-credit section here if you want to change the number of columns you create.

Now, lowish. This is more realistic in that we're naming the columns, and this time we're putting a minimum and a maximum on the values, so it's not quite so random. Gender is still all numbers — still not male and female — but at least there are no negative heights or weights. Let's have a look at the scatter plot. It's still a random cloud, but you can see from the scales that at least nothing is negative. So that's lowish, the middle of our low-fidelity versions; there's some extra-credit work here if you want it.

And now our low-fidelity step. There's a preliminary step here because, even though it's low fidelity, we're making it slightly more realistic. First I create a variable for height with a minimum and a maximum derived from the real data — I went back to our descriptive statistics and found the min and max for height, and the min and max for weight — and I create 50 male and 50 female rows. I then bind that into a data frame and print it. This time gender really is male and female, and the heights and weights are much more realistic than our initial random numbers. A quick look at the scatter plot: it's still a cloud, but now in only two colours instead of many. So we're getting more realistic, even though we're still very, very random.

Okay, now we're into medium fidelity. Oops — sorry, didn't mean to double-click that. There we go. Again we've got three versions: medium fidelity one, two and three. For medium fidelity one we make a very simplistic regression model of the data and use the mean and standard deviation of height from the real data to generate matching synthetic data points. What we're doing is teaching a linear regression model: if you have this gender and this height, this is the relationship to weight. Then we create random gender and height values with roughly the right ranges, feed those in, and it predicts weights that should match according to the regression model it has built. Let's see what that looks like. Note that gender is coded as zero and one, because the model considers gender and height when it predicts weight, and it wants numbers rather than categorical variables as input.

Then we test how accurate our output is compared to the regression model — that's what the model score is. Taking the height and weight from the generated data: how well does it match the regression model? About 0.9, so pretty good. Not surprising, because it's a fairly simple model. And then what have we got here? Sorry, I'm a bit lost — hang on, give me just a moment.
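A minimal sketch of the medium-fidelity-one idea described above: fit a simple linear regression on the real data, then generate synthetic heights around the real mean and standard deviation and let the model predict the matching weights. Variable and column names are mine, not necessarily the notebook's.

```python
# Medium fidelity one, sketched: regression on real data, predictions on
# synthetic inputs. Column names and file name are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("height_weight.csv")                     # hypothetical file name
df["gender_num"] = (df["gender"] == "Male").astype(int)   # model wants numbers, not labels

# Fit the simple regression: gender + height -> weight
model = LinearRegression().fit(df[["gender_num", "height"]], df["weight"])
print(model.score(df[["gender_num", "height"]], df["weight"]))  # ~0.9 in the workshop data

# Synthetic inputs: random genders, heights drawn around the real mean and
# standard deviation; the model then predicts a weight for each row
n = 100
synth = pd.DataFrame({
    "gender_num": np.random.randint(0, 2, size=n),
    "height": np.random.normal(df["height"].mean(), df["height"].std(), size=n),
})
synth["weight"] = model.predict(synth)   # every point lands exactly on the regression line
```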
This time we're creating another random starting point: height drawn around the mean height from the real data, with the real standard deviation, and weight left empty. No, wait — is this only making the regression model? Yes, sorry, this isn't making the synthetic data yet. This is making a regression model that matches the real data: we save height and weight as arrays, fit the regression model, and test how well it fits the data it was trained on — about a 90% fit for itself, not for the synthetic version. Here we create the synthetic version, with weight not filled in because we haven't predicted it yet. This next step predicts it: it takes the synthetic gender and height, guesses the weight, and overwrites the weight column with the predicted output. So here we have synthetic gender, synthetic height and predicted weight. Let's have a look at the scatter plot. Oh my, it's very, very linear. It's synthetic, it broadly matches the relationships we found in the real data, and it's distinct for male and female — but it doesn't look much like real data, does it? And that's a bit of a problem.

So this time we'll do it again, but add noise so it looks more realistic. We repeat the same starting steps but add another column for noise, convert that to a data frame, and then, using the same linear regression model — because we already have the model fitted on the real data — we predict again. This time, when we overwrite weight, there's a step that says the current prediction equals the current prediction plus the noise. It takes the gender and the height, predicts a weight, and then adds the noise; noise can be positive or negative, so the predicted weight moves out in a cloud-like fashion from the straight line we saw in medium fidelity one. And there we go — it's actually not bad. It's much more cloud-like, it still contains the relationships, and while it doesn't look exactly like our real data, it's approaching it. It looks much better.

And so, medium fidelity three. The real problem with medium fidelity two is that male and female both use the same basic range, whereas in the real data one group — whichever one the purple dots represent, I don't know — happens to be taller and heavier than the other. So it's not quite capturing the real structure. In the third version we split the data into male and female subsets, create a linear regression model for each subset, and then predict the weight, with some noise added as well. In theory this should look much more realistic. So: we copy the male-only data set we used back in the visualisation phase; convert gender to a number, because the regression model wants numbers, not the word male or the word female; save the male-only mean and standard deviation for height; convert the columns into arrays; and fit a linear regression model on those arrays. We then create another dictionary — this one is going to be our synthetic one — in which we fill the height with draws from a normal distribution around the male-only mean and standard deviation, and convert it to a data frame. If you want to check it, you can uncomment the print by removing the hash, but the output gets quite big at this point. And we do the same for females. Oh, no, sorry.
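Here is a rough, self-contained sketch of the medium-fidelity-two step: the same kind of regression model, but with random noise added to each predicted weight so the points spread out from the straight line. The file name, column names and the noise scale are illustrative choices, not taken from the notebook.

```python
# Medium fidelity two, sketched: regression predictions plus noise.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("height_weight.csv")                     # hypothetical file name
df["gender_num"] = (df["gender"] == "Male").astype(int)
model = LinearRegression().fit(df[["gender_num", "height"]], df["weight"])

n = 100
synth = pd.DataFrame({
    "gender_num": np.random.randint(0, 2, size=n),
    "height": np.random.normal(df["height"].mean(), df["height"].std(), size=n),
})

# Predict the weight, then add noise (positive or negative) so the points
# spread out from the regression line in a cloud-like way. The noise scale
# below is an illustrative assumption.
noise = np.random.normal(0, df["weight"].std() * 0.3, size=n)
synth["weight"] = model.predict(synth) + noise
```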
This overwrites the weight: it takes the height and the gender, predicts a weight — here's the noise-adding phase — and then writes that weight back in. So we have gender, height and weight. In this one, you'll notice I don't have a separate column for noise; I add noise by taking the current prediction — the weight the model estimates for this gender and this height — then adding a random amount of noise and then subtracting another random amount of noise. Because noise can be negative, you might end up effectively adding twice or subtracting twice, so in theory this could move a point very far from the straight line, or move it away and then back towards it. We'll see how it goes. So that's what it looks like. We run through the same for the female-only subset — and there we go, we've got some negative weights. I don't know how that happened; it could be a flaw in my code. I'll have to fix that and upload a corrected version to the GitHub repository. But let's have a scatter plot to see what it looks like. Yes, definitely some negative weights — I'll have to correct my code — but the points are fitting the pattern. They're a bit noisy, though not quite as noisy as if I'd purely added or purely subtracted, because with the two-phase noise, where I add a random amount and then subtract a random amount, points can potentially move back towards the middle. All right, that, in theory, works through the code, and there's some extra credit there too. Let me check whether we have negative weights in the male data as well — we do. Something's wrong in my code; I'll fix it and upload a corrected version to the GitHub repository, probably by the end of the day.

Okay, what do we have? Any more questions? "In selecting the best-fitting distribution, why do we have the same scale and location?" Let's go back to the distribution section. Here we are: the best-fitting distributions. I don't think I specify scale and location here, so if you don't tell it to use different ones, it presumably just assumes the same. If you have a good reason to use a different scale and a different location, then you can. Honestly, I'd have to read the documentation for this package — which package is it? From way back at the beginning: fitter. You'd have to read the fitter documentation to see what it says about scale and location, how you specify them if you want different ones, and why you might. I only used the basic fitter function, so I didn't look into it in more advanced terms.

All right, we've got lots of time for questions. I'll stop sharing and turn my camera on — yes, there I am. I can share my screen again if you want to go back over any of those examples, or go back to the slide deck if there's something you want to see there, or anywhere else on the web if you have another relevant question. You can ask in the Q&A or in the chat, I suppose, because now I can see the chat. Okay — Nadia has suggested the fitdistrplus package in R, so if you don't want to use Python, all of these functions are available in R as well.
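On the negative-weights bug mentioned above: the corrected notebook is the place to look, but one possible fix — my own illustrative suggestion, not the author's code — is simply to keep synthetic values within the range seen in the real data.

```python
# Illustrative fix for the negative-weight bug, not from the workshop code:
# clip a synthetic column so it never falls outside the real data's range.
import numpy as np
import pandas as pd

def clip_to_real_range(synthetic: pd.DataFrame, real: pd.DataFrame,
                       column: str) -> pd.DataFrame:
    """Clip a synthetic column to the min/max observed in the real data."""
    lo, hi = real[column].min(), real[column].max()
    synthetic[column] = np.clip(synthetic[column], lo, hi)
    return synthetic

# e.g. clip_to_real_range(synthetic_females, real_females, "weight")
# An alternative would be to redraw the noise for any rows that go negative.
```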
We get asked a lot about the value of using Python or R, one instead of the other, and it really comes down to which you prefer and which is the most common language in the team you're working with, because for the most part they do a lot of the same things. At a very high level there's probably some stuff Python does better and some stuff R does better, but if you've not run into those differences yet, don't worry about it.

Okay, Susan has replied to the fitter question: it's because this is a normal distribution that we use mean and standard deviation; other distributions have other parameters that matter more. Yes, you're absolutely right — that's why, when we created the synthetic height column as input to our model in medium fidelity two and three, we generated it from a mean and standard deviation: because height is a normal distribution. If that doesn't make sense, I'll try to explain it again, though I worry it may still come out as gibberish.

"Female weight actually follows a gamma distribution — maybe that's the reason you got negative values for female weights at the end." Well, unless I've used fitter wrongly, which is entirely possible, it returned normal for all four variables. But yes, it's absolutely true that if something follows a different distribution, you have to generate it differently — oh, I had a typo? All right, I'll double-check that. Or if you use GitHub, you can push a change to the repository to tell me where the typo is. It's absolutely true that if you don't explore your data well and don't find the right distributions, your synthetic data won't look very realistic, and that probably explains why some of my medium-fidelity synthetic data looked less than totally realistic. I'd also say that the better you understand your real data, the more capable you are of creating useful synthetic data — and understanding it well will also tell you whether being faithful on every point actually matters for the question you're answering.

"How can we assess if our data set is small, or if we need synthetic data to expand it?" That's a very good question. I'd say extensive testing. You can, of course, make your code available to the wider world and publish a paper making outrageous claims, and lots of people will try very hard to prove you wrong — that's one way of finding problems in your data. In a less antagonistic sense, you can simply ask people who are interested in the same questions whether they can find problems in your data, your code, your method, your regression model, for example. There is some question about whether more data is always better, and it isn't. But if your data is not representative, that's a reason you might want to augment or expand it — make some synthetic data to fill the gaps, like the skin-lesion AI we discussed earlier. We know the photographs used to train that AI were not representative of the worldwide population, so there's a moral obligation: if we want to use it on people all over the world, we have to make the data representative. So if your data is not representative of the population it's meant to come from, that's a good reason to augment it. It's really hard to say, in absolute terms, whether a data set is too small.
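If, as the questioner suggests, female weight really does follow a gamma distribution rather than a normal one, here is roughly how you could fit and sample from it with scipy instead. This is a hedged sketch of mine, not part of the workshop notebook; the file and column names are assumptions.

```python
# Fit a gamma distribution to female weight and draw synthetic values from it.
import pandas as pd
from scipy import stats

df = pd.read_csv("height_weight.csv")                    # hypothetical file name
female_weight = df.loc[df["gender"] == "Female", "weight"]

# scipy returns (shape, location, scale) for the fitted gamma
a, loc, scale = stats.gamma.fit(female_weight)

# Draws are bounded below by `loc`, so as long as the fitted location is
# non-negative this avoids negative synthetic weights
synthetic_female_weight = stats.gamma.rvs(a, loc=loc, scale=scale, size=50)
```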
It has to be big enough to do what you want, and you should be clear about what you want — that's the best way to answer criticisms that it's too small: say, well, I wanted to do this, and it did that quite well.

Okay. So I will double-check my code — always happy if someone else wants to correct it, since there are obviously some errors in there — and I'll try to get a corrected version back up by the end of the day or tomorrow, depending on how hectic the rest of my day is. But yes, some interesting questions here about how small is small and how big is big. I think another reason to augment a potentially small data set is if you expect things to change in the future. It's possible that height and weight are distributed one way now, but if there's some change in policy at the governmental level about making healthy food available — recently in the UK, giving all primary school children free lunches — then we might expect height and weight to change, because that may mean fewer children are malnourished, which could raise the average height once this cohort of school children becomes adults. So we might expect the distributions to change, and we can create data that matches our expectations and test that our code still runs, or that we can still detect the interesting patterns. That, for example, is a very good reason to create synthetic data: when you're expecting a change from what you have now. A small sketch of that idea follows below.

Okay, Louise has shared the link to the GitHub repository again in the chat — that's always good. We've got some participants dropping off, and I don't blame you, because you've been here for almost two hours; it's fine if you've got other things to do. I'll be taking questions for another 10 minutes or so. You can ask me to go back to the code or the slide deck if there's something you want to discuss, or you can just shut your laptop and go and get a coffee — it's up to you.

Okay, yes. If you're new to Python coding, I do think it's useful to try out all kinds of different situations for where you might use these commands and what the variables are. Even if you're not interested in creating synthetic data — if your takeaway is "it's helpful to understand what it is, but I don't think I'll need to create it" — you can still go through the GitHub code as a lesson in Python and using code; that's a valuable use of it as well. Thanks, everybody — you're very positive in the comments, I appreciate that. And I do apologise again if there was a lot of banging from the construction down my street. I'd also like to say that my puppy, who's been asleep on the floor this whole time, has not been making much noise, which is not what I expected of her; usually over two hours she would at least demand a treat and a trip outside for a wee at least once. Thank you very much, everyone — I really appreciate you attending, participating in the Mentimeter, asking good questions and being in the chat.

We've got another question: is there any standard regulation for synthetic data? I don't believe so, although this probably differs between jurisdictions, so if you're working in the UK, the European Union or the US, there might be different standards.
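As referenced above, here is one small, purely illustrative way to create synthetic data that reflects an expected future change — for example, a policy you think will raise average adult height. The shift amount, file name and column names are made-up assumptions, not real estimates or the workshop's code.

```python
# Generate a "future" height column under an assumed shift in the mean,
# then re-run your existing pipeline on it to check it still behaves.
import numpy as np
import pandas as pd

df = pd.read_csv("height_weight.csv")        # hypothetical file name

expected_shift = 2.0                         # made-up amount, in the height column's units
future_height = np.random.normal(df["height"].mean() + expected_shift,
                                 df["height"].std(),
                                 size=len(df))

# e.g. swap this column into a copy of the data and confirm your code still
# runs and the patterns you care about are still detectable
```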
I think it's good practice to be very thorough and clear in your documentation: that the data is synthetic, in which ways it is and isn't faithful, how it was created and why, because all of that helps people judge whether it's useful for them. But as far as I know, that's best practice, not any kind of regulation or standard.

"This might sound a bit off, but can synthetic data be used to generate text in low-resourced languages?" Probably. I don't know that the output would be very good, because generating text depends on a lot of input data, but it certainly can be used to generate text — the text might just not seem very realistic to a native speaker of that language. Essentially you'd be creating a sort of ChatGPT, and ChatGPT is built on many years of extensive text-generation development. You can absolutely apply some of that methodology to a low-resource language and generate output; whether you'd want to depends on what you want to do with it. Okay, let me mark that as answered.

"What do you think about using SMOTE for generating data for imbalanced data sets in classification?" I'd have to look up what SMOTE is — let me do that now. An oversampling technique where synthetic samples are generated for the minority class. That sounds like a way of balancing data sets by mixing real observational data with generated data, which is a perfectly valid thing to do. Again, you should be very clear about how your synthetic data was generated, and perhaps add a column to your data set to mark which rows are generated and which aren't. Classification is tricky, though, because simply being generated can give the synthetic rows features that make them detectable by the classification algorithm — you might, for example, generate something the classifier simply learns to recognise as generated, which may or may not be what you want. It depends on your data, your purposes, what you're classifying and why. But yes, oversampling is definitely a possibility.

"What are the most important aspects to look out for when trying to generate extrapolated synthetic data, i.e. data that can't be observed due to lack of hardware, physical conditions, etc.?" This one's interesting and tricky. Because you're extrapolating things you don't have real data for, faithfulness is really hard to gauge, and there will always be somebody telling you you're doing it wrong. But again, it comes down not to "is it faithful?" but "is it useful?" Does it help you? Does it test a theory, or a potential solution? Does it give you a reason to scale up to a real-world application where you can start getting real data, or at least check some of the outputs of your synthetic data against real data? It's a tricky one, and ultimately it comes down to how well you can argue that your synthetic data is useful for your research purposes. Very good question, though.

"Is there an official site for statistical synthetic data in the UK?" The UK Data Service is working on this, the ONS has some, and I believe Edinburgh University has quite a few health-based synthetic data sets, but there isn't one official site for synthetic data in the UK.
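For context on the SMOTE question, here is a small sketch of the idea using the imbalanced-learn package. This is my illustration, not something from the workshop notebook, and the toy data is entirely made up.

```python
# SMOTE generates synthetic minority-class samples by interpolating between
# existing minority-class neighbours, balancing the classes before training.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# A deliberately imbalanced toy classification problem
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                       # roughly 9:1 class imbalance

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_resampled))             # classes now balanced
```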
Mostly, individual research projects may or may not have a synthetic version available. And actually, this is interesting: at the N8 we're developing a synthetic data hackathon, hopefully next October — I believe that's the proposed date — in which we want to poll people on what synthetic data they need and would want to use, and then actually try to create some of it so it can be used on Bede, the big high-powered computer you can push code and data to for processing at scale. So if you're interested in getting access to some good synthetic data, stay tuned to the N8 and sign up for the synthetic data hackathon in October.

Okay. Lots of people have gone about their day; there are a few still hanging about, which I'm pleased to see. Please do send in more questions if you have them, and let me know if there's anything I can do or show you. Thank you very much for attending — I'm pleased that people have found this useful. I'd be interested to know, if you don't already use synthetic data, whether you're interested in learning to generate it and what kind of purposes you might use it for: is it data preview, presentation, code development, remote working — all the different purposes we covered? We don't have the exact dates for the upcoming synthetic data hackathon; Marion — I don't know if she's still one of the participants — might be the one to ask. Let me see if there's an N8 mailing list. It doesn't look like there is, but there is a Twitter account, so you can follow me on Twitter, or follow the N8 directly — I follow them, so you can find them through me if you can't find them directly — and they'll post notifications about upcoming events and how to sign up for workshops. Plus, however you found this workshop, you should get notified about future ones.

"Synthetic data mainly for training machine learning models" — that's a really common one. Lots of people use synthetic data for training machine learning models, because many models learn best with large volumes of data; if you don't happen to have large volumes of real data, you fake some and train on that. Very common, really popular.

And we've got a question: do you know whether there are research groups working on how to differentiate synthetic data from real data? I don't. I'm not sure it's a problem yet — I don't know that many people are using synthetic data in ways that could be confused with real data. As I say, lots of people use it for machine learning, but I don't know of anyone using it in ways that directly impact the public, so I don't know that it's a problem that has needed solving yet. But if you're interested, you might be among the first addressing it as a potential problem. It would be quite interesting to see whether you could build a way to differentiate synthetic data from real data. Some synthetic data is really easy to differentiate because it's obviously synthetic — people's first names given as strings of random numbers and letters that no one actually has, or names like Fakie McFake; that's not a real name. Some synthetic data is deliberately, obviously synthetic, and in that case it's quite easy to differentiate.
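Purely as an illustration prompted by that question — not something covered in the workshop — one simple way to check how distinguishable your synthetic data is would be to label real rows and synthetic rows and see whether a classifier can tell them apart much better than chance. File and column names below are assumptions.

```python
# Can a classifier tell real from synthetic? A score near 0.5 suggests the
# synthetic data is statistically hard to distinguish; near 1.0, easy.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

real = pd.read_csv("height_weight.csv")                 # hypothetical file names
synthetic = pd.read_csv("synthetic_height_weight.csv")

features = ["height", "weight"]
X = pd.concat([real[features], synthetic[features]], ignore_index=True)
y = [0] * len(real) + [1] * len(synthetic)               # 0 = real, 1 = synthetic

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean())
```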
Yes — people whose address is on the moon, that's quite fake, I hope. But something that's trying to appear more realistic, that would be interesting. Earlier there was a question in the Q&A about how to train something to differentiate fraudulent trade transactions, and that's an interesting one, because there must be some feature of those transactions that makes people suspicious or that indicates fraud — maybe the speed, the timing, the direction, or whether three of them occur at once, something like that. There must be features that make a transaction appear fraudulent, and likewise there must be things that make data appear synthetic, and in theory you could pick up on those.

"Adding noise to our linear regression predictions: did you consider uniform noise rather than normally distributed noise, or any other distribution of noise for that matter?" I did not, because this is a basic introduction to synthetic data, but that's absolutely appropriate. You might consider different noise patterns, because that's potentially a way to make your data much more realistic: adding more realistic noise. Noise can be distributed in just the same ways as anything else — it can be uniform, it can follow a power law, and so on. I added the basic option that comes with Python, which I believe is normally distributed noise, but you could specify how much noise and what kind of noise, and if you have the two-step phase where you add and then subtract noise, you might give each step a different distribution. So yes, very possible, and a very interesting prompt to think more critically about noise.

We're one minute away from done, so thank you very much to the 17 participants who hung on, trying to extract all the synthetic-data knowledge they could. Again, I'll direct you to our GitHub repo, where you can get access to all of this; if you clone it through GitHub, you'll get the updates when I correct the code. We might also add the video recording to the repo — I'm not sure, that might put us over the size limit because it's two hours. Well, thank you very much, have a lovely day, and I'm going to wave goodbye and then close this Zoom session.