Okay, everybody, welcome to the UK Data Service synthetic data code demo. My name is Joseph Allen, I'm a research associate here at the University of Manchester, and thank you all for coming.

To start with, we've got all of the resources this notebook is going to go through. In particular, I'd suggest getting the Binder link open; it can take a couple of minutes to start. There's also an RPubs page, so if you don't have RStudio or aren't interested in R, you can still see the R code we're going to run. There's documentation for all the libraries we'll use, which we'll come back to a bit later, and there are some introductions to the Natsal data that we'll mostly be working with.

In this code demo we're going to cover what the Natsal data actually is, because that context is so important when we're looking at synthetic data. I'll introduce Mockaroo, a web-based data generation tool; it has its limitations, which is why we'll move on to Faker, but until that point we won't really need much coding. Faker is the Python-based equivalent of Mockaroo: it can be deployed at scale, generate things in real time, and it's a lot easier to write custom logic with. Then we'll introduce synthpop, an R package that will actually do some synthesis for us. And finally we'll put it all together and synthesize a new dataset.

As I said, we're mostly going to be working in Python, particularly with pandas, but I'll try to break that down, and there will be a bit of R. To start with, we need to import any packages we're using beyond base Python: numpy, which gives us support for large arrays (we'll generally only be using 2D tables anyway), pandas, matplotlib, and Faker, the data generation library we'll use slightly later. If you're not familiar with Python: we're importing these packages, and `as` lets us name an alias. In almost any demo of numpy, pandas or matplotlib you'll see the same aliases: np for numpy, pd for pandas, plt for matplotlib. If any of the imports haven't worked, you should be able to install the packages with an exclamation mark followed by pip install and the package name, but we shouldn't need to run that cell. You should have seen a little star next to the cell and then a number; that just means it's run. So all of these packages have imported successfully, and that should work for you in Binder as well.

Now let's look at the Natsal data. There are links in the Binder that will take you to the pages on the UK Data Service where you can get this data, but if you're on the Binder link or the GitHub, you should already have access to it, so we won't need to touch that. I've also created a modified version that we'll be using for this; it's all in that GitHub, along with how it was created. First we need to read in the data: we create a variable called dataframe, use equals to assign to it, and then use pd, the alias we set for pandas, to call pandas' read_csv function and point it at the dataset in the Natsal folder. If we press tab, we should get some autocomplete options here, and that works in Binder as well, which should make things a bit simpler for you.
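Roughly, those setup cells look like this. The CSV path is my assumption about where the file sits in the Binder, and I'm using the df shorthand from the start:

```python
import numpy as np               # support for large arrays; we only really use 2D tables
import pandas as pd              # dataframes, for all our data manipulation
import matplotlib.pyplot as plt  # plotting
from faker import Faker          # data generation, used later on

# If any import fails, install the package from inside the notebook, e.g.:
# !pip install faker

# Read the modified Natsal data into a dataframe.
# "natsal/natsal_with_personal.csv" is a placeholder path, not the exact filename.
df = pd.read_csv("natsal/natsal_with_personal.csv")
```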
Then we can call dataframe.head(), a method on all pandas dataframes that returns the first five rows, so we can do some visual inspection and make sure everything looks all right. It looks like it's read our headings, and things seem okay. Another general pandas convention: instead of "dataframe" we'll usually use df as a shorthand. You might also see it called data, or test or training data, depending on what we're doing with it, but for the majority of this session that's what df means. I've left some empty cells with prompts if you want to write the code yourself, but there are cells underneath with everything you need, and you can run those with Ctrl+Enter.

So, same stuff: reading in the dataset and printing the first five rows. We can also call tail() to get the last five rows. In any exploratory data analysis, these would be the first steps I'd recommend: can we read the data? Do the first five rows make sense? Do the last five rows make sense? Data can trail off or do weird things at the end if it wasn't handled properly. The only concern at the moment is that age at first child has these NaNs, perhaps implying those respondents haven't had a first child; everything else looks about right. I might be wary of these true/false values, but because the column is called "has child", that true condition makes sense, and it does correlate with the NaNs, so it adds up. Looking at the names, the emails seem to make sense too: Frankie Botrill has an email like f.botrill.ju at wikispaces. The domains look a bit weird, but that's because these are synthetic emails as well; I just haven't restricted them to the Gmail and Hotmail domains we might expect from individuals.

Next I'd suggest running df.info(). That lists how many rows there are, what the index runs from and to, how many columns we have, and what data type each column is. In this case they're almost all objects, with the exception of that numeric column and that Boolean column. We can also see the missing data per column: these ones are all fully populated, but email is missing about 100 rows for some reason, and the opinion questions, like whether a one-night stand is okay or whether sex without love is okay, seem to have been opted out of a few times. Sorry, I just got a delay there. So there's just some missing data where people have chosen to opt out; maybe ten or so have opted out consistently.

The other one I'd suggest running is df.describe(). This gives summary statistics, and as we only have one numeric column it's not too useful here, but in any other exploratory analysis it's a really useful starting point: the average age someone had their first child, the minimum, the maximum, and the quartiles. On average, people here had their first child at 25, if they've had one. And what I'd be looking for here is: what's interesting to you? Are you noticing anything? Are you thinking, say, I'm curious to see the difference between females and males in their opinions on one-night stands, or on whether they think religion is important in their life?
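As a minimal sketch, those first EDA calls are:

```python
df.head()      # first five rows: do the headings and values look sensible?
df.tail()      # last five rows: data sometimes trails off or breaks at the end
df.info()      # row count, index range, column dtypes, and non-null counts
df.describe()  # mean, min, max and quartiles for the numeric columns
```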
Trying to build up a research question from the data might be a little backwards, but that's where I am now.

I've already renamed all these columns; a lot of them had quite strange names that weren't easy to recognise at a glance, so you couldn't really inspect the table and make sense of who you were looking at. And if there's anything in here you're not interested in, I'd suggest dropping it from the dataset as soon as possible; we can always come back to it later. You can list all the existing columns, you can select certain columns to effectively drop the ones you're not interested in, and you can rename everything by assigning a list of column names. This isn't renaming anything at source, so it's not too much of a problem. We can log it out again just to see; you won't really notice anything, but I've dropped the religion question among some other things, just to keep it simpler. There's a breakdown of what the dataset actually contains, but I'm going to whiz through it, otherwise we won't get to any synthesis. There's some interesting opinion data: whether people think it's okay to have sex without being in love, whether they feel under pressure to have sex, whether there's too much sex in the media, whether men naturally have a higher sex drive than women, and so on.

Next, we can access individual columns from the dataset. If we do df.age_group, it lists the entire column: we've got 25, 16, 25, and pandas assumes we don't want to see everything, just the head and the tail. What we can do next is call value_counts() on that age group, and that's a bit more interesting: we can see how many rows we have in each category. But there's no implied order to those categories other than descending by size, and we'll see our plots do the same thing to make them look nicer. As humans, we know there's an ordinal nature to this data: 16-to-24-year-olds are obviously younger than 25-to-34-year-olds, and it would be nice to see that in a visualisation.

Counting isn't going to be useful when we have lots of distinct values, though. For example, if we do value_counts() on the first name column, we've got about 3,000 unique names in 3,800 rows; it's not really going to be useful to plot that. Oh, that's interesting; I guess it just took some of the most common ones. Oh no, that was a mistake, I shouldn't have run that: it's going to try to plot about 3,000 different bars. I don't know if it'll get there; let's just carry on. So yes, we can call plot() on those value counts. Done. I don't think it'll take too long, but let's give it 20 seconds or so. There it is: our 3,000 unique names on a chart that hasn't worked very well. By default it's assumed we wanted a line plot, which isn't necessarily the case.
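In code, those steps look roughly like this (the column names are my guesses at the renamed headings):

```python
# Keep only the columns we're interested in; dropping early keeps things simple.
df = df[["first name", "last name", "email", "sex", "age group", "has child"]]

# Count the rows in each category; sorted by frequency (descending) by default.
df["age group"].value_counts()

# Calling plot() on the counts defaults to a line plot, which rarely suits
# categorical data; a column with ~3,000 unique names fares even worse.
df["age group"].value_counts().plot()
```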
By default, the plot also puts the bars in order of largest to smallest. We can get around this by asking for a particular kind of plot, which you just saw me do: kind='bar' gives us a bar chart instead, a little easier to read, but we've got no labels or anything like that, it's small and a bit dull, and we should probably have some ordering.

matplotlib has a lot of built-in styles: if you print plt.style.available, you'll see a list of them. I won't go into too much detail; I've just picked one, made the font size and the figure size a bit bigger, and turned off sorting. If I plot now we get a much larger plot, a bit easier to read, but the ordering is still wrong; that sorting wasn't what we wanted. And if we turn off sorting in value_counts we see the same kind of order; it's not done anything clever there.

So what we can do is create our own categorical data type, and I'll only go through one example of this. Let's do age: category_age equals a new categorical data type. That takes a list of the categories we're expecting, so these will be strings like "16-24". This can be anything we've seen in our other columns; for example, we've also got a notion of agreeing and agreeing strongly, and that needs to be categorical as well, but we'll deal with that shortly. We also need to say that there is an order, because this is ordinal data, so ordered=True. That should work, yes. Next we need to set the column to have that type, so we set the age group column to use this category_age, and that works fine. Now if we plot our graph without sorting, we get the order we wanted, and we actually get the plot we were after in the first place. If we forget to pass sort=False, it'll just re-sort it for us anyway; it's trying to do us a favour there. But I think it's important, and it's going to be a big theme throughout, that things need to be categories in pandas, or factors in R, to be synthesized properly, so that's useful to see. (I'll pull this whole step together in a sketch after this passage.)

We can also add a title; plot takes a title parameter, which makes things a bit clearer. So we can see we've got mostly young individuals up to the age of 35, and after 35 we seem to have a decreasing number of respondents in each age group. When we get to the synthesis and machine learning stage, this is potentially problematic, right? If we're mostly training on the opinions of younger people, maybe our model will learn to predict on the basis of those younger people instead.

Okay, I've got loads of material here that we'll mostly skip, but if you are interested in the data, you can run all of these cells and see the distributions. I'll go through the first few. There are slightly more females than males. There are far more white respondents than not-white respondents, and obviously it's a huge assumption that all not-white people share an opinion, or that all white people do. It's similar with sexual identity, where we've got heterosexual and not heterosexual, and again, things are a lot more complicated than that in reality, I would say.
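Here's that ordered-category step as one sketch; the band labels and the chosen style are my assumptions, not the notebook's exact values:

```python
# An ordered categorical type, so counts and plots respect the age ordering.
category_age = pd.CategoricalDtype(
    categories=["16-24", "25-34", "35-44", "45-54", "55-64", "65-74"],
    ordered=True,
)
df["age group"] = df["age group"].astype(category_age)

plt.style.use("ggplot")  # pick any name listed in plt.style.available
plt.rcParams.update({"font.size": 14, "figure.figsize": (12, 6)})

# sort=False keeps category order rather than re-sorting by count.
df["age group"].value_counts(sort=False).plot(kind="bar", title="Respondents by age group")
```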
We've also got relationship status. I'm going to run these categorical cells just to make sure those categories are defined for later on. I think that's everything we need there. Okay, cool.

Now we're onto disclosure control. At this point we've already seen there's quite a lot of personal data here: special category data like sexual identity and ethnicity, plus sex, first name, last name, and email, which is a direct personal identifier, which is very bad. We need to do something about this. If this weren't just a demo, my first instinct would be to drop first name, last name and email, unless there's something very specific we need from them. For example, in fraud detection it might be quite useful to extract the domain names from emails; that would be a valid reason to keep them. But that's not what we're doing here. So I'm going to make some demo columns, just to demonstrate what dropping looks like without needing to re-import the data and redo all that categorisation. I've duplicated those columns, and now I'll drop them immediately, just to show they were added and are now gone; they're no longer over here on the right.

Next we'll look at masking: replacing the parts of the data that are sensitive. We might replace whole columns, or individual rows. Again, we'll recreate those demo columns, and this time I'll just print the demo columns instead of the whole thing. We can overwrite an entire column by selecting it with this notation, the dataframe followed by open brackets and the string of the column name, and setting it equal to an empty value; then we can call head(). We've completely nulled out the column, and if I saw this in the original dataset, I'd assume we never collected this data in the first place. I definitely wouldn't assume it had been redacted for my safety or the respondents' safety; I'd assume something had gone wrong and we don't have access to this data. A better approach is what's called masking out, where we write something very clearly fake, like "FIRST NAME" or "FAKE FIRST NAME". Again, this is trivial to do; in Excel we'd do it with a drag and fill, and in Python we can do it in one line like that. We can also replicate the structure of an email or a credit card number, to make it clear this is something we intentionally took out: this is now test data, and in the documentation we might say we do have the real data in the original dataset, but we've removed it here quite intentionally.

The last thing is sampling. If we call the sample function with frac=1, it returns all of the rows in a random order, and then we take the values. If we run that and compare, all we've done is shuffle the last names; everything else is identical. On its own this isn't necessarily a good thing to do, because we're still making use of that personal data while trying to hide these individuals, but in combination with other techniques it can be quite useful, I think.
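Putting those three techniques into one sketch (the column names are illustrative):

```python
# Demo copies of the sensitive columns, so the original data stays untouched.
df["first name demo"] = df["first name"]
df["last name demo"] = df["last name"]

# Dropping: remove the columns entirely.
df = df.drop(columns=["first name demo", "last name demo"])

# Masking out: write something obviously fake, so a reader can tell the values
# were removed deliberately rather than never collected.
df["first name"] = "FAKE FIRST NAME"
df["email"] = "fake.email@example.com"

# Sampling: frac=1 returns every row, shuffled; .values drops the index so the
# assignment doesn't realign, leaving the last names detached from their rows.
df["last name"] = df["last name"].sample(frac=1).values
```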
Next we'll look at coarsening. Traditionally we'd apply this to geographic or numeric data. For example, if we had somebody's BMI, weight or height, we might round it to the nearest 10 kilograms or 10 centimetres; with geographic data, we might aggregate up to something like the city of Manchester instead of individual addresses. None of that really applies here, so as a quick demo I'll just showcase the really neat apply function in pandas, and we'll assume we want the first character of each last name.

What we need to do is write a function that takes some input and returns some output, and then we can apply it to an entire column. Element-wise, it will take each string, pass it in as the input, and overwrite that cell with the returned value. So instead of a generic apply_function, we'll write something like get_initial, since we're just getting the first character. It takes a name, in this case a last name, and we use slice notation to get the first character. Then we apply that to the whole column: df last name demo equals df last name demo dot apply, get_initial. Yes, it does tab-complete if you need it. And if I log that out, we now have just the initial of each last name. I don't know why they changed order in the meantime, but now we've got F for Fullerlove, B for Brayfield, and so on.

In fact, I'll rerun this to get our original data back, because instead of writing all of that, we can use what's called a lambda function. We write in exactly the same style, but instead of a named function we use the keyword lambda, then an input and an output. We don't need to name the function or set up that slightly complex structure; we can do it right here. The input is name, whatever value is passed in will be bound to name, and we return the character at index zero. That does exactly the same thing as the code above; it's just a bit quicker, though we've lost the named function, which I'd argue is much more readable. If I run this on the last names, there we go: the last names coarsened in a snappy one-liner.
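Both versions together, as a sketch:

```python
# Coarsening a name to its initial, applied element-wise over a column.
def get_initial(name):
    return name[0]  # slice notation: just the first character

df["last name demo"] = df["last name demo"].apply(get_initial)

# The same thing as a lambda: quicker to write, but the named version reads better.
df["last name demo"] = df["last name demo"].apply(lambda name: name[0])
```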
Okay, next we have mimicking. What we'd do here is find some curated source of first names, last names or emails and make use of it. We don't have one; we could use an Excel spreadsheet to sample from, or another dataframe or dataset, but we don't have those either. There's also a notion of adding noise when we're mimicking, so that we're not simply lifting rows directly from another dataset, but that becomes really tricky here: how do you add noise to names? You can't just swap name endings or capitalise letters; that doesn't really do anything to protect the individual. So this is where I'd suggest using some sort of curated source, such as an online data generation tool like Mockaroo.

So, Mockaroo. I'm going to do a little Mockaroo demo, and feel free to follow along; I'm sending the link in the Zoom chat, and on YouTube if you're watching there. We'll do some light tasks just to whiz through what Mockaroo can do; if you don't want to do much coding, this is probably the place to give it a go. The first task is to create first name, last name, email and gender fields. Let me zoom way in as well, sorry. We can remove the fields we don't care about, so I don't need an ID or an IP address, and we can reorder the fields to match the format of our dataset, to save us extra work. We can name the fields whatever we want, but I'm happy with these. And under type we get access to an amazing set of very, very complicated, very, very specific generators: drug names, currency codes, valid credit card numbers, NHS numbers, and even mathematical things like normal and binomial distributions. So we can do some really impressive stuff here, but for now we don't need any of that.

If we click preview, we can see the data we'll get: a first name, Anna Diane, a last name, Wilmore, and an email that connects the two, a.wilmore plus what seems to be a random number each time, at what look like random domains, probably from Mockaroo's own domain list. We can click download and it'll download that data. Notice there doesn't seem to be much correlation between the genders and the names, which is arguably fine, but depending on the dataset there can be quite a high correlation between these things, and if our goal is an indistinguishable synthetic dataset, I think it's useful to try to perpetuate those correlations where we can.

So that's task one. We won't actually need the gender field; what we can do instead is make use of gendered names. Mockaroo has all sorts of different first name types: European names, male names, female names, Chinese names. We can use this to force the gender relationship, so if I pick male first names, all of our names will be male. This matters because we have male and female in our dataset already.

Next we're going to fill in missing data. Remember our emails were missing about 100 rows. In this blank section we can force a percentage of values to be blank, so if I set 10%, our emails will be blank 10% of the time. That's a bit of overkill, though; we'll set it to 2% for now, because missing emails are quite rare. In the real world that missingness would probably correlate with something like age: very young people won't have an email address, and very old people might be less likely to use a computer. So it's a bit more complicated than just making values blank at a fixed rate, but that's all we'll do for now.

Next is some maths. We'll create a new field called weight, and we're going to make use of this normal distribution type.
We can set a mean, so let's say 65, a standard deviation of 5, 0 decimals and 0% blank. If I preview that, we've got a normal distribution of weights associated with names; again, no correlation with anything outside this column. We're also going to add a height column, another normal distribution. The heights should probably be a bit higher, maybe a mean of 150, and let's make the standard deviation a bit larger as well. So now we're essentially sampling two independent normal distributions every time we create a new user.

That isn't necessarily the correct way to do it, because weight and height do have some correlation. But Mockaroo also lets us write formulas: we can do all sorts of conditional logic, reference other fields, reference Mongo databases, pull dates, uppercase strings; there are loads of functions here. If we just type this, it returns exactly the value as-is, but instead we can make references to the weight field itself. Is it field or fields? field, brackets, quotes, weight. And I'll just assume that height is equal to weight plus 100. If we preview that, thinking back to that notion of indistinguishability, I might notice it and get a bit suspicious: it does look like height is directly inferred from weight. So is there more we can do? We could add the normal distribution we had to the weight, changing the mean, or we could use the built-in random function, which takes a minimum and a maximum. Let's do weight plus random with, say, 90 as the maximum. Now we've still got a relationship to that weight, while sidestepping the distribution question. It's not necessarily realistic, but it looks kind of realistic now: nothing obviously links them, though with some analysis you'd see that hard cutoff, and I'd imagine quite a uniform distribution throughout those random offsets.

I think that's all we need, so let's delete those fields, and yes, let's download. The preview shows the first 100 rows, but we'll get a thousand rows. In the Binder you've already got access to this at natsal/mockaroo.csv. Just checking how much I need: I've got 3,799 rows, so 1,000 isn't really enough. Let's generate 3,799 rows and download. Oh no, we get an error, because the maximum download size for free accounts is 1,000 rows. So what was feeling quite simple and easy has got a bit more difficult. We can work around it by generating multiple datasets, and we're going to have to do that anyway to deal with our males and females: download 1,000 at a time, two male datasets, then change the name type to female and download three female datasets.
And I'll just have to sit here slowly waiting for each one to generate, which is fine when there are only 4,000 rows, but if we had a million rows, you'd have to do this a thousand times. And if we were splitting on anything more complicated than male and female, that logic would get really heavy really quickly. Again, these files are all available already; I've added the male and female CSVs under the Natsal folder.

We can join these to our existing dataset. Oh no, here we go: this is the Mockaroo data with just the general names, no male or female distinction. If we join those in, we might see some conflicts: here Barbara Coop conflicts with a sex of male, and Hayley has ended up as a male name too. There will be cases where that's fine, but we are seeing conflicts, and that's why we need to deal with it. So I count them: I need about 2,000 females and just under 2,000 males. I drop the first names, last names and emails, and I split the dataframe on male and female. Now if I call male_df.head(), we'll only have our male data, and female_df will only have the female data. We read in our new datasets from Mockaroo and concatenate them, which simply stacks them on top of each other. I only need a certain number of rows, and I don't want any merging problems, so I count the rows I need, reset the indexes so they don't clash when they merge, and then concatenate it all in (sketched in code below). That gives us a new male dataframe with first names, last names and emails from Mockaroo; you can see those weird emails at the domains of real companies. We do the same with the females, and now we've got both: we merge them together, re-shuffle, because otherwise it would just be all the females stacked on top of all the males, and reset the index to hide that merge, basically. So now we've got females called Jory and Jillian, males called Chewy and Cori, and a female called Christina; it's all added up there. Do we still have our demo columns? No, we're good. I'm also going to rename those columns to something like "synthetic first name", "synthetic last name" and "synthetic email", to make it extra clear what we've done, that we've added in that personal data, and to cover ourselves on why we've done this.

If that felt technically doable, notice that a lot of the complexity came from trying to use Mockaroo at all, so I'd suggest just doing it in Python yourself rather than going through all that Mockaroo juggling, if that's comfortable for you. We had a lot of complexity just from splitting on male and female, but we could have split on ethnic group and sex: that would give us four different datasets to make and merge back together. Add sexual identity and we'd have eight; add the age groups and we'd have six times eight, 48. All we're doing here is trying to get these columns to correlate with the first name, and it's not trivial. As I said, if we had to generate a million rows, that's dozens of different datasets and thousands of clicks on that download data button.
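Here's the splice-and-merge workflow as a sketch; the filenames and column names are illustrative, since the Binder ships its own male and female CSVs under the Natsal folder:

```python
# Split the real data on sex and drop the identifying columns.
identity_cols = ["first name", "last name", "email"]
male_df = df[df["sex"] == "Male"].drop(columns=identity_cols)
female_df = df[df["sex"] == "Female"].drop(columns=identity_cols)

# Stack the 1,000-row Mockaroo downloads, trim to the rows we need, and reset
# the indexes so the column-wise concat lines up instead of clashing.
mock_males = pd.concat([pd.read_csv("natsal/males1.csv"), pd.read_csv("natsal/males2.csv")])
mock_males = mock_males.head(len(male_df)).reset_index(drop=True)
male_df = pd.concat([male_df.reset_index(drop=True), mock_males], axis=1)
# ...same again for female_df with the three female downloads...

# Merge, re-shuffle so it isn't just females stacked on males, and hide the join.
df = pd.concat([male_df, female_df]).sample(frac=1).reset_index(drop=True)
```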
What would be a bit easier here is using Faker instead, generating these individuals based on the data we're seeing in the table rather than on Mockaroo's decisions. So we're done with Mockaroo now.

Next we'll look at Faker. We can initialise a Faker object, and there's a link to the documentation here if you need it, by the way. We can just call fake.name() and it will generate a new name every time we run it. There are plenty of other functions: we can get addresses, and by default it assumes we want American people at American addresses. We can run it in a for loop to generate a batch of names. Faker also has a list of standard providers, and there's loads of stuff in there to look at: address breaks down into building numbers, cities, suffixes, all sorts. If what you want isn't there, there are community providers as well; people have written their own separate packages just for air travel, just for music, for all sorts of different things. It's very useful. We can get barcodes; free emails, which generates addresses at Hotmail, Gmail, Yahoo and the like; we can even generate a full credit card, and it's trivial to keep going. We can also import those additional community providers; there's an internet one we can install to get IP addresses, things like that.

A huge perk over Mockaroo is that we have different locales. We can set Faker to an Italian locale and it will start generating Italian names and Italian addresses instead; again, they'll change every time we run it. We can also use a combination of locales, so we can generate Italians, Americans and Japanese people, and it does this at random: you might get all Japanese, you might get all Italian. You can set a seed, so for reproducibility's sake every run gives the same names; you can show that you didn't insert those names yourself, that they came from Faker, and that the result can be consistently reproduced. But if we set those locales and generate first name, last name and email all separately, we'll see there's no consistency between them: a Japanese female first name, an American last name, and then what looks like a Japanese first name.
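A minimal sketch of those Faker basics (the outputs differ on every run unless seeded):

```python
from faker import Faker

fake = Faker()           # defaults to an American (en_US) locale
print(fake.name())       # a new name every call
print(fake.address())    # an American-style address

Faker.seed(0)            # reproducibility: the same output on every run

fake_it = Faker("it_IT")                         # Italian names and addresses
fake_mixed = Faker(["it_IT", "en_US", "ja_JP"])  # randomly mixes three locales
print(fake_it.name())
print(fake_mixed.name())
```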
So what we can do is write a custom provider and make these decisions based on what we're seeing in the data. We import the base provider and set Faker to be English, because we can't add providers without a locale set. We make our own class, let's call it NewProvider, which extends the base provider we've just imported, and we define an individual method. As well as self, it takes sex, and we say: if sex is equal to male, generate a male name, first_name equals fake.first_name_male; there's a function for almost everything in Faker, so if you want it, it's likely already there. Otherwise, we assume they want a female name: first_name equals fake.first_name_female. We generate a last name, which, in most cultures anyway, is independent of the first name: fake.last_name. And then the email: we build our own using that first name. I take the first character of the first name, add it to the last name, add the at character, and then add an email domain, which again Faker has a function for; I'll just grab it from here, and we return the whole thing. We add the provider to Faker, it's called NewProvider, and then I should be able to call fake.individual here. I think it'll error if I don't pass anything, but let's see what it does. Yep, it's requiring that positional argument sex, so I need to give it "male", and now it's making a male, Cameron Simpson. For female, we get Joanne West, with an email, and those emails are always from Hotmail, Yahoo, yahoo.co.uk, Outlook.

What I've got here is a slightly more complicated version that also checks the ethnic group we gave it, and if they're not white, we'll assume they're Spanish; obviously a huge generalisation that doesn't represent any dataset, or myself, or any organisation, really. But from there we can generate a white male, or a not-white male, which will generate a Spanish name. Ooh, I guess Ariel is male; Tito; Caballo; females; all sorts of stuff there.

So we've got this dataset from before with those slightly fluffy emails from Mockaroo; let's replace them. I'm just going to overwrite them again with a masking example of first names and last names, hello Tim Cameron, and rearrange the columns if we need to, so we've got our masked values back in. Now, before, we used apply on a single column, the age group, but with axis=1 we can apply a function to an entire row at once. So we can write a function that takes an entire row, generates a new individual using the sex from that row and the ethnic group from that row, puts those values into the row, and returns the entire row, and that overwrites each row in turn. It's a little slow, it seems to take about five seconds per 100 rows, so I'm only running it on the first 50 here, but if you take out this bit, it'll run on the entire dataset.
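Reconstructing that provider as a sketch; the class, column names and email format are my reading of the demo (first initial plus surname at a free email domain), so the notebook's exact code may differ:

```python
from faker import Faker
from faker.providers import BaseProvider

fake = Faker("en_GB")

class NewProvider(BaseProvider):
    def individual(self, sex):
        # Pick a first name consistent with the sex we were given.
        if sex == "Male":
            first_name = fake.first_name_male()
        else:
            first_name = fake.first_name_female()
        last_name = fake.last_name()  # independent of the first name
        # First initial + last name @ a free email domain (Hotmail, Yahoo, ...).
        email = f"{first_name[0].lower()}.{last_name.lower()}@{fake.free_email_domain()}"
        return first_name, last_name, email

fake.add_provider(NewProvider)
print(fake.individual("Male"))

# axis=1 hands apply() whole rows, so the generated individual can depend on
# that row's sex (and ethnic group, in the fuller version of the demo).
def generate_individual(row):
    row["first name"], row["last name"], row["email"] = fake.individual(row["sex"])
    return row

# Run on the first 50 rows only; it's slow (roughly 5 seconds per 100 rows).
df.iloc[:50] = df.iloc[:50].apply(generate_individual, axis=1)
```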
I'm noticing as well that I've not done anything to make sure the last name is generated with the first name, so we're getting last names that are independent of the first name, but that's not a problem. We can see Cameron Patel, for example, is here at Hotmail, so we've now got emails that look like they make a bit more sense. We have also made a lot of assumptions about emails; obviously not every email is just the first letter of a first name prepended to a last name. But that's how we'd do the same thing in Faker, and it's much more useful now, because if we do go the synthetic data route, we can remove the personal data before synthesis and add it back in afterwards with a function like this.

What we're basically doing here, as I've said before, is: if this user is male, pick a male name; if this user is white, pick a British, sorry, English name; if this user is female and not white, choose a Spanish name. It's not quite right; we're building a big, complicated flowchart, and reality is much more complicated than that. How do these opinions correlate with sexual identity, with ethnicity, with age group? There's a lot going on that we're not going to capture just with Faker.

And this is basically a decision tree; there are some details on decision trees here. The gist: if I were to say what decisions affect whether I go play golf today, I might say I only play golf if it's not raining and not windy. That's a gross oversimplification, right, but that's what we're looking for: a simplification that lets us abstract the problem into a decision tree. Likewise, if we say we drive to work only if the weather's bad and otherwise walk, there's a lot more complexity to that: do we even own a car, does the car have petrol in it, can we afford more petrol, is there a tree blocking the road? There are thousands of different things that affect that decision, but it's not realistic for us to build a model at that level of granularity. And if we do start to overfit on the data, we'll see things like: every time somebody has this opinion, this opinion and this opinion, their name must be Joe Allen. We can get to very dangerous positions where the personal data becomes the best indicator of the opinion data. This is effectively what we're going to be doing with synthpop, and it's up to us to make sure we don't overdo it and overfit to the training data.

So next up we've got synthpop. There's a very cool Shiny app; Shiny is sort of a web-based R demo framework, I suppose. I'll send that in both chats. It's a nice way to visualise some basic synthesis of variables: we can create age and sex data, choose a different synthesising method, and the really cool thing, I have to click run, sorry, is that running generates new data and shows these comparative plots. We can see, at the very least, the counts: did we generate roughly the right number of males and females, roughly the right number in each age group? That's a very low bar for comparing the datasets, though, because counts are one thing, but we really need to see whether, say, the age groups correlate with sex the way they did in the original data, and we can't really see that here.
As I'm using Anaconda, I've already got RStudio installed. There is an RPubs page you can look at if you don't have R installed, and I'll share that here as well; it summarises everything we're about to do, but you won't be able to interact with anything other than the plots. Let me move this out of the way, zoom in a little, and set my working directory to the synthetic data code demo folder. There are two libraries I need: readr, which lets us read in CSVs and the like, though I think synthpop might handle that itself, I didn't check, and synthpop, which I've already installed through RStudio. Next we read in our dataframe; that should tab-complete as well, we want the "with personal" file, close the quotes, and that reads in the dataframe, exactly the same thing we've just been doing in pandas.

In terms of best practice, synthpop suggests a maximum of about 12 variables, as it doesn't scale very well, and you shouldn't run it on a dataset with fewer than 500 rows; as with many machine learning algorithms, generally the more data we have, the better it performs. And there's a function, codebook.syn, on the df that gives a quick summary of the dataframe: how many values are missing, what percentage of them are missing, and how many distinct values each variable has. The big flags here are the missing data, which can be okay, and that we have a lot of distinct names and emails; those probably shouldn't be going into our synthesis process, basically.

So to start with, we drop the first name, last name and email columns, and also the religion and age variables that have come back in here, because we've imported the original data: we create a list of those columns, then iterate over the columns, keeping everything that isn't in that list. If we run that, we no longer have those issues. In fact, if I read the data in again and try to synthesize it straight away, it complains: it says there are factors with too many levels, deal with them somehow. So again, I drop those, and I need to convert "has child": if I run this again, you can see has child is logical, and we need to convert it explicitly to a factor, because that won't get caught later. Factors have labels associated with each different option. Next we run this, which converts all the character columns into factors; if I run that again, not only are they now all factors, it also lists what the labels mean for each individual factor, each individual column.

At this point we need to make sure there aren't too many missing values. We're okay here, because the missing values are in a form R is expecting, but missing values are sometimes denoted literally with "missing", or a negative one, or something strange, and synthpop isn't going to pick those up.
So make sure they really are missing values. You should also remove any variables that could be derived from other variables: if we had both "has child" and "age at first child", we could infer one from the other, so they shouldn't both go into the model. We can also set specific rules for the synthesis, for example that children under a certain age shouldn't have a job, or shouldn't have a smoking status. But at this stage, I think that's everything we need, so we should just be able to run the synthesis: syn_df, with an arrow, assigning the output of syn() to that new object; it's not actually a plain dataframe, by the way. And that's it, that's done.

So we now have syn_df, and we can call the summary function on it to see what we've got: just the counts of each of these different variables, and there's quite a lot to see. This part is quite interesting: we can see the method that was used for each individual variable. I think synthpop is doing something fairly clever under the hood to figure out which method goes where; they mostly use "cart" by default, I believe, but age group has been chosen to be sampled instead. We can also synthesize multiple datasets at once: with m = 5 we get five synthetic datasets, and this will be useful when we compare them, because if we've overfit on one dataset, we should see a large discrepancy between the different syntheses.

Then we can call the compare function, compare on the synthetic dataframe and the original dataframe, and we get those really nice plots we were seeing before of the counts. I'm not sure if that's blocked by Zoom for you, but we can see the original data, and the synthetic data is pretty close on counts; if we press enter we get the next ones. Generally the counts look okay, but counts alone aren't really enough; we should also try to figure out whether the variables are correlated with each other in the same way. And if we run the exact same comparison with msel = 1:5, we can compare across all the synthetic datasets we've generated and see how much each one deviates from the norm. Then we can use the synthpop package to write the data back out; there's a write.syn function for that, but I've already prepared the output in the Binder, so we don't need to run it.

For now, let's jump back to the Binder. We've got a question here; okay, I'll save that question, because we've only got 8 minutes left, and I'm just going to do a correlation plot and then we're done. The synthetic Natsal data is available in your Binder; all of it came out of that package. The only discrepancy is that "has child" has come back as ones and twos instead of booleans, but I've got a little quick fix for that here, so that's resolved. Then we go through our synthesis process again, creating those first names and last names and running that apply with the generate individual function. So now we've got Julian Long, male, white; have we got any not-whites, for example? And that's now a fully synthetic dataset created from the observed data. One thing I'd note: normally we'd look at distributions, but they don't really show much more than the counts do in this case. I'm just using alpha to overlay them on top of each other, and things look quite close.
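Dataframes also have a correlation function we can call. As a rough sketch, and assuming (my assumption) that the categorical columns are first encoded as integer codes so there are numbers to correlate:

```python
# Encode every column as integer category codes, then take pairwise correlations.
def corr_matrix(frame):
    encoded = frame.apply(lambda col: col.astype("category").cat.codes)
    return encoded.corr()

original_corr = corr_matrix(df)
synthetic_corr = corr_matrix(synth_df)  # synth_df: the synthpop output, read back in

# Differencing the two matrices gives a rough score of how much correlation
# structure survived the synthesis; values near zero are good.
(original_corr - synthetic_corr).abs()
```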
It's a really nice way to quickly visualise the strength of some of these relationships. Obviously everything correlates perfectly with itself, straight down the diagonal, and because I only generated the first 50 individuals, the names and emails are correlating quite a lot with each other; that's because I masked the rest with the same placeholders, so they generally stay together. But if we look, we see things that make sense: "has child" correlates very strongly with age group, which makes sense, as people get older they're statistically more likely to have a child; relationship status also correlates very strongly with having a child; and age group correlates with relationship status. And we have some things that correlate only lightly: sex seems to correlate with relationship status, but it's almost no correlation at that point.

So that was our original data, and this is our synthetic data; I'll zoom out a little so we can see it more clearly. What we see is that the general shape of these correlations is still the same. For example, "has child" correlates with age group at about 0.3, almost identical to the original data, while relationship status and age group has deviated quite a lot. But we are seeing the same general patterns, and we could perhaps diff the two correlation matrices as a rough measurement of how similar our synthetic data is in these terms as well.

I'm not going to go through these summaries, because we've only got about five minutes left. I do suggest the advanced resources page; it's very academic, these are the advanced tutorials and they take the form of long-form academic papers, but it's where I got most of the material for this talk. And just to conclude: exploratory data analysis is needed to understand a dataset's context, and we can't really do any of the synthesis steps without it. With Mockaroo we can do basic masking, but it struggles once a dataset is too large. With Faker we can write a custom provider, as we saw, and deal with different locales, all sorts of useful stuff. And from what I've learned, synthpop is quite easy to use: I've not used R for four or five years and I managed to wobble through that quite easily. What I'm finding is that it's quite difficult to verify whether the synthesis is good enough, I would say, but there's a lot of detail on that, especially in those advanced resources.