Great, I think we're live now. So hello, everybody. My name is Paul Groves. I'm the lead architect for client onboarding over at Citigroup, and we've also got Andrew Carr on here. Andrew? Hello. I'm head of consultancy for the Bristol office at Scott Logic.

Cool. Okay, so we're going to talk about synthetic data generation today. We won't go too deep; it's quite a broad subject and it's quite deep, so we're going to go over it fairly lightly today. But if there are any questions, we'd love to hear them. You can also contact me; there are contact details on here if you want to learn a little bit more. So let's get going.

Within FINOS there are two synthetic data generation projects live right now: DataHub and Data Helix. DataHub came out of Citigroup, and what we produced was a set of Python libraries that are helpful for synthetic data production. It supports two use cases: one where you hand-write rules and features and then generate data from them, and another where we analyse existing production data sets, build a statistical model of that data, and then use that model to produce synthetic data. And we've also got Data Helix, so Andrew, I'll throw over to you for a quick description of that.

Yeah, so Data Helix is slightly different. It's a synthetic data generator designed to generate large volumes of data really quickly. We tried to do something similar to the modes Paul talks about with DataHub, where you can point it at data, analyse it and then generate from it, but we also designed our own data language to describe the rules of the data, such that someone who wasn't necessarily a developer could write down all the rules and generate data very, very quickly. So I guess the first use case was really for testers who wanted to do load testing on a system and generate that data rapidly. Over to you, Paul.

Cool, great. Actually, more Data Helix stuff; nice full-screen shot, that. So if you look on the left-hand side, you can see sample rules that describe the data: you can describe the types of the columns, you can describe certain rule sets, you can say that this string follows this regular expression, and you can put conditions between fields. And if you go to the FINOS playground, you can live-edit a profile, press run, and it will generate some data based on that profile. So you can have a really quick turnaround and generate quite realistic data quite rapidly. The example is pretty simple, but it shows a bunch of rules and the data they generate on the right-hand side. Next slide please, Paul. Cool.
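To make that idea a bit more concrete, here is a toy sketch in plain Python, rather than Data Helix's actual profile language, of the same concepts: an enumerated field, a string field that must follow a regular expression, a condition linking two fields, and a bounded numeric field. All the field names, patterns and bounds here are made up purely for illustration.

```python
import random
import re
from datetime import date, timedelta

# A hand-written "profile" in the spirit of (not the syntax of) a Data Helix profile.
ID_PATTERN = re.compile(r"[A-Z]{2}[0-9]{10}")                  # regex constraint on a string field
COUNTRY_TO_CURRENCY = {"GB": "GBP", "US": "USD", "DE": "EUR"}  # enum plus a cross-field condition

def generate_row():
    country = random.choice(list(COUNTRY_TO_CURRENCY))          # enumerated field
    instrument_id = country + "".join(random.choices("0123456789", k=10))
    assert ID_PATTERN.fullmatch(instrument_id)                   # the string follows the regex
    return {
        "instrument_id": instrument_id,
        "country": country,
        "currency": COUNTRY_TO_CURRENCY[country],                # condition between two fields
        "trade_date": (date.today() - timedelta(days=random.randint(0, 6))).isoformat(),
        "price": round(random.uniform(1.0, 500.0), 2),           # numeric field with bounds
    }

if __name__ == "__main__":
    for row in (generate_row() for _ in range(5)):
        print(row)
```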
Okay, so synthetic data: what is it? There's a lot of talk about it these days, and simply, anything where you can algorithmically generate data, that's what synthetic data is. As we've touched on, there are a lot of ways of doing this. Typically, in financial institutions, we're looking towards synthetic data because of data privacy: how can we populate our test systems with realistic-looking data that still has full privacy? A lot of us have gone down the data redaction route and then discovered issues there, particularly around re-identification.

Many of us have also gone down the full anonymisation path, and often we weren't particularly happy with the results: for those anonymisers to fully anonymise, we've often ended up with just trash data at the end of it that has no real use within a system. So what we're trying to do with synthetic data is produce realistic data that has the characteristics of the real data, but where there's no way we can breach privacy and no way we can accidentally recreate real people synthetically.

It's not a new thing; it's been around a long time. A lot of the work in synthetic data that we see now was pioneered in the 90s by the US Census Bureau, where what they wanted to do was share out data sets without revealing any of the actual real census data, because it's quite confidential: you have people's salaries, race, religion, lots of things people may not want put out there. There were also issues of incomplete data, so they wanted to statistically populate missing data with realistic values, and they spent a lot of effort doing that. On the opposite side of things, of trying to create believable worlds based on rules: anybody around my age or Andrew's age who used to play games on ZX Spectrums and BBC Micros probably remembers the game Elite. That again was a form of synthetic generation, more procedural generation: how can you create a lifelike galaxy that you can go around and trade in, and keep generating it consistently? They were really quite clever algorithms back then, and that was also, I guess, the genesis of that kind of procedural generation of world creation. So there are quite different ways you can use these tools, and it goes way back; it's nothing new.

As we touched on, there are two big use cases we're seeing in financial services right now. There's GDPR and all the laws and regulations we have around protecting PII data, plus the problems of anonymisation and re-identification that a lot of us have found. Then on the other side you have machine learning. Now, you probably wouldn't use synthetic data to train your production ML or AI, but in the data engineering aspects of building an AI or ML pipeline, your developers might not be allowed the real data, so you need stand-in data sets that look vaguely realistic that you can use in your development systems. You might also have, say, a team of people working on tagging data before the real production data set is available, so while the data engineering work is going on you need some kind of stand-in data sets until you can get the data you need. So those are the two different use cases we're seeing a lot of around here.

So we've got three main approaches. Redaction, which is simply removing sensitive information; there you still have the risk of re-identification. Anonymisation, which is much more a process where we show re-identification cannot happen, and there's a bunch of tooling out there that can help you with that.
Or we go to synthesis where, again, there are two routes: you can do it procedurally, where you handcraft your rules and procedurally generate, or you analyse production data, observe the patterns and generate from there. And that's roughly where we are between Data Helix and DataHub: DataHub is moving more towards the analysis side, and Data Helix is much more on the procedural side. So Andrew, do you want me to hand over to you on this one?

Yeah, I think this one is my slide, but do you want to talk first? No, you go for it. OK, so I guess there are lots of problems with redaction. The challenge with redaction is that by the time you've redacted enough that re-identification can't happen, you've often had to remove so much that the data becomes unusable and not useful for the task at hand. You end up with just a mush of data that doesn't actually represent the original data in any form whatsoever. So there are real challenges with redaction. The other problem is that it takes a long time to do, and then, as Paul says, you have to verify that you can't then re-identify the individuals in the original data. Paul, if you go to the next slide. Do you want to talk about differential privacy?

Yeah, differential privacy. What we tend to mean when we talk about differential privacy is the point where you're basically creating a statistical model of the data. That statistical model should be general about the data: it should have its properties, the distributions, any constraints in there, all the facets of that data. But you must not be able to re-identify an actual individual from that data set. So if we look back at that HR data example: if you simply redacted people's names out, you could cross-reference the rest with something like an HR phone book, and if you hadn't removed things like salary information, you'd quickly work out what everyone's salary was, or other confidential things you weren't meant to find out. So differential privacy is quite critical to synthetic data generation, particularly when we're doing it by analysing existing data sets.

If you look at a typical synthetic data flow where you're doing the analysis stage, you'd often start with your production data. Your production environment is one boundary, and the non-production environment is another. You start by querying your production data, and you first remove any easily identifiable PII attributes: names, social security numbers, addresses, all those things. You don't want those in your data set at all. You then do your analysis on it, to find out what the constraints, distributions and other interesting properties are, and produce this DP data file, a differential-privacy data file that is a statistical model. That should then be safe to transfer to a non-production environment, and then you're into your synthetic generation. That's your analysis mode. Then you've got your generation mode, which is where you take that statistical model and put it into something that can produce data from it that looks like the original data set.
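Very roughly, that analysis-mode/generation-mode split might look like the sketch below. It is deliberately naive: it only captures per-column category frequencies and numeric ranges, it adds none of the differential-privacy noise, correlation modelling or dimensionality handling a real tool would, and all the function names, column names and file formats are invented for illustration.

```python
import json
import random

import pandas as pd

PII_COLUMNS = ["name", "address", "ssn"]  # illustrative: strip obvious identifiers first

def analyse(csv_path: str, model_path: str) -> None:
    """Run inside the production boundary: build a coarse statistical model."""
    df = pd.read_csv(csv_path).drop(columns=PII_COLUMNS, errors="ignore")
    model = {"columns": {}}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # continuous value: keep only summary statistics
            model["columns"][col] = {
                "kind": "continuous",
                "min": float(df[col].min()),
                "max": float(df[col].max()),
            }
        else:
            # classifier / discrete value: keep the category distribution
            freqs = df[col].value_counts(normalize=True)
            model["columns"][col] = {
                "kind": "discrete",
                "values": list(freqs.index),
                "weights": [float(w) for w in freqs],
            }
    with open(model_path, "w") as f:
        json.dump(model, f)  # only this model file crosses into non-production

def generate(model_path: str, rows: int) -> pd.DataFrame:
    """Run in the non-production environment: sample new rows from the model."""
    with open(model_path) as f:
        model = json.load(f)
    out = {}
    for col, spec in model["columns"].items():
        if spec["kind"] == "continuous":
            out[col] = [random.uniform(spec["min"], spec["max"]) for _ in range(rows)]
        else:
            out[col] = random.choices(spec["values"], weights=spec["weights"], k=rows)
    return pd.DataFrame(out)
```

The point of the sketch is simply that the model file, not the raw data, is what crosses the environment boundary; a production-grade implementation would also need to capture relationships between columns and add noise before the model leaves the production side.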
And when you've done that, you might then choose to enhance it. Remember, we redacted all the PII attributes out of it: we removed people's names and addresses. So now let's add fake names and addresses back in. Now you've created this lifelike representation of the original data set, but everything it refers to can't possibly have existed in real life.

You've got to be a little bit careful in this space, particularly with the data analysis. There are lots of different approaches to doing the analysis that produces the DP data file. Anyone who has been anywhere near machine learning will understand the curse of dimensionality: if you keep so many dimensions of the data, you can end up where every combination traces back to one record and one record only. So when you're doing your analysis, you have to be quite careful about which data dimensions you're actually interested in, and remove the ones that are not important.

Then, I guess, we've got procedural generation, which is quite simply: you author some rules, you produce some data, you do something with it, you put it into your database. That can be a lot more involved, because a developer has to sit there and handcraft the set of rules. It's a really useful approach where you might not have any data; it's a brand new system, you've got no data, so you just need to start generating something for your test system to start working with. Or you need a lot of data very quickly and it only has to be relatively simple. So I'll hand back over to you, Andrew.

Cool. So if we take a step back: obviously we've highlighted that there are different approaches to generating synthetic data, and the question is when to use which approach. I'm going to chat a bit more about the rules-based approach, where you generate synthetic data from a bunch of rules, and when that's useful. If you pool all the different use cases for why you want test data, you can boil them down, and this is a bit of a generalisation, but it often holds true. Sometimes you want low-volume, highly accurate data, and that's typically to test functionality in the system; for that, the data has to be absolutely accurate, otherwise it might not trigger the correct functionality. Sometimes you want high-volume, reasonably shaped, reasonably accurate data, and that's often to test load: can the system deal with this throughput, can it process data at a certain speed, can the system respond given a certain volume of data? That tends to be the use case where you'd typically do rules-based data generation. Depending on the use case, you should consider which approach best supports it. What I'm going to do is walk through an example of how a simple use case for volume data can get complex very quickly using a rules-based approach. So I generally recommend that if you're going to go for a rules-based approach, use it when you want large-volume data that only has to be reasonably realistic. So if we go to the next slide: like I said, we're going to do a simple example, and it's a financial services, or capital markets, example, I guess.
Imagine you want simple test data with a trade ID, a stock ID, a stock name, a price, and a trade date-time. If we go to the next slide: if I used really simple rules and said the trade ID is an integer, the stock ID is a string, the stock name is a string, the price is a float, and the trade date-time is just a date-time, then clearly, I think even people outside financial services can look at that and see it's a nonsensical bit of data. It's valid according to the rules we just gave it, but it's not really usable. Even for functional testing this probably wouldn't be very usable; if you look at the price field, it's not to two decimal places, so it could easily break part of the system in even basic checks.

So if we go to the next slide. If we try and tighten up some of these rules, we can generate slightly more realistic data. As well as giving the type of each field, we can give rules about the field: we can say the stock ID should be taken from an enumeration, the stock name should be taken from an enumeration, the price should be a float between two boundaries, and the trade date needs to be greater than one week ago but less than today. If we then look at what that would generate, the data looks much better, but it still has a lot of challenges. Clearly the stock ID and the stock name don't match, and the price still has issues because there are too many digits after the decimal point. So it's getting a bit more realistic, and even with some simple rules we've got much closer, but it's probably not usable for functional testing, probably not usable for volume testing, and definitely not usable for machine learning analysis.

So if we go to the next slide, and tighten those rules up a little bit more: we basically put in a condition to line up the enumerations, so the stock ID matches the stock name, and give two decimal places for the floats, and we keep the trade date-time rule the same. If we go to the next slide: now we're starting to get data that looks a little more realistic. There are still a lot of problems with it, but I'd imagine this would probably be fine for volume testing. If you're trying to do functional testing, though, you can clearly see in this example that the stock price of BT has varied wildly. That's unlikely to work if you're trying to feed it into a system that uses the stock price, because the system would go, well, that stock price has jumped hugely. But if you're doing volume testing, maybe that's okay; maybe the shape of the data is accurate enough for volume testing.

And as you can see, even in this really simple example with five columns, you need a reasonable set of rules in place before the data starts to look realistic. We worked with one particular client that had very large files; for, I think, around 150 columns, they ended up writing over 3,000 rules to make the data look realistic enough to do volume testing, and even then it wasn't anywhere near realistic enough for functional testing. So that gives you a feel for how quickly, with a rules-based approach, you can get to the point where you've just got too many rules to manage. Okay, next slide.
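As a rough sketch of where that worked example ends up, here is the tightened rule set written as plain Python rather than any particular tool's rule language: an aligned (stock ID, stock name) enumeration, a bounded price rounded to two decimal places, and a trade time within the last week. The stock list and the price bounds are invented for illustration.

```python
import random
from datetime import datetime, timedelta

# Aligned enumeration: picking one entry keeps the stock ID and name consistent.
STOCKS = [("BT.A", "BT Group"), ("VOD", "Vodafone"), ("LLOY", "Lloyds Banking Group")]

def generate_trade(trade_id: int) -> dict:
    stock_id, stock_name = random.choice(STOCKS)
    return {
        "trade_id": trade_id,                                    # integer
        "stock_id": stock_id,
        "stock_name": stock_name,                                # matches the stock_id
        "price": round(random.uniform(50.0, 500.0), 2),          # bounded, two decimal places
        "trade_datetime": (
            datetime.now() - timedelta(seconds=random.randint(0, 7 * 24 * 3600))
        ).isoformat(timespec="seconds"),                          # within the last week
    }

for trade in (generate_trade(i) for i in range(1, 6)):
    print(trade)
```

Note that even here each row is generated independently, so a given stock's price still jumps around wildly between trades, which is exactly the problem called out above for functional testing.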
Yeah, as I said, with 150 columns we saw over 3,000 rules really quickly, and bear in mind this is a really simple case: I've only looked at the challenge where each row is independent of the others. If you were trying to do something like generate a realistic-looking bank account, the rules would be way more complicated. You'd have to write rules about how much money is being spent, you'd want the rent to come out at the same time every month, and you might have dependencies between things: it's getting near the end of the month, the person hasn't got much money in their account, so maybe they'd withdraw £300 in cash. Next slide please. I think that should be here; sorry, I went the wrong way. Yeah, so as we talked about, for a realistic bank account you want the salary coming in on the same day every month, the same outgoings such as rent, outgoings that are hopefully less than incomings, realistic amounts (a coffee at Pret shouldn't be £300), and a realistic number of events. You end up with a stateful system and a rolling account balance. When you get into that situation, you quickly come to the conclusion that you need to start hand-writing code. In fact, if you want the data to be as accurate as the application, you probably end up having to write as much logic in the generation of the data as the business logic you have in the application originally. I think that's over to you now, Paul.

Yes, so I guess some examples we've been looking at ourselves: providing realistic test data for our client onboarding platforms, and those are quite complicated onboarding requests handling all the KYC, very complicated multi-nested data. Also generating risk and P&L with realistic values, by analysing the production P&L data and then using that to make sure things like tenors, currencies and curves all line up properly in a realistic way, with suitably similar values. We've also been generating portfolios of trades with designed characteristics, a little bit scenario-based, so we can generate a portfolio of interest rate swaps and add different characteristics into the generation of it. Other things we've looked at are card payments between merchants and cardholders, again by analysing those data sets; that's quite an interestingly different kind of analysis you have to do. The final area we find it helps very much with is exploratory relationships with vendors and card providers: normally, before anybody can get near any of your data, you have to get through signing contracts and big NDAs. If we instead just share purely public-domain data sets between each other that are structurally correct, things become a lot easier very quickly in terms of that exploratory relationship.

So I was hoping to demo this today, but unfortunately my main desktop PC has died a death with a blue screen, so I'm going to have to walk you through it rather than do a live demo; luckily we've only got a few minutes left anyway. To install DataHub, it's Python-based, so anyone who knows Python can just do a pip install of the DataHub core package and bring down the library. Then, very simply, we can handcraft some rules in a similar spirit to Data Helix. If we wanted to create, say, a set of accounts, we could quickly do it like this: we say we want a region, here's some data to choose from, and here are weights for that data. Then we want a country; DataHub has built-in types for things like countries and currencies, so it understands what they are. We say, now give me a country, and base the country on the region field, so it will pick countries appropriate to the region. Then we want an industry — are you in retail banking, finance, agriculture, whatever — and then we generate a very specific industry code based off that industry. Then we generate a legal name for the record, so we'll call it ABC Mining Company Limited or whatever; there's a fair bit of work in there to generate appropriate, believable names based on the industry and the country the record is in. And then we generate an LEI code and other identifiers using various functions. It's all in Python, so it's very easy to extend.
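DataHub's real rule API isn't reproduced here, but as a plain-Python sketch of the shape of those handcrafted rules (a weighted region, a country conditioned on the region, an industry with an industry code, a legal name built from the industry, and an LEI-shaped identifier), it might look something like this. All the lookup tables, weights and names are illustrative only; a real rule set would lean on DataHub's built-in country and currency types rather than hard-coded lists.

```python
import random

# Illustrative lookup tables only.
REGIONS = (["EMEA", "APAC", "AMER"], [0.5, 0.2, 0.3])        # values and weights
COUNTRIES = {"EMEA": ["GB", "DE", "FR"], "APAC": ["JP", "SG"], "AMER": ["US", "CA"]}
INDUSTRIES = {"Mining": "B", "Retail Banking": "K", "Agriculture": "A"}  # toy industry codes

def generate_account() -> dict:
    region = random.choices(REGIONS[0], weights=REGIONS[1], k=1)[0]
    country = random.choice(COUNTRIES[region])                 # country depends on the region
    industry, industry_code = random.choice(list(INDUSTRIES.items()))
    legal_name = f"{random.choice(['ABC', 'Northern', 'Global'])} {industry} Company Limited"
    return {
        "region": region,
        "country": country,
        "industry": industry,
        "industry_code": industry_code,
        "legal_name": legal_name,                              # believable name tied to the industry
        "lei": "".join(random.choices("0123456789ABCDEFGHJKMNPQRSTVWXYZ", k=20)),  # LEI-shaped, not a valid LEI
    }

print(generate_account())
```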
Now, if we look at the analysis side, there are two different functions. The generate-model function at the top is what you run inside your production domain. You say, take this CSV file, or whatever the input stream is, and create this model file JSON as the output. You give it, essentially, the classifiers, which are the discrete values, and the continuous values: so region, country and SIC code are classifier-type columns that you want it to analyse and work out how they're distributed, and then the continuous data values are things like assets under management and estimated value. We also support a plug-in model, and this code is a little bit old now, so there are different analysis modules which are tailored towards different data sets. There will also be support soon, once I've finished the PR, for multi-table analysis as well, so there's a bunch of stuff in there.

Then what we can do, very quickly, is generate from that model. So we say generate from model, using what's called a fast bucket model, which is very, very quick; give it that model file JSON, which has the statistical model in it, and it will go and generate. And, as we said when we talked about enhancing the data, we can then add back in things like a name and an LEI code, which we removed from the original data set, so we add those extra attributes back on and it's still fully synthetic. So that's briefly what DataHub does; sorry I couldn't do a live demo, but things went awry.

DataHub is also incredibly easy to extend, so if there's a function you want that isn't there, you can add it with a couple of lines of Python. Here's one where we're adding a little message that will say hello to whatever the name is; it just takes a couple of lines of code to do that. So if there's something missing that you want, it's really easy to extend.
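As a guess at the shape of that kind of extension, written as a free-standing function rather than against DataHub's actual plug-in or registration API (which may well differ), it could be as small as this:

```python
# A stand-alone sketch of the two-line extension described above:
# a custom generator function that greets whatever name it is given.
def hello_message(name: str) -> str:
    return f"hello, {name}"

# Used alongside other generated fields:
record = {"legal_name": "ABC Mining Company Limited"}
record["greeting"] = hello_message(record["legal_name"])
print(record)
```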
Right, so what's next for DataHub and Data Helix? The two projects, we're bringing them together. We're also investigating, in terms of the data specification, how we integrate with the Alloy/Legend contribution within FINOS, which is great; that's got a whole markup language, sorry, functional language, that describes data sets. So we're looking at how we can integrate with that, so we can actually start generating data sets from Legend specifications.

In the next version of DataHub we've got some extra bits. At the moment, if we're doing analysis of data sets, all our models support classifiers with continuous values; we haven't got anything that supports continuous values only, so that's a bit of a gap, hence we've spent time integrating CTGAN. That's another big open-source project that uses GANs to analyse data, so we're integrating that and making it seamless within the product. There's multi-table support too, so you just throw it a bunch of files, tell it what the foreign-key and primary-key relationships are, and it will be able to use that for its analysis and generation as well. We're also building in more financial types, understanding things like CUSIPs, ISINs, curves and all those kinds of things, as well as data type prediction in the analysis: at the moment you have to say what each column is, so we're building data type predictors that can look at your data set and go, that looks like a CUSIP, that looks like a currency, that looks like a name, just to help move things along. And we're looking at Spark integration for really big data set generation, so we'll use PySpark to generate on a cluster. Down the line, probably into next year, we're going to look at how we support agent-based modelling as well, so if you're trying to do simulation, how we can synthetically generate the actors for you.

So, cool. If any of this sounds interesting, please reach out to us; the details should be around, though they're not in this deck. We're always looking for help, so if you want to get involved, we're looking for anybody who can code in Python, particularly if you've got any kind of data science background; that would be fantastic. And if you want to help us particularly on the Alloy/Legend integration, that would be really great, so reach out to us as well. If you code in Python and want to get involved, please do; even if you can't code in Python, we'll teach you along the way. Andrew, have you got anything to finish up with? No, that was great, Paul. Cool, awesome.

So that's it. I think we've maybe got a couple of minutes for questions, so let's see what we have. If there are no questions, I guess we can wait one more minute and then wrap up. Cool, great, thank you everybody. I'll put my contact details in the chat, so please do reach out to us if you want to; that'd be great. Thanks for coming, everybody. Thanks for attending, everyone.