Okay, I'm going to start the webinar now. This is going to tell you about our new tool called QA My Data, which is a health check for numeric data. We're going to spend about half an hour speaking and we're going to give you a live demo of the tool as well. Let's introduce the speakers to start with. I'm Louise Corti, the Service Director for Data Publishing and Access. We have Christina Magna, our Senior Data Creation Officer, and Miles Alford, our Hadoop assessment systems engineer. All of us have been involved in building this project in various roles and we're very pleased to be able to show you the final product. We're going to cover the origin of the tool, very briefly: what's useful to check in numeric data, how we developed our tool, what tests we looked at and which are available in the tool, a quick demo, a very short techie overview, and then something about future plans. All of the slides will be made available after the event, so we'll share them with you. So why do we need to assess data quality? First of all, if we're a researcher who's created a data set and we're submitting it to a repository, it's useful: we don't want to send dirty data over. If we're a repository receiving data, we want to be checking the quality of what we receive. Peer reviewers are checking data that comes in for published analysis, maybe doing some reproducibility work on the code, but they want to look at the data set as well. And for data publishing, checking the quality of data supports the FAIR principles: findable, accessible, interoperable, reusable. It's also useful to check the quality of data when you're using it for the first time: you download a new data source and you want to check whether it's got errors in it, or whether it's good enough for you to use, before you begin to analyse the data.
And then, if you're trying to teach students about data quality, or about creating a numeric data set, it's very useful to show them the kinds of things that can go wrong and the kinds of things they need to look out for. We want our data to be healthy and safe. So, our tool: we had a year or so of project time to develop a lightweight, open source tool that helps us quality assess data. We call it a data health check because what it does is automatically identify some of the most common problems in data submitted in the disciplines that use numeric methods. Coming from a data service, we deal with a lot of survey data and numeric databases, and this kind of tool is very useful to help us check them, given the volume of data we have coming in. We also have a self-deposit repository with lots of data flowing in, and again this will help us do some pre-checking before data is actually made available. And we like the idea that it can help set data quality profiles for people who are publishing data. So we have slightly different standards for our self-deposit repository and our curated repository, which deals with some very large-scale government survey data sets. We may allow fewer errors and expect a near-perfect data set for our government surveys, where some of the smaller research experiments might be allowed a few errors, and we'll show you what we mean when we say errors. So we developed QA My Data, and the idea is for it to be extensible, so people can build on it; it doesn't cost anything and it's quite easy to deploy. First of all, if the polls are working, and I'm not sure they are, I just wanted to know how many of you are actually in the business of appraising data. Let me know if you can do the poll; what you need to do is click on one of the options. Okay, I'm sorry, the poll's not working, I'm going to have to shut that.
I just basically wanted to know how many of you appraise data on a regular basis, but never mind. Okay, moving on from that, I'm hoping that some of you in the audience are in the business of checking data. If we're publishing data, and that's where our archive comes in, we want to share clean and well-documented data, and we want to share it under the right conditions, so we're looking at both of these aspects. There are quality issues that arise in the description of data and in the data itself, and we also want to be sharing data carefully and properly, under the conditions under which we collected it, and we are looking at privacy impact assessments under, for example, the data protection regulation. So a privacy impact assessment, or an appraisal of data looking for identifiers, is important, and this helps us to find the legal gateway for access. This tool could be used to look for some of the common things that may be left in data when it comes to disclosure risk. Here at the data service we've had a three-tier data access policy for many years. The first category is open: we really don't find anything disclosive in there at all; it's generally highly aggregated data or small teaching data sets. The majority of our data sets fit into the safeguarded category in the middle, where we're registering and authenticating to get in. And then finally we have our controlled data, which is a space to use personal data through an approved legal gateway. In that case we're not worried about disclosure risk, because people can't take data out; they can only take reviewed outputs. So, current ways of checking data. When we did some work trying to assess what other archives do, and what people who work with data do, we found that there are lots of different ways that people do things. We're generally looking at the structure of data, and we're looking for incorrect, missing or inconsistent values in a data set.
We're also looking for unanticipated, accidental disclosure risk. For example, we may get sent a data set and there's something accidentally left in there that shouldn't be, some kind of administrative variable. That's the kind of thing we need to check for. Then we locate these issues and we decide what we're going to do with them. Either we can clean and treat them, which may mean going back to the data owner to say, look, we found this in here, what do you want us to do about it? Or we can just flag the errors. When you're dealing with massive streams of data coming in, administrative data or transactional data, one probably needs to flag the errors; you're probably not going to clean everything up as it comes in. What we found is that many data creators and data publishers are using manual methods. Although there are data integrity rules used in the data collection process, for example in CAPI instruments, where the integrity is built in so you can't put in the wrong codes, if you're not using those methods you're likely to have errors in your data. Some statistical software commands will help: you can run frequencies, and you can make various checks by running basic commands. But there's a lot of eyeballing of data, done manually: checking things, looking at variables that you're worried about, maybe running frequencies, looking for different kinds of outliers. A lot of it is manual. So how can a tool help? Well, first of all, we can flag the issues. What we'd like to do is run a report on the things that we're looking for anyway, and have it flag up where the errors are. It doesn't necessarily have to solve them all for you or clean them; it tells you where the errors are, and that can be very useful. We want to deploy it as a service for our self-deposit repository, so that people can do a health check on their data before they submit it.
And again, the smaller teams and researchers doing their own research often don't really know what data quality means until they get a check back from a repository saying, look, this has got problems, it's a bit messy. So actually having those checks up front, so you know what's being checked, we think is very useful. This tool can also be deployed into data publishing pipelines: if you're streaming data in, or bringing the same data in every night and building it up, you could run these checks for various known things. I was going to ask you what your biggest problem with data is, but I don't think the poll is working, so sorry. What are we checking? Well, we're checking to make sure the values are correct. Do they make sense? We're looking for outliers that are erroneous, or outliers that shouldn't be there: very low frequency counts, or possibly disclosive outliers such as a very high income, things like that. We're checking the formats of some of the values entered, and I'll give some examples of that. And of course there's the challenge of missing data; nobody in the data world likes missing data, particularly when it's not defined, so we'll also be checking for that. Here's a very typical example of a spreadsheet. What's the problem with this data? I'll give you a minute to look at it. It's highlighted in red, so it's fairly obvious what's wrong, and there are three things wrong there. First of all, we have a pregnant man. Now, this was a survey conducted in the 60s with binary coding, so you probably wouldn't expect a pregnant man. The age is a problem: we've got a minus 10, and we've got a very old person of 126. Well, that could be right, but we might want to check it. And then we've got clearly wrong data: some timestamps put in the occupational coding field. So those are the kinds of things; they're probably accidental, or errors in data collection.
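The spreadsheet problems just walked through amount to simple range and consistency rules. A minimal sketch of that idea is below; this is an illustration only, not QAMyData's actual implementation (the tool itself is written in Rust), and the column names and bounds are made up for the example:

```python
# Sketch of a range/consistency check: flag values outside user-set
# bounds and logically impossible combinations. Illustrative data only.
rows = [
    {"id": 1, "sex": "male",   "age": 34,  "pregnant": "no"},
    {"id": 2, "sex": "male",   "age": -10, "pregnant": "yes"},  # two problems
    {"id": 3, "sex": "female", "age": 126, "pregnant": "no"},   # suspicious age
]

def check_rows(rows, age_min=0, age_max=110):
    issues = []
    for row in rows:
        if not (age_min <= row["age"] <= age_max):
            issues.append((row["id"], f"age out of range: {row['age']}"))
        if row["sex"] == "male" and row["pregnant"] == "yes":
            issues.append((row["id"], "inconsistent: pregnant male"))
    return issues

for rid, problem in check_rows(rows):
    print(rid, problem)
```

As in the tool's report, nothing is cleaned automatically; the check only flags where the problems are, and the decision on what to do is left to the data owner.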
A second example is there: we have a timestamp where clearly something's gone wrong in the second row. We'd want to check for these formats to make sure that doesn't happen, or that it gets rejected or flagged up. And then finally we have our nice problem of missing data. There's a lot of missing data there, but we don't know why it's missing. Was it a don't know? Was it a refusal to respond? Was it just missing? Was it suppressed by someone? We don't really know, and then we need to decide what we want to do: whether we treat it all as missing or indicate some other flag in there. But we need to do something with that data. Now, we have four review domains that we're looking at. First, we're checking the basic file itself, so errors in various things to do with the file. Second, we're checking metadata. Third, we check data integrity: that's the data itself, the numbers, the values. And fourth, we're checking for disclosure review as well. We scoped the tests by looking at our own procedures to see what we checked for. We prepared some dirty test data sets with known errors in them and ran them through. Then we reached out to other colleagues, archives and researchers, asking them what they check for, and when we ran some workshops with the tool we got some very useful feedback on additional things to look for. So, number one is our basic file checks. Our tool is checking whether the file opens and whether it's got a bad file name, which might have spaces or something in it. It's using regular expressions (regex) to do this, and we'll show you later how to formulate them and how to make use of them when you're checking data. The second thing we look for is metadata, and there are a number of things we can check here. Again, we're going to show you a live demo of these kinds of things.
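The bad file name check just mentioned can be sketched with one regular expression; the pattern below is illustrative, not the tool's shipped default:

```python
import re

# Sketch of a "bad file name" check: the file name must match a
# user-specified pattern. This pattern (letters, digits, hyphens,
# dots only) is an illustration, not QAMyData's actual default.
GOOD_NAME = re.compile(r"^[A-Za-z0-9\-\.]+$")

def bad_file_name(name):
    """Return True if the file name fails the pattern check."""
    return GOOD_NAME.match(name) is None

print(bad_file_name("survey_2019%final.sav"))  # underscore and % not allowed
print(bad_file_name("survey-2019.sav"))
```

Loosening or tightening the rule is just a matter of editing the pattern, which is exactly what the configuration file shown later in the demo allows.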
So, just to run through: we're checking for missing variable labels; we may not want any labels that are not specified. We may be looking, as I said before, for missing values that have no label; we want to know which ones are not labelled. We're looking for odd characters in names and labels: it may be that they're misspelt or don't mean anything, and you want them to be clean and purposeful. We're looking at the length of variable and value labels, and we can specify a maximum: the user specifies the maximum and it will flag the ones that exceed it. We're looking for spelling mistakes in the variable labels, and also in the value labels, using a dictionary file that's plugged in; we put an English dictionary file in, but you could plug in a dictionary file for another language. Next we have our data integrity checks. This is checking the actual spreadsheet, the data itself. It reports on the number of numeric and string variables that it finds. It looks for duplicate IDs; we found in the clinical trials world that's often the case, with duplicate IDs in there that aren't supposed to be, so you can specify whether you want to look for that. We're looking for odd characters in string data: we specify the characters we don't want to find, and that could be any characters. We're looking at the percentage of values that are missing for each variable, so we might say, I want to know if any variables have more than 25% missing values; let me know which ones they are. And we're looking for spelling mistakes in string data, which is useful when you've got some text within a field. Then finally we have our disclosure checks. We can specify a particular threshold value: I don't want to find any values with a count of less than five in categorical data.
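The percentage-missing rule in that list is easy to picture in code. A minimal sketch, not the tool's implementation; the data and the empty-string missing marker are made up for the example:

```python
# Sketch of the "percentage missing per variable" check: flag any
# variable with more missing values than a user-set threshold.
data = {
    "age":    ["34", "29", "", "61", "45", "", "", "52"],
    "income": ["21000", "", "", "", "", "", "30500", ""],
}

def flag_missing(columns, threshold=25.0, missing=""):
    flagged = {}
    for name, values in columns.items():
        pct = 100.0 * sum(v == missing for v in values) / len(values)
        if pct > threshold:
            flagged[name] = round(pct, 1)
    return flagged

print(flag_missing(data))  # both variables exceed the 25% threshold
```

The threshold is the kind of value a user would set in the configuration file rather than in code.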
Or I don't want to find anything that's very high. So you can actually set these thresholds if you're looking at things like age or income; there are many things you can set, and frequency counts in particular. Again, we're using a regular expression pattern search to look for particular direct identifiers. And this is really nice, because regex can help you formulate a generic pattern for things like postcodes or telephone numbers, anything really that has a formulaic structure. And then, although the next one says "to be done", it is actually done: that is looking for direct identifiers, or for example named entities, in string data. What you're doing there is plugging in a list of stop words that you don't want, so you could have a list of place names, countries, real names. These checks are incredibly resource intensive, and you can ask Miles questions about that later, but it's important to realise that you might want to run them one at a time. Just to say that this is really only looking at very basic direct identifiers; it's not looking for combinations of variables and identifying information. We do use sdcMicro, which is an R package, to look for uniqueness in rows of data, and we run that separately. It's not part of this tool, but we do run training on that as well if you're interested. So this is an example. The tool allows us to enter SPSS, Stata, SAS and CSV files and run the checks. It's a very modular design: you have various checks that you run, you can change them, you can set a threshold, and if you don't want to run a test you just comment it out in the configuration. That's just an example of a configuration file, which we can show you later on. The first one is looking for odd characters in variable labels, and you can see those are the odd characters that we're specifying here.
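The regex-based identifier scan described above can be sketched in a few lines. The email pattern here is deliberately basic, as in the talk, and is an illustration rather than the pattern the tool ships with:

```python
import re

# Sketch of a regex-based direct-identifier scan over string values.
# A simple email pattern; patterns for postcodes, phone numbers and
# other formulaic identifiers can be found or refined on the web.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

values = ["prefers not to say", "contact: jo.smith@example.org", "n/a"]

hits = [(i, v) for i, v in enumerate(values) if EMAIL.search(v)]
print(hits)  # flags the row containing an email address
```

A stop-word scan for named entities works the same way, except the pattern is replaced by membership in a user-supplied dictionary of names, which is why those checks are so much more resource intensive.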
And then at the bottom you'll see it's very straightforward to run: there are just two lines of instruction that tell you to run the checks. This is an example of a regular expression. Again, we're going to ask Miles to say a bit more about this, but that is a very basic expression for an email, one that will pick up things that look like an email address. And if you go onto the web and search for a regex, you can find one for pretty much anything: telephone numbers, zip codes, many, many things that have a regularity to them. We have reporting in JSON, which is used to build a very detailed report, and then we have a very straightforward HTML report that you can read at a glance. Basically it tells you which tests have failed and which have passed, and for the ones that have failed, you click on the row and find out where the problem is. So it's supposed to be very straightforward. This is just an example of a report that's generated: I fed in my file, it's given me some summary information, and it's found some bad file names. It's found something it doesn't like, so I click on the row and it tells me which variables are guilty, in this case odd characters in variables. That's just an example of what it does, and we'll show you more of that when we run it live. Okay, right, we're going to do a live demo. I'm going to pass the presenter role over to Miles Alford, who's going to give you a live demo, hopefully, so just bear with us for a minute while we exchange screens. Hello, I'm Miles, and I'll be presenting the live demo. Just for clarification, so everybody understands what we're looking at on the screen right now: over on the right side of the screen we have a file explorer, where we have the directories config, data, dictionaries and report.
These are just ordinary folders, and they're set up in the way that QAMyData prefers to see the directory structure. On the left over here, we have a command prompt. A lot of people get kind of scared when they see a command prompt, but it's very simple; we'll go through everything step by step and explain everything along the way. If I type in the command tree, we can see that what we're looking at is the same as the file view. From here, we can have a look at what the QAMyData command can do. We're typing in QAMD help, which prints the help message; there's a complete help context menu so that you can go through and figure out how the commands work and how to run them. So, for example, in this instance we'd like to run the tool, so we can run the QAMD help command and then type run to get context-specific information about the run subcommand. We can see that there are a load of flags, which are like options, where we can pass information into the system and tell it what we want to do and what we don't want to do. We can specify a separate configuration file, or tell it to write out in a different file format, which will make a little bit more sense once we've gone through. From here, what we can do is type in QAMD run and then give it the file name. At this point, we're going to be looking at a file over in the data directory, in data teaching. I've got the command here; I'm going to copy and paste it rather than type it out myself. So if we run this command here, where we're saying QAMD run and then the file name that we want to run it against, we're also setting the output flag with the file name where we'd like the report to go. What this is going to do is create this report file here. So I'll delete that, and then when we hit enter in the command window, a progress bar will pop up.
A few seconds later, the file will have been checked, and over here we will have the report. If we look at that in a browser window, we can have a look through the report and see some basic metadata about the file, such as the file name at the top, how many cases are in the file and how many variables. We can also look at more interesting things, like how many of those variables are numeric versus how many are strings; in this instance, we've only got two strings out of 188 variables. We can see some other metadata as well that may or may not be as interesting, but we can also see that we've got a bad file name check that's failed. In this instance, the file name should match the user-specified pattern, and currently the main offending characters are the percent sign and the underscore, as our pattern, which we'll show you later on, doesn't allow for these characters. But we can also drill down into other checks. For example, we can take a look at the variable odd characters check, and clicking on that takes us down to a more detailed view. It has located two variables in this instance, a variable ownTV and a variable v137, that have got odd characters that, based on our configuration settings, are not considered acceptable, and it has highlighted that to us as a failure. We could go back in and decide that we want to keep those characters, and remove them from the variable odd characters list, which is currently looking for things like ampersands, hash symbols and at signs, or we could go and remove them from the SPSS file and go from there. The configuration file is very, very simple. It's currently located in the config directory, and you can open it with a normal text editor; I'm going to open it in Notepad++ right here.
The file is written in YAML, which stands for YAML Ain't Markup Language, because programmers like to be funny, I'm afraid. In here, each check has a section, and each section has a block that represents the check itself. In this instance we've got the bad file name check. It has a description, which will show up in the user report; so if you want to support multiple locales, for example you want to run this check and you have a Welsh speaker or something else, you can change this text here so that the report reflects that. And then here we have the regex pattern that our file name check matches against, which can be changed by the end user. We have multiple other checks as well: for example, we can look through and see the variable odd characters setting, and we have all of the characters listed out that we don't like in a variable. To specify that we'd like to run with our own configuration file, you have to pass the -c flag and then specify where the config is. You'll have to excuse me while I remember the commands. So here we're running with the -c flag, which specifies that we'd like to use this specific config file, the one that we currently have open on the right. This is where I've broken it before; there we go. Now when we refresh this, we haven't changed anything, so everything's going to be the same still. But if we go back into our configuration file and add the underscore and percent signs into our basic file check right here, and I hope you can see that, it's not the largest font, there we go: we have added the underscore and the percent signs. We can save that file, go back over to our command prompt and rerun that command, and we can wait while that's running. Fingers crossed, if I've made my sacrifices to the demo gods.
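Put together, a check block along the lines just described might look something like the sketch below. The key names here are approximations for illustration, not copied from the shipped configuration file:

```yaml
# Illustrative sketch of a QAMyData-style check block -- key names
# approximate what is described in the demo, not the shipped file.
bad_filename:
  description: "File name matches the user-specified pattern"
  pattern: '^[a-zA-Z0-9\-\._%]+$'   # underscore and % now allowed

variable_odd_characters:
  description: "Variable labels contain no odd characters"
  characters: ["&", "#", "@", "!"]
```

Because the description text travels into the report, translating or rewording a check is just an edit to this file, with no change to the tool itself.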
The bad file name check has now changed to a pass, because we have allowed these characters in our pattern that weren't previously allowed. From here, you can take a look at the GitHub page if you're interested in downloading or having a play around with QAMyData, and feel free to open an issue with us at any point, or send us an email if you have any questions or queries at a later time. I'm going to hand you back now to Louise, and hopefully we should be able to answer any of your questions. Just coming back to my screen for a minute; I just want to share the slide page. Okay, so just to continue a little bit with the presentation, and excuse the screen there. Just a little bit about the software, and you will have a chance to ask some questions. We wanted to use open source libraries, and a library that would actually import and export the various file formats was quite hard to find. We looked at the Java libraries that can read SPSS, we looked at R libraries and we looked at Python libraries, and we really couldn't find any that would allow us to do what we wanted to do. So we found something called ReadStat, which wasn't that well known at the time, but actually it is used quite a lot, and we're quite impressed with it. The ReadStat library is a command-line tool and a C library for reading files from particular statistical packages. It was designed for a free statistical data analysis package for the Mac, which is interesting. It supports a variety of common file formats, it's been in active development since 2012, and it seems to be receiving security patches and fixes; it's definitely being maintained as a kind of community tool. So if you want to ask anything about Rust you can ask Miles, but we did try various wrappers to wrap around this: we tried Java, Clojure, R and Python, things like that.
Miles tried just about everything, and eventually ended up using Rust, which is a really nice environment; he said that it demands that things work, and actually they do. It's very good for fast-running executables and it seems to be very reliable: you can eliminate many bugs at compile time, so you don't have to wait until runtime to find them. It seems to have really good documentation and to be quite friendly overall, so if you want more information about that you can ask Miles or follow up with him. The tool is available on Linux, Windows and Mac; as Miles showed you, there's a GitHub page where you can download it and use it. It's very lightweight to run, you can set the configuration file and put your own thresholds in, and there's quite a lot of concise documentation on the wiki page. If you've got ideas for new tests that you think would be important, we'd really like to hear from you. It's released under the MIT licence, and we are hoping to deploy it as a service sometime this year, so that users maybe don't have to run the command line, although we do think all users running statistics should be able to use a command line, and I think more and more that is the case. So that's our GitHub space; you can go and look at it, and we'd really like you to try it out. Just to say, we have actually evaluated this a fair bit over the last year. Our own staff use it as part of our ingest work. We've spoken to a number of peer data repositories, some of whom I think are looking at it or using it, and we've had various workshops with a whole variety of stakeholders: people who own data, data managers, lecturers who are teaching quant methods, who are all interested and can see the potential. Again, we'd like to hear back from you if you are going to test it. We really do want to try and push the idea of being quite open about a data quality profile.
Now, there's a lot of work going on to implement the FAIR guidelines and put some kind of measures in place to score data, and that can be quite difficult, because with FAIR, findable is not equal to accessible, which is not equal to reusable: there are lots of different domains, which makes it hard to score something. But I think with a data quality profile you can say very clearly what's acceptable and what isn't. You may say, we don't accept any unlabelled missing values; that's very straightforward. So we think it's a really good way of saying, here are some very clear rules on what we are prepared to accept and what we aren't. We have a web page which lists the available tests. It's got a download-and-run guide, which Christina compiled, which is really detailed: it goes step by step with every single screenshot, so if you follow it you really can't go wrong, I don't think. If you do find anything wrong in there, you can email us and let us know, but it's really straightforward, and it has very nice instructions for how you might edit the config file to set your own thresholds. We've got some teaching resources: a messy teaching data set with all the kinds of problems and issues we showed you, some slides for teaching this, and some exercises you can use in the classroom. And we've recently started a blog around the tool itself, and also a more technical one, so we hope you find those useful. I was going to ask you on the poll, which doesn't seem to work, how you might use it in your own work, but perhaps we can let you have a think about that and you could maybe ask us questions.
I just want to thank the whole team. John Johnson led on the specs and the development, and Anka Vlad is our repository manager for ReShare, our self-deposit repository; she had lots of input on issues that she found in data coming in on a weekly basis. We also worked briefly with the Australian Data Archive, who wanted to put a front end onto the tool and added one very quickly. It doesn't allow you to configure the tool, but I think there are lots of ways you can integrate this into your own workflows and do what you want with it: the code's there and you can pick it up and use it. Okay, so that's really all from us. If you want to keep connected, we have an email address for QA My Data, we've got a data service mailing list, and we're on Twitter as well, and we'd like to hear from you. So what we're going to do now is take some questions. We're not going to open up the audio, because it can get a little bit chaotic, but if you do have a question, put it in the box and we'll monitor those and answer them between us. We'll probably write up a summary of the questions and answers in case they make a useful FAQ. Right, I'm just looking at the questions; there are quite a lot of technical ones, okay, so mostly technical. Let's take them in order. We've got a question about how to deploy on Linux, so Miles, perhaps you can tell us. Yes, so if you take a look at the GitHub repository, on the wiki we have a fairly detailed guide with installation instructions for getting set up on Linux. The majority of the hurdles will be getting a Rust development environment set up, but if you're fairly Linux-based anyway, it's no more difficult than installing most things on Linux. All of the packages are listed on the GitHub page for Ubuntu 16.04 and 18.04 respectively; both work fine, and both were the development environments for writing the tool in the first place, so it works
flawlessly on Linux. As for migrating data files to Azure: when you would want to run the tool depends on the type of data you're looking at. If you're looking at lots of smaller data it might be worthwhile to run it as you transfer it across, whereas if you're looking at lots of large files, or a considerable number of smaller files, it may be more worthwhile to wait until you've transferred all of the data over to Azure first. You could even make it part of the data transfer pipeline while migrating your data. So hopefully that's answered your question, Claire. Okay, good. We've got a question about the teaching materials, so let me display that. I shall go back; hopefully you can see the screen. These slides will be available, and we'll let you know where to find them afterwards, but it's on the UK Data Service site: if you search for QAMyData UK Data Service you'll probably find the page, and there's a list there. I could just show you the page; there you go. If I show you this page, you'll see QA My Data there on the UK Data Service site, under R&D, and it's got a list at the end of all the things you can download: the presentation, the tests, a guide and the teaching examples. As for the question here about examples of disclosure risk: QAMyData itself doesn't actually score disclosure risk. You can have it look for common disclosure risks such as direct identifiers, so for example you can have it look for emails using regex, but it doesn't provide any form of overall score as to what kind of disclosure risk a specific file poses. So it's good for looking for specific things like phone numbers, postcodes or email addresses, counting those and giving you direct pointers to where they exist, but it's not particularly good for telling you
how much of a risk a specific file poses and there are better tools out there for such as STC and micro which Christina will be able to answer any questions about should should they should their eyes and just to say the point there really is it's just for for things that I really shouldn't be in a data set that when you're doing a quick scan so we're not suggesting this is a complete disclosure risk tool although if you are worried about a particular companies or names or place names of a region you could actually just have a dictionary with them in and check them so it very much depends what you're looking for but you can you can plug any type of name density dictionary in there so so potentially it could be quite useful we haven't done any large ones but we'd like to it would be useful to try on maybe bigger data sets to see if it takes a long time but it's really around them direct identify right we're just looking some more questions here so to answer your question Lindsay there about whether a good tool can see data on SharePoint as it currently stands there is no support for that and I'm not particularly sure how easy that would be to get that sort of thing working again we would normally with Curamada you'd be looking to run this either hosted somewhere where you'd have some form of tech department hosting this on your on your behalf and you submit files to it in a more data pipeline style of fashion where you submit the file to a data store there or you run it on your local machine and most of the time we advise running it on your local machine but again as it currently stands we don't currently have access to SharePoint data through Curamada at all no but what you could do is take a if it's kind of moving data you could take a snapshot as a CSV and run that through because the tool codes with CSV quite well so it may be that you take snapshots every day and run that particular file through so I think that would be something will be a kind of compromise 
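The kind of quick scan described above, pattern checks for things like email addresses plus a pluggable name dictionary, can be sketched in a few lines. This is a hypothetical Python illustration, not QAMyData's actual Rust implementation; the regular expression, the `NAME_DICTIONARY` wordlist, and the sample CSV are all invented for the example:

```python
import csv
import io
import re

# Illustrative pattern and wordlist; the real tool's checks are configurable.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NAME_DICTIONARY = {"Smith", "Jones"}  # any wordlist could be plugged in here

def scan_csv(text):
    """Return (row, column, value, reason) for each suspicious cell."""
    hits = []
    for r, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        for c, cell in enumerate(row, start=1):
            if EMAIL.search(cell):
                hits.append((r, c, cell, "email"))
            elif cell.strip() in NAME_DICTIONARY:
                hits.append((r, c, cell, "dictionary"))
    return hits

sample = "id,contact,surname\n1,a.smith@example.org,Smith\n2,none,Green\n"
for hit in scan_csv(sample):
    print(hit)
```

As in the tool, the point is a count of hits with direct pointers to where they occur, not an overall risk score; a real configuration would also cover phone numbers and postcodes.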
Right, let's have a look at the remaining questions. Okay, we've got a question about whether we use any other tools for disclosure; yes, we do, and I'll cover that in a minute.

Lawrence has a question here about missing values. The tool does differentiate between the different levels of missingness. If you have a value set to minus nine, for example, and you declare it as a missing value and assign it a label, the tool won't complain about it being missing; it treats it as an acceptable, declared missing value. For non-responses, where the data has simply not been entered into the file, in SPSS for example that's considered a sysmis value, and that can be treated completely differently. There are markers within the file, like metadata attached to each value, so the tool can tell the difference between a complete non-response and something that's missing because it's supposed to be missing.

Okay, on the question about other tools: we do currently use other tools for disclosure risk at the UK Data Archive. As far as I'm aware we use sdcMicro. Christina, do you want to say a little about how we use that? We don't use it for everything, but we do use it sometimes.

Hi everyone. At the UK Data Archive we actually use a mixture of QAMyData and sdcMicro, because especially with large governmental surveys it does sometimes happen that direct identifiers are present, and QAMyData is helpful for those. However, when it comes to indirect identifiers and combinations of variables, we use sdcMicro, the R package itself, in house, though we also teach the graphical user interface of sdcMicro. It provides all sorts of risk calculations, including global and individual risk. The individual risk is very handy because it lets you make a common-sense decision about whether a pattern in your data is actually risky or not. But if you have any
questions about sdcMicro, you can always drop me an email. I am a big fan of the package, so if you would like to deploy it in your curation workflow, please just contact me at my email address.

To quickly answer Claire's question there: we have considered creating an R package for this tool, but my knowledge and understanding of R is limited, to say the least. It's something we might consider in the future, but as it currently stands it would be more work than we are willing to put in, just because of the way interoperability works between Rust and other languages. It is very possible and it can be done, so it's something we could consider going forward, particularly if it turned out to be a big feature request.

Next, to answer Graham's question: if you were to embed certain settings, for example checks for specific odd characters, they are saved to your configuration file, which is kept on your local machine. As long as you keep that configuration file safe, and preferably backed up, any modifications you make to it will persist over time. So as long as you always use the --config switch and specify the same configuration file each time, you will have repeatability, and you'll also have a very easy way to share your configuration.

What else do we have? Does anyone else have any questions for the next couple of minutes? If not, feel free to contact us at qamydata@ukdataservice.ac.uk with anything you suddenly think of. We will be putting this webinar up to watch again, and if you consult our pages and want to find out more, do go and have a read of everything we have available. If you have further questions, please get in touch with us. But thank you very much for attending and for listening, and enjoy our QAMyData tool and the data quality it's going
to bring you. Okay, thank you very much. Thank you. Bye.
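As a postscript for readers, the missing-value distinction discussed above, declared missing codes versus system missing, can be sketched as follows. This is a hypothetical Python illustration rather than the tool's Rust implementation; the income column and the -9/-8 codes are invented for the example, whereas in an SPSS file the declared codes and their labels would come from the file's own metadata:

```python
import csv
import io

# Illustrative: declared missing codes, as a user might configure them.
DECLARED_MISSING = {"income": {"-9": "Refused", "-8": "Don't know"}}

def classify(column, value):
    """Label a cell as 'value', 'declared missing', or 'system missing'."""
    if value == "":
        return "system missing"    # nothing entered (like SPSS sysmis)
    if value in DECLARED_MISSING.get(column, {}):
        return "declared missing"  # coded and labelled, so acceptable
    return "value"

sample = "id,income\n1,25000\n2,-9\n3,\n"
rows = list(csv.DictReader(io.StringIO(sample)))
labels = [classify("income", r["income"]) for r in rows]
print(labels)  # ['value', 'declared missing', 'system missing']
```

A checker built this way can, like the tool, accept a labelled -9 without complaint while still reporting genuine non-responses separately.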