Hello. I'm Steph Simpson. I'm a systematic reviewer at the Institute for Health Metrics and Evaluation at the University of Washington in Seattle. I'm here to talk to you today about automating data cleaning and documentation of systematic-review extracted data using an interactive R Markdown notebook.

My institute takes in data from many different kinds of sources to inform our modeling. We have a very large set of coherent, detailed models covering many years, locations, diseases, and conditions: the Global Burden of Disease project. We've got other global-scale health projects going as well. We do about 40 systematic reviews a year. These take in peer-reviewed primary scientific literature and fit it into our research databases so it can inform those models. A quirk of our institute is that the extracted databases are usually handed off from the reviewer to a different person for analysis and modeling. That lets dataset cleanup fall through the cracks; it can get left off at the handoff. So we've got a problem with that.

Data extracted by hand are inherently messy, patchy, and inconsistent. Uploading messy data wastes compute infrastructure, analysts' time, and reviewer effort on rows that never make it in. And if the dataset ends up biased or has incorrect numbers in it, that can lead to unrealistic models.

We need a systematic approach to cleaning. Once we've decided the extractor should clean the data before handoff, the next systematic step is a checklist, but working through one in Excel can be very tedious. Making the computer do the checks saves some of that tedium, and the extra usefulness out of that is that the code is reusable. One more step makes it even easier: an R Markdown notebook allows even easier code-based checks. An R Markdown notebook acts a lot like a lab notebook, where you can take notes, paste in your experiment or actually run your code, put in figures,
all those kinds of things, detail it, and print it out as a record of what you did. So you get the documentation out of it as well.

I have a repo holding all of the code that I'm presenting here today, and I'll post that again at the end.

I want to talk about the code automating the cleanup. In the previous iteration, we had a script that launched functions with a configurable set of inputs, all in R. That works great, but it's a little fiddly to get configured. What I'm working on now is this interactive notebook, and it doesn't fully run yet. So for the demo today, I'm going to show you the example dataset I came up with, give you a live demo of the previous working code, and then a narrative tour of the notebook.

I won't go into the details of my test dataset, but it's got a lot of different kinds of fields covering all kinds of different areas. It's not a huge extraction, but it would be tedious enough to check by hand. We have an internal policy not to share unpublished data, though, so I needed to find an example dataset to share with you. The Infectious Diseases Data Observatory to the rescue: they're an institute at the University of Oxford in the UK, and they have several published datasets from systematic reviews. One of them covers soil-transmitted helminths, such as hookworm. That is like the data we extracted on the neglected tropical diseases team, where it's one of the causes. And from what I can see, their data-use policy says I can use the data this way. So there's a script to download their data and prep a dataset similar to mine with it, so that the tests will work fine, and I've got that script in the repo.

Here is a glimpse at this example dataset, selecting some fields that I will talk about later; these ones are going into a unique key in a minute, but this gives you an idea. The next step is to show you the working code base using this example file.
This code came out of a final project for a class I took in Health Metrics Sciences here at UW, with fabulous collaborators Rose Bender and Ali East, stellar graduate students in our department and institute.

There are three files I'm going to talk about next. Here we are in my RStudio session: I've got a README file, the parent file, and a configurable file. I'm also showing a little bit of the check functions; I won't get into those, but they feed into this. The README shows you what it's about and how it works, and gives you broad instructions on setting things up. The config script collects all of your inputs in one place and saves them as a set of arguments accessible to the parent script. The parent script runs them, calls those child scripts over here, and saves everything out as a report of which rows failed which tests in which columns. We'll talk about those in a moment.

Here is the parent script, where users update two file paths. Everything else comes either from the config file or from the input data, and the rest of it just runs.

Here's the config file. I've set it up already to match today's demo, so I'm just going to run it. You start out defining the source directory, the path to the input data, and the path to the output root directory; it creates a whole new output directory from that, with an internal folder dated with today, so that you don't get confused with runs stacking up on top of each other.

Here are the checks themselves. These are the input arguments to the different checks: a check for missingness, a check for duplicates, and a check for valid values within a column. The missingness check checks individual columns to make sure there's no hole in the data where you don't want one.
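The config step described above can be sketched as a short R script. Everything here is a minimal, hypothetical illustration of the pattern (collect inputs, make a dated output directory, save arguments as an RDS file): the file names, column names, and the validation rule are my placeholders, not the actual internal code.

```r
# Sketch of a config script: gather all user inputs in one place,
# create a dated output directory, and save the arguments as an RDS
# file for the parent script to read back in. Paths use tempdir() here
# purely so the sketch is self-contained.

source_dir  <- tempdir()                          # where the RDS will live
input_path  <- file.path(tempdir(), "extraction_example.csv")
output_root <- file.path(tempdir(), "output")

# Dated subdirectory so repeated runs don't stack up on each other
output_dir <- file.path(output_root, format(Sys.Date(), "%Y_%m_%d"))
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

config <- list(
  input_path = input_path,
  output_dir = output_dir,
  # columns that must not contain missing values
  missingness_cols   = c("nid", "location", "year_start", "year_end"),
  # columns concatenated into a unique key for the duplicate check
  duplicate_key_cols = c("nid", "location", "year_start", "year_end", "sex"),
  # validation rules: column name, logical operator, value to compare against
  validations = list(
    list(col = "age_start", op = "<", value = 19)
  )
)

saveRDS(config, file.path(source_dir, "cleanup_config.rds"))
```

The parent script would then only need the path to this RDS file (and the input data) to reconstruct every argument.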
The duplicate check concatenates your input columns together to make a unique key and asks: are there any rows that have the exact same combination of these? And the validation check is more complex: it has a column name, a logical operator, and a logical condition to be met. So if we want the minimum age to be less than 19, as in this example, and it's not less than 19, the check will flag it for us.

So far we've got the output directory; next we record the criteria for the validation checks. Then the config script saves all of these inputs as an RDS file elsewhere, which will be read back in by the parent file.

Again, the parent file has directions that say to update just these two items. So there they are, and then it can run all the other functions. So here we are: collecting the list of functions to run and sourcing them. That's the call that makes them available to be called upon. Some packages. Here's where we import the arguments from that RDS file that we saved out. Here's the input data, loading that up; we've got the observation rows and columns. And that is everything: the rest of it is just running what we've already asked for.

So here's the missingness check: one observation with missing values, in this column and this row. The duplicate list. This is handy: when you've fixed everything, you're going to get answers that say "no duplicates found" or "no other errors found". Finally, the last one: ten observations that don't meet the last criterion. And then we can save all this. We get out of it a report that repeats it all over again, but we get a printout, and it tells us that our output saved, so we have documentation on which rows need which fixes. Super, super useful. It's a little crude, as you see from the output there, but it is still going to be useful. Oh good, that worked.
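The three checks could be sketched in R roughly like this. The function names, arguments, and example data are my guesses at the shape of what's described, not the actual check functions from the repo: each check returns the row indices that need fixing, and prints an all-clear message when nothing is flagged.

```r
# Hypothetical sketches of the three checks, written against a data.frame.

check_missingness <- function(dat, cols) {
  # Rows with a hole in any of the given columns where we don't want one
  bad <- which(rowSums(is.na(dat[, cols, drop = FALSE])) > 0)
  if (length(bad) == 0) message("No missing values found")
  bad
}

check_duplicates <- function(dat, key_cols) {
  # Concatenate the key columns into a unique key, then flag every row
  # that shares its key with another row
  key <- do.call(paste, c(dat[, key_cols, drop = FALSE], sep = "|"))
  bad <- which(duplicated(key) | duplicated(key, fromLast = TRUE))
  if (length(bad) == 0) message("No duplicates found")
  bad
}

check_valid_values <- function(dat, col, op, value) {
  # Flag rows where the condition (e.g. age_start < 19) is NOT met,
  # counting NA as a failure so it gets looked at
  ok <- do.call(op, list(dat[[col]], value))
  which(!ok | is.na(ok))
}

# Example: flag rows where age_start is not < 19
dat <- data.frame(nid = c(1, 1, 2), age_start = c(5, 25, NA))
check_valid_values(dat, "age_start", "<", 19)   # rows 2 and 3
```

In the parent-script pattern described in the talk, functions like these would be sourced, the saved arguments read back with `readRDS()`, and each check run against the input data in turn, with the flagged rows collected into the report.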
Let's look at how far we've got with turning that into an R Markdown notebook to make it even easier. I have the file open here for you. I do have to give credit to the people who helped me get this far. Not quite everything I know is from Jenny Bryan, but her book helps you get your R projects, your RStudio, your Git, and everything set up working nicely together. Super useful, as are the chunk options from Yihui Xie and all their R Markdown help.

My R Markdown notebook in RStudio, with the usual YAML output set to notebook, has a standard setup code chunk and some packages at the top. Then, instead of a README, I've got a statement here in the notebook of what it does; this is very similar to, formatted the same as, what we just watched.

I'm calling this section a form to fill in. For each of these sections I've got narrative instructions and a syntax-formatted example, including the one I showed you in the other R script. Setup. I've got some places for feedback, commented out right now so I don't get error messages, with instructions. Some of this has instructions saying there's nothing for you to do here, but again, more feedback.

Then I've got the three checks: for missingness, duplicates, and valid values. Each of these has an explanation, the input section, the output to expect, and then directions. There are two directions for this one, because you want to define what counts as missing; you can add to the previous definition just by adding another OR, in case you need to. And this is still part of looking at missingness: which columns to check. Again, formatted blanks and formatted examples. Here's the duplicates check; again, this one has an explanation, inputs, results, directions, a formatted blank, and formatted examples. And the more complex one, for checking for valid values, with an explanation, inputs, results, directions, and formatted examples.
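As a rough illustration of the "form to fill in" layout described, one section of such a notebook might look like the sketch below. The YAML header, chunk names, instructions, and the commented-out blank are all hypothetical, just showing the pattern of narrative instructions plus a formatted example plus a blank for the reviewer to complete.

````markdown
---
title: "Extraction cleanup notebook"
output: html_notebook
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Form to fill in: missingness check

**What it does:** checks the columns you list for missing values.
**Output to expect:** the rows and columns with holes, or "No missing values found".
**Directions:** list the columns to check, following the example, then un-comment.

```{r missingness-inputs}
# Format: a character vector of column names
# Example: missingness_cols <- c("nid", "location", "mean")

# missingness_cols <- c("")   # <- fill in, then un-comment this line
```
````

Keeping the blank commented out, as the talk describes, means the notebook still knits cleanly before the reviewer has filled anything in.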
We've got the end of the interactive part clearly marked, but also here's some information for you: where to find your report. This part says to still run the code chunks, but shows what the cleaning steps are; the cleaning steps are done automatically. This is still formatted just like the other sections. I will do some refactoring, but it still keeps reminding people to run the code chunks: source the functions, run the functions, write the report. It's the same thing, just easier for a lot of people to read.

Let me go over again what I showed you. If you do no cleanup at all, you get dropped rows and wasted time, among other things; we don't need that. A checklist is a systematic first approach. The configurable script takes away a lot of the tedium and the potential for human error of just missing things; low-reward work that demands attention is really draining. The R Markdown notebook is easier to use for a lot of our reviewers, including me to some extent, who are less code-proficient. Again, we've only got so much attention to spare; let's save the cognitive effort for doing the systematic reviews. Being easier, it may be more likely to be adopted widely across the Institute, which would lead to a bigger return on investment: more and faster systematic reviews, fewer dropped rows, better data, and of course you can automate those reports about what the heck happened, how did you clean this.

Thanks for listening. It should be about 4:15 my time in Seattle when you're watching this live, so I will try to be around to answer text questions, but I may not be, and I'll get to them when I can. And again, the code repo is available to you. Thank you for your time.