Okay, so this workshop is on secure medical data collection: best practices with Excel, and leveling up to REDCap and collaborator (capital R in REDCap; collaborator is a particular package in R). I'm going to walk through our structure today. I'm going to talk about best practices in Excel: if you're using Excel, or your collaborators are using Excel, how to help them produce the best and cleanest data possible. But we're also going to make the case for leveling up to REDCap and to packages that are part of the REDCap ecosystem, REDCapR and collaborator, the latter from Kenneth McLean. So I'll get started with best practices for clinical research data entry in Excel and spreadsheets. We're going to focus largely on Excel and Excel's quirks because it is the most common spreadsheet in use today. For this workshop we're going to assume that you have access to Microsoft Excel, or to a free tool that will let you work with Excel files, like LibreOffice or OpenOffice; that you've downloaded the Excel files for this workshop from either Google Drive or your email; and that you can use Zoom to put questions in the chat window. And definitely post reactions: let us know, particularly during the exercises, whether things are going okay. Everybody make sure at this point that you have a copy of Excel or the equivalent and that you have the three spreadsheets for the exercises, and let us know in the chat if you're having trouble. If you don't have those, this is the link here for the Google Drive. I'll escape out of the slide show and paste that into the chat, just in case, and then go back to the slide show from the current slide. So hopefully everybody can get those. We won't hit the first exercise until about slide 60, so you've got some time to get those downloaded. Our goals for the first hour: learn about the pitfalls of Excel; learn how to structure tidy data, which is a particular kind of data structure; learn how to store your metadata (descriptions or notes about the data) along with, but separate from, your data; learn about some common data errors and how to avoid them; and learn about free alternatives to data management with Excel which are superior on many fronts. We're going to focus on REDCap. These alternatives require a substantial investment of time, but as an investment in your research career we really think they are, in the long term, absolutely worth it. So one of the big pitfalls of Excel is that it's super easy: you can just open it up and start typing in your data. That doesn't necessarily produce good outcomes. As in this Far Side cartoon, you can't just throw things into a data set and expect it to be analyzable; you have to be organized and have a plan before you even open up Excel. Excel is everywhere, but it was designed to help small businesses: with data entry, acting as a sort of database, performing calculations, formatting tables, visualizing your data, all in one, a sort of Swiss Army knife for small businesses. But Excel is not great at any of these, so Excel might not be the right tool for your job. We're going to talk about ways to make it better, but also, in the long run, why you might want to choose better tools. Using Excel is in some ways like falling into a pit of "not great" that is hard to escape, and we're going to make the case today that your research deserves better. So what tools do you actually need for reproducible and secure medical research data management?
Ideally you'd start with a data entry tool with data validation, so it checks the data type (integer, text, or date) and the range, to make sure each value is reasonable. It should be hard to overwrite data by mistake. Data should be secure and private, especially protected health information. It should be good for collaboration, so there's an audit trail of who changed what cell at what time, and hopefully notes on why they did that. The data should be easy to analyze, which has a lot to do with data structure; the data analysis itself should be reproducible; and any errors should be obvious and easy to detect. The reality of Excel is that it's really designed for small businesses and it doesn't scale well; it is not designed for big data, and with big data it tends to fail. Excel is filled with compromises. It's very easy to overwrite your data: you can type anything into any cell, even if it's the wrong data type or a clear and obvious typo. It's very easy to overwrite formulas or edit them by mistake, and the formulas themselves are hidden, which is great for printing but really bad for checking your formulas. Point-and-click, copy-and-paste data manipulation, especially with multiple steps, leads to undetected errors. Our habits and wide computer screens encourage very wide data, left to right, but Excel is actually designed for very tall data: it allows 16,384 columns but more than a million rows. So really the goal is to have tall data, but we often don't do it that way. And there are many tools on the web that can open password-protected Excel files, so your PHI is not terribly secure even if you have a password. And, possibly most dangerous of all, Excel tends to fail silently. So, a few Excel spreadsheet horror stories for you. In 2020 the UK COVID tracing program lost thousands of COVID cases when Excel ran out of rows. They had data in CSV format that were imported into XLS templates, which are limited to 65,536 rows. The extra data rows were silently discarded, and this is one of the dangers of Excel. Someone eventually noticed that they were receiving exactly 65,536 rows every day, eight days in a row. Over those eight days, more than 15,000 cases of COVID-19 were not traced; contacts were not tested or notified. One response to this: "Excel was always meant for people mucking around with a bunch of data for their small company to see what it looked like. When you need to do something more serious, you'll build something that works at scale, but you wouldn't use XLS. Nobody would start with that," from Professor Jon Crowcroft. I think this is generally thought to be true, but a lot of folks who don't know any better are using Excel for applications that it doesn't scale to. 88% of forensically examined business spreadsheets contain errors, and more than 20% of gene data sets submitted with scientific manuscripts contain errors. An infamous Duke microarray study of precision cancer therapy, due to multiple Excel errors, labeled less effective cancer therapies as more effective, leading to bad outcomes in gene-expression-driven clinical cancer trials; this has been in the news a lot, and I'll drop these links into the chat a little later for you to explore. A silent spreadsheet miscalculation in 2012 contributed to a loss of over $6 billion by JPMorgan, known as the London Whale trade; this was an error caused by manual point-and-click manipulation of spreadsheets.
Point-and-click manipulation of data, especially with multiple steps, is prone to errors and is not a good practice for reproducibility. There are a lot of spreadsheet pain points. Data entry itself: spreadsheets are bad for this, and there are lots of errors. The typical error rate in studies of spreadsheet data entry is somewhere between 1 and 6.5% of cells, and you can imagine that in a large spreadsheet that adds up to a lot of errors. In the FBI's National Crime Information Center, a database that collects data from multiple sites and tries to merge them, 15% of records were in error or identified the wrong person, often due to a typo in a middle initial or a name. Spreadsheets often auto-format data, trying to be helpful, with unintended consequences: they interpret a lot of things, like the gene name MARCH1, as dates. 27 gene names, in fact, had to be changed in 2020 by an international consensus conference, because Excel will not change this behavior. So 27 human gene names were changed to accommodate Excel. Dates are particularly bad. Dates are stored as whole numbers: days since January 1, 1900, on all Windows versions and on Mac versions since 2011. But Mac Excel before 2011 used 1904-based dating, and ugly surprises can occur, because if you move dates between one spreadsheet and another, or into or out of Excel, it will do this conversion automatically. Silently. Another big problem with data entry is what's called an off-by-one error. If you think, oh, it's only 3 to 4% errors, that's not so bad, consider this problem, which is not specific to Excel. Say you're missing a return, like in column one: someone's typing systolic blood pressures down the column, they type 127 but don't quite hit the Enter key, then they type 136 and hit Enter, and they keep going. They get to the end, they've run out of numbers, so they leave the last cell blank. If you ship this to your analyst, you hope that they see the cell reading "127 136" and say, oh, this is 127, I wonder what this 136 is, and then figure out that, oh, that's patient five's blood pressure, and the values below belong to patients 6, 7, 8, and 9, and move them down. It's very easy, if you're not thoughtful about it, to go to the source document, change the cell to 127, throw out the 136, and say, oh, I guess patient nine never had a systolic blood pressure. Similarly, if you hit an extra return somewhere, you have a missing cell of data. When your data entry person, who is sometimes wildly underpaid, gets to the end and says, huh, that's weird, I have an extra number, you would hope they would let you know and figure out that, hey, this one belongs to patient four, this one to patient five. But if they just shrug their shoulders and move on, instead of having one error you've got a whole column of errors after that point. So it's really important to think about off-by-one errors and figure out ways to make data entry more foolproof. So, better data entry for medical research data: you want data entry forms with clear labels tying the patient to the field, rather than open sheets, to reduce off-by-one errors. It's best to use a tool that is secure and protects PHI, and it's best to use a tool that can validate data entry. Excel isn't necessarily very good at that: it will happily accept an impossible date, or a blood pressure of 1047.
And, you know, if you enter a potassium of "6.7 - hemolyzed", it will turn the whole column into characters rather than numeric. And if your N key sticks while you're typing "Hispanic", Excel will assume "Hispannic" is a completely new category, distinct from the original Hispanic with one N. There's also the issue of security and privacy: spreadsheets and Excel were really not designed for privacy. Excel, when it was designed, assumed that all data were on floppy disks, pulled out of the computer at the end of the day and locked in a drawer in a locked office. All the data were on a removable disk that could be locked in a safe; this was before hard drives were a thing. You can password-protect spreadsheets, but lots of programs on the internet can open locked spreadsheets without the password. Then there's formula security. Formulas are hidden in cells; they're not visible. That's good for printing but bad for checking the formulas: you can easily miss errors, and you can easily overwrite formulas just by typing; you can type data right over a formula. Nothing will stop you. There are no warnings and no audit trail; there's no way to track who changed a formula or why. An audit trail means you should be able to take your raw data, repeat every step, track every change (who made it, when, and why), and end up with the same final result. If you can't do that, you don't have reproducible data analysis. As an example, in 2021 I contacted four well-known authors of papers published in the last five years in my field, so not that old, to extract data for a meta-analysis. None of the four had access to the raw data. So there was no reproducibility for papers published less than five years ago. Another big issue is that Excel fails silently. A problem like too many rows, or changing formulas by typing over them, doesn't trigger a warning or an error or anything; it just lets the user do it. And when something goes wrong, there's really no way to tell. This is why forensic accounting has become a field: going through, checking every cell and every formula, to find out what really happened. Excel sort of encourages wide data. Not officially; technically Excel wants tall data. It allows a lot more rows than columns, suggesting you really should have tall data. But most users record data over time from left to right. Some of that is just our habit of reading from left to right; we also like to avoid scrolling, and our usual screens are wide, not tall. And this creates unwieldy data files with many identical column names. If you do multiple visits (screening, randomization, treatment one, treatment two) and record the date and systolic blood pressure at each one, you end up having to create an extra header row for the visit, which really should be a variable. And you have the column name "date" repeated four times, which is very confusing from the computer's point of view, and difficult to analyze when you have four distinct SBP measurements in the same spreadsheet. So why do good research people do bad data things? Some of this is habit: the screen is wide, we want data to fit on the screen, and we don't like to scroll, so we tend to make wide data. We read from left to right, and when we add new data we tend to add it on the right side, since it's new. We also don't like to type that much, and we want to avoid repetition, which leads to variables stuffed into the headers.
Excel also encourages us to encode data as colors; there's actually a button for cell styles. And Excel is silently "helpful" with data types, changing gene names to dates, changing numeric data to text if even one character is present. And since Excel enforces no rules, we can do anything, even if it's a bad thing. So, the limitations of spreadsheets: Excel is easy and very accessible, and it's being used for purposes for which it was never designed. Excel frequently fails, and fails silently. Manual point-and-click, copy-and-paste work leads to errors. Separating data entry, calculations, and reports is a very good thing, and it's important to use tools fit for purpose; there are certainly better tools than Excel. Our recommended data tool for medical data is REDCap. It does all the things: there's data validation, PHI is protected, and there are standard data entry forms for things like demographics. It's available for free at almost 5,000 institutions worldwide (it may be over 5,000 now) in 141 countries, with over a million projects and 1.6 million users. And this is where you want to end up, even if you're starting at the point of data entry with Excel. Now, you may say: I really need to do my data collection and data analysis now, and I don't have time to invest in learning REDCap. So this hour will focus on how to best use spreadsheets to minimize data problems. In the long term, if you're committed to medical research, it's worth investing the time to learn REDCap, and worth investing the time to learn some R, or to work with a statistician who uses R, open-source software, to analyze your data, so you can see the code and understand what's going on. So, onward to tidy data and best practices for spreadsheets. Tidy data is a data standard, and I'm going to talk about it over the next few slides. Data sets come in all shapes and sizes, but data manipulation tools need data standards. If you build your data set in a standard format, there will be a lot of tools that work with your data; tidy data has become the standard data format since 2014, and many statistical tools work best with tidy data. Tidy data is a standard way of mapping the meaning of a data set to its structure. In tidy data, each variable forms a column, each observation forms a row, and each cell is a single measurement; you don't combine two measurements in any given cell. The standard structure of tidy data means that tidy data sets are all alike and can be used by the same tools, but every messy data set is messy in its own way. Working with tidy data means you can use the same tools in similar ways for different data sets, while untidy data often means reinventing the wheel with one-time approaches that are hard to iterate on or reuse. Having a data standard promotes collaboration: it's easy for people to work together because you share a consistent data structure, and you can pass the data from one person to another to do different steps. And a tidy data standard makes building data pipelines easier, with standard inputs and outputs to each function. So let's walk through this a little, and we can work with this in the chat. We just talked about what makes data sets tidy and untidy. Why is this data set untidy? The hint is that you should only have one unit of information per cell. In the chat, anybody want to take a stab at why this data set is untidy? Feel free to jump in.
So, looking at this one: yes, two pieces of data in one cell. It's the "blood pressure at screening" problem: we've stuffed systolic and diastolic into the same cell, which is very common shorthand in medical data but makes it a lot harder to untangle them. Ideally the tidy version of the same data would separate SBP and DBP and identify the visit, separating the "screening" piece of the blood pressure label out into its own visit variable. Then if you had a visit two or visit three, those could be discriminated as well. So: one measurement per cell. It's also notable that tidy data often means more cells and a little more typing, which is why we're sometimes reluctant to do it. Here's another example of untidy data. Why is this data set untidy? The hint here is that you're going for one observation per row. Exactly: there are two different visits, two distinct observations going on, the pre visit and the post visit. So a tidy version of the same data has a time, or visit, column that holds the pre and the post. Notice that now we have one observation per row; tidy data is often taller, and often has more cells, in this case 18 versus nine. Here's a third example. Why is this data set untidy? The hint is that each variable, each distinct measurement, should form a column. Right: we've stuffed a variable into the header row. So we want to separate the visit into its own variable rather than stuffing it into the header: one column for visit, one column for SBP. Tidy data is often taller and has more cells, but it makes the data a lot easier to analyze, because it's in the standard tidy format. Unfortunately, when you bring untidy data to your local statistician, you often get the glare of death, because they have to spend a lot of time wrangling data in ways that could have been prevented if you had set up a tidy structure when you collected your data. So let's talk about good data practices: using spreadsheets to build tidy, clear, reproducible data that makes analysis easier and prevents problems later. I'm going to go through 14 good data practices. First: make clean rectangles of data, no more and no less, tidy data structures that are very clean, very vanilla. The sheet should have one header row of variable names, with no extra header rows, no titles, no notes; lots of data, with no blank lines or columns and no empty cells; and absolutely nothing else. Here's an example that is not a clean rectangle: there's an extra title, there are global notes on the data down here, and there's a note on an individual data point. These are all metadata that should be stored on a separate worksheet. A common problem is extra headers. Here's our header row, which we would think is a single header row, but we have an extra header row for visits, and we talked about this in the last few slides: we're actually encoding a separate variable, the visit number. So you need a column for visit number, named in your header row. You also don't want the same measurement to appear twice in your columns; if you're seeing that, it suggests you're putting more than one observation in a row. It's very confusing for the statistical analysis software to have SBP twice (which one is which?) and DBP twice. It's best to have unique variables throughout your data set, one for each column.
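To make that wide-to-tall fix concrete for the R users in the room: here is a minimal sketch, assuming the tidyr package (not part of the workshop downloads) and hypothetical column names modeled on the blood pressure example, of pivoting an untidy wide table into tidy form.

```r
library(tidyr)

# Untidy: one SBP column per visit, so the visit variable is stuffed
# into the headers (values are made up for illustration).
untidy <- tibble::tribble(
  ~subject_id, ~sbp_screening, ~sbp_visit2, ~sbp_visit3,
  1,           127,            131,         124,
  2,           136,            129,         133
)

# Tidy: one row per subject-visit, one variable per column,
# one measurement per cell.
tidy <- pivot_longer(
  untidy,
  cols         = starts_with("sbp_"),
  names_to     = "visit",
  names_prefix = "sbp_",
  values_to    = "sbp_mmhg"
)

tidy   # 6 rows: subject_id, visit, sbp_mmhg
```

The tidy version is taller and has more cells, exactly as described above, but every standard tool now knows what to do with it.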
Next: avoid empty cells. Why not have empty cells? Well, a truly empty cell looks (to a human) the same as a cell with a space, or a cell with three spaces where somebody just happened to hit the space bar a few times, but to a computer they're completely different. It's better to fill empty cells with a clear marker that the value is not available, and NA is the standard: capital N, capital A. In the past people used code numbers like -99, which seemed like a reasonable idea at the time, but it can cause real havoc if someone later does math using that number; it's easy to average in a -99 and not understand where the strange result came from. If you want to store details on why a value is missing, store them in your notes or your metadata, not in the data itself. You want to end up with clean data rectangles: one header row of variable names, none of which are repeated; every row is one observation; every column is one variable; every cell holds one unit of data; no blank lines or columns; any empty cells filled with NA; and no stray notes or titles, which you've stored elsewhere. Number two: use consistent variable names. Consistency makes computing a lot easier. Consistent case for variable names is really helpful: to humans, capital BP and lowercase bp are both blood pressure, but to a computer these are completely different, so in general avoid capitalization. Computers struggle with most punctuation, because punctuation characters are used for various functions, so avoid punctuation other than underscores or dashes. And humans have no trouble understanding something like "BP standing", but computers struggle with spaces: they wonder whether "BP" and "standing" are two different variables. So it's helpful to avoid spaces and replace them with underscores or dashes. Generally lowercase snake case, with an underscore between lowercase words, is recommended. Lower snake case gives you things like sbp_mm-hg (systolic blood pressure in millimeters of mercury), heart-rate_bpm, ast_iu, and albumin_g-dl; note the units are attached. That can be really helpful if you're looking at your data years from now and you're not 100% sure of the units, because the standard units have changed five years later. Put units in variable names if there's any possible way someone could guess wrong; imagine someone new opening the data set in 10 years. Typically you'll use the variable name, an underscore, and then the units, treating these as two distinct chunks: an underscore goes between chunks, and words within a chunk, if there's more than one, get a dash, so mm-hg, g-dl, mmol-l. Don't encode data with color, even though Excel encourages that; avoid using font color or highlighting. Your data should be plain vanilla. Highlighting and font colors are easy for humans to read, but computers are colorblind; computers need ones and zeros. The example here encoded the study arm as two different shades of green: not a great idea, since humans can see it but the computer has no idea. You want that variable encoded as a distinct column that's clearly labeled, as intervention or control, or one or zero, however you want to encode it. In general, it's a good idea to make it human readable as well as computer readable, so a short, clear text name is really helpful. Some folks like to use something like 1_intervention and 0_control, to have both a number and a label stuck together. I think that's generally a good idea, but whatever your preference is, as long as it's clear.
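If a data set arrives with inconsistent headers anyway, one common rescue in R is the janitor package; this is an assumption on my part, not something in the workshop files, but its clean_names() function rewrites headers into exactly the lowercase snake case recommended here.

```r
library(janitor)

# Hypothetical messy headers of the kind just described.
messy <- data.frame(
  `Subject ID`  = 1:3,
  `BP Standing` = c(121, 134, 118),
  `SBP (mmHg)`  = c(127, 136, 122),
  check.names = FALSE
)

clean <- clean_names(messy)
names(clean)
# Lowercase snake_case: spaces and punctuation become underscores,
# e.g. "subject_id" and "bp_standing".
```

That fixes names after the fact; it's still better to start with clean names at data entry.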
Be consistent with categorical variables: always use the same spelling and capitalization. Unfortunately, humans aren't nearly as consistent as computers. Humans know that "Hispanic", lowercase "hispanic", "Hispannic" with two Ns, and other capitalization variants are likely all the same value of a categorical ethnicity variable, but computers don't know that; they think these are four distinct things. And it's really hard to be consistent: to avoid typos, stray spaces, changes in capitalization, even inadvertent spaces at the end of a word, which are invisible to humans but look like a different category to computers. So you need to structure your data entry for success: use consistent capitalization, and set up drop-downs in Excel for each categorical variable. This can be done, and it's a great opportunity for our first exercise. Go ahead and fire up Excel and open the Exercise 1 .xlsx file, and I'm going to try to do the same on my second screen, once my computer finishes deciding to verify my copy of Microsoft Excel. You should see "study site" there. What we're going to do, on this data sheet, is create a drop-down for a categorical variable. We're going to add a new worksheet with the plus sign here, double-click on it, and call it cat_var_lists, for categorical variable lists. Now we have a sheet which will store the allowed values for our categorical variables. We're going to start with study site, and this header should be identical to the study site header on the first sheet, the data sheet. In the column beneath "study site" we're going to add four values: Ann Arbor, Boston, Chicago, and Detroit. So we have four values to work with there, in A2 to A5, and these are going to be the allowed values for this variable. We're going to set this up so we don't have to worry about capitalization, we don't have to worry about typos; it will protect us, to some extent, from those common human errors. Now that you've got this set up, go back to your data sheet and select the study site column; we'll go down, say, 26 rows or so. Then click on the Data tab in Excel, up here. If anybody's having trouble, say so in the chat; if you get behind or you're struggling, let us know, but if you're caught up, great. So you click on Data, then (you may need to widen the window to be able to see this) click on Data Validation, here in the Data panel: click the drop-down and click Data Validation. It's going to ask: what are your validation criteria for this column? I'm going to say List, meaning I'm going to allow only specific values that are in a list, and then click on Source. Now I'm going to give it the source of the list I want: I go back to the sheet cat_var_lists, move this out of the way, and select my values, not the "study site" header, just the four values. It should read like this: equals, the sheet name cat_var_lists, an exclamation point, and the range of the four values, =cat_var_lists!$A$2:$A$5. If you've got that right, hit OK. Jason is asking me to go back through that, so let me do it again: I'm in the data sheet, I select the column I want, then I hit Data, and then Data Validation, which is this icon here, and then I hit the drop-down and hit Data Validation, which pops up this window. And then I want to allow a List.
I click in here, and then I actually click out of the box, into cat_var_lists, to select the values I want. Once I have the values, I can hit Return, or hit OK, and that takes me back to study site. Okay, looks like Jason's caught up, and everybody else looks okay. Now you notice there's this little drop-down arrow next to the study site cell when it's selected. If we select the first study site cell and click, we can see there are only four options. We can put in Ann Arbor, we can put Chicago, we can put Detroit; typing a capital B will bring up Boston, a capital C will bring up Chicago. It's working well. What's interesting here is that if I try to type in a new name, like El Paso, I get: the value doesn't match the data validation restrictions defined for this cell. I've got to hit Retry, start over, and type an allowed value. So you can limit what goes in. Ray asked: why limit it to the first 26 rows and not the whole column? I could absolutely do the whole column. I can do the same thing, just to walk through it again: Data, Data Validation, and I can extend it to these cells, yes, this is what I want, sure. And then I can have, you know, an infinite number of rows in this column. So Ray is absolutely right, I could do that. And even if I click on an unentered row, it has the drop-down, and I can select one of those four values. Okay. Please put questions in the chat; these are great questions. But I think that gets us where we want to go, so I'm going to stop sharing this and go back to sharing my PowerPoint. So, recapping: we entered values and pressed Return, and if you type in an unallowed value it won't let you. I think we've got that down. Caitlin asked: if you don't click the whole column, it won't apply by default? That's correct; if you want the whole column covered forever, click the whole column. And yes, it does assume the first row is a column name. I think we're good on the questions; feel free to ask more if anything is not clear. Yes, you just select the values; I think I get what Caitlin's asking. In the cat_var_lists sheet, I selected just the values, not the header name. The header is just there for my reference, so I'd know which variable is which. Eventually, if you have a big data entry project with multiple categorical variables, your cat_var_lists sheet may have 10 or 15 variables on it, so it's helpful to have the headers there so you know which variable each set of options goes with. So for all of your categorical variables, your factor variables, you have a list of all the allowed options. Okay, everybody in good shape on that? Feel free to put more questions in the chat if not. Okay, I'm going to go ahead with dates. Be consistent with dates. Unfortunately, dates are a messy, poorly standardized data field, and there are so many ways to mess them up. It's common in the US to start with the month, common in Europe to start with the day. You can use numeric months or text months; you can use two- or four-digit years; you can use dashes, slashes, underscores, and other separators; and many more. This creates many, many problems for international collaboration. Fortunately, there is actually an international standard: ISO 8601, established in 1988 by the International Organization for Standardization, no surprise. The standard is four-digit year, two-digit month, two-digit day.
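As a quick aside for the R users, a hedged sketch (assuming the lubridate package, which is not part of the workshop downloads) of why this standard pays off downstream: ISO 8601 dates parse directly, sort correctly even as plain text, and impossible dates fail loudly rather than silently.

```r
library(lubridate)

ymd("2021-02-14")   # parses cleanly as a Date
ymd("2021-15-15")   # there is no 15th month: returns NA with a parse warning

# ISO 8601 text sorts chronologically even before parsing.
sort(c("2021-02-14", "2019-11-30", "2020-07-04"))
```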
Computers get this standard: they can sort dates in this format in chronological order, which is pretty darn helpful. ISO 8601 also includes standards for hours, minutes, seconds, time zones, and lots of other things, and you can look it up at this link. And you can format your Excel date columns to ISO 8601. Let me hop back and demonstrate that; I'm going to switch my sharing again. Say in column B I have "date", and I enter a couple of valid dates. (It looks like I may have already formatted these, so let me select the column and, with Command-1, put the standard formatting back on first.) If you want to format a column to ISO 8601, select the whole column, with whatever dates you have, and hit Ctrl-1 on Windows or Command-1 on the Mac; that takes you to the Format Cells window. The third option down is your ISO 8601 format: it starts with four-digit year, then two-digit month, two-digit day. Click OK, and all subsequent dates will be converted as they're entered, so even if you type in 1/4/18, it'll convert it, which is pretty darn helpful. It's not perfect, though. You can type in dates and leading zeros will be added, so 1/4 becomes 01-04, which is standard ISO 8601, but it won't prevent things like a 15th month. If I enter 15/15/15, it will accept that. And it'll accept it silently; it won't say, hey, that's not a valid date. What it will do, since it doesn't recognize it as a valid date, is format it as text, and that's why it's left-justified. But that's silent unless you know to look for it, and imagine your column is pretty narrow: it's not obvious that this one cell is left-justified when the others are right-justified, unless your column is wide enough. You could note that the font is a little bit different, maybe a little bigger, but it's pretty subtle. You could see somebody just typing along, 2020, 2021, 2748, and it doesn't really complain. That's part of the silent failure of Excel, unfortunately. I'm going to swap back to my PowerPoint. Another thing is to be consistent with subject IDs; they should be clearly assigned to each participant. Each study participant should have a unique ID, usually numeric, particular to your study. It must be unique: it cannot be their name, and it should not use, or even be derived from, a medical record number, birth date, name, or any other item of protected health information. Ideally you would assign a unique number at the first contact and keep it through screening and enrollment. You may choose to use a unique prefix for each study site to allow ID assignment across sites: 15-0041 would be the 41st participant at site 15. Or you can keep site and participant numbers as separate variables if that's easier. You should also be really consistent with files and file names, and make sure you use a consistent file structure. Be sure that if multiple files are used, for example a different data entry file at each site, they have exactly the same data structure: exactly the same variable names, in the same order, with the same categorical drop-downs and data validation rules. Ideally, set up one data template, test it in a pilot at multiple sites, then finalize it and make sure every site is using the same template, locked so it can't be edited. This is a lot easier to do with a single, unified REDCap database across sites, but if you're stuck using Excel, this is what you want to do.
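Going back to the site-prefixed participant IDs for a moment: here is a small, hypothetical R sketch (make_id is my name for it, not the workshop's) of building IDs like 15-0041 so the zero padding stays consistent across sites.

```r
# Build "site-participant" IDs with fixed widths so they sort and
# match consistently: site 15, participant 41 becomes "15-0041".
make_id <- function(site, participant) {
  sprintf("%02d-%04d", site, participant)
}

make_id(15, 41)   # "15-0041"
make_id(3, 7)     # "03-0007"
```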
You also want to check and validate the data from each site and file frequently, especially early on, and make sure no one is inadvertently editing the data entry Excel file itself. It's important to use consistent file names. If you have multiple files with different data types from different sites, decide what key naming chunks you need, and in what order, to identify the data, and use the same name chunks in the same order, with underscore separators. So if your pattern is study, site, data type, .csv, one file might be the blood pressure study's labs data from Boston, and another the blood pressure study's vitals data from Chicago, and you can tell from looking at the chunks what's in each file, where it's from, and which study it belongs to. Generally you separate the chunks with underscores and separate words within a contiguous chunk, like "bp study" becoming bp-study, with dashes: just a standard format. Be careful with text strings, especially gene names. Excel tries to be helpful and guess the data type in each column: any column that looks a bit like a date gets turned into a date, and any column with text in it gets turned into a text column, which is why you should never add text notes to the data, especially in numeric columns. You can, and should, set the format of a column in Excel: same idea, select the column, Command-1, Format Cells as text. That turns off the helpfulness for that column and just pushes everything to text; you can sort out the details later. It's important to include a codebook, a detailed description of every variable in your data set, on a separate sheet in your Excel workbook labeled "codebook". We'll show you an example in a workbook called Organized, one of your downloads, which has columns for each variable: its definition, how it was collected, the data type, the units if applicable, and the allowed values (for categorical variables) or the range (for numeric ones). Do no calculations in your data sheet: just the data, just the facts. One sheet in the Excel workbook should be labeled "data" and should have only raw, tidy data in a rectangle, with no calculations and no formulas. Formulas can be altered, typed over, and deleted silently; let your statistician or statistics program do the calculations in a reproducible way. It's also helpful to record raw, granular data, within reason: record the raw data in the most granular form that's reasonable. Rather than recording age, which is a calculation and a fairly rough estimate, record the date of the visit and the birth date; you can calculate the age very exactly later. Record the date of diagnosis rather than years of disease; again, you can calculate that later. You don't want any implicit calculations; get the source data, as granular as is reasonable. It's important to use data validation to protect yourself from data entry errors; we talked about categorical data validation earlier with the lists. Data validation is a set of rules for each variable, or column, that restricts the data type and the range of allowed values. Setting up data validation for each variable dramatically reduces errors: the data entry error rate is 3 to 5% per cell without data validation, and less than 1% with it. Data validation is not the default in Excel, but it is available. Only double data entry with data validation from source documents is better. Double data entry means you're paying two distinct people to enter the data from your source documents and then cross-checking the entries against each other; that is definitely better, but more expensive and more time-consuming.
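Once files like these reach R, you can both defeat the type guessing and re-run your validation rules on import. A hedged sketch, assuming the readxl package and hypothetical file, sheet, and column names modeled on the exercises:

```r
library(readxl)

# Declare every column's type up front instead of letting anything
# guess, and treat "NA" cells as missing.
bp <- read_excel(
  "bp-study_boston_vitals.xlsx",   # hypothetical file name
  sheet     = "data",
  col_types = c("text", "text", "date", "numeric"),  # subject_id, study_site, visit_date, hr_bpm
  na        = "NA"
)

# Re-check a plausible range rule for heart rate (20 to 220 bpm).
out_of_range <- bp[!is.na(bp$hr_bpm) & (bp$hr_bpm < 20 | bp$hr_bpm > 220), ]

# Re-check the allowed categories from the cat_var_lists sheet.
allowed_sites <- c("Ann Arbor", "Boston", "Chicago", "Detroit")
bad_sites     <- bp[!bp$study_site %in% allowed_sites, ]
```

Checks like these don't replace validation at entry; they're a second net for files that arrive from sites without it.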
So how do you set up data validation in Excel? For each column, decide on the data type (numeric, character, categorical, or date) and set up restrictions. If it's numeric, decide on the allowed, reasonable minimum and maximum values; if it's categorical, make a list of the allowed values. For categorical lists, I talked about setting up a sheet in your Excel workbook named cat_var_lists, just for your lists of allowed values, one column per categorical variable; and as Caitlin pointed out, you want just the allowed values, not the variable name, because otherwise the name would become an option. So let's do this quickly for a numeric variable, with the Exercise 2 .xlsx file. Let's open that up; I'm going to jump over to it, unshare, and then re-share. Okay. For this one we've got a heart rate column that we haven't filled yet, labeled hr. Select the entire column and do Command-1 on the Mac or Ctrl-1 on Windows, and we're going to format it as numeric with no decimals, so reduce the decimal places with the little arrow key, and then OK. Now it's numeric; that's good, you can't enter text or something goofy. But we also want a little more control over the data, so no one enters, you know, 4000. So again, we select Data, then Data Validation, and we're going to allow not "any value" but whole numbers between, and just to be reasonable, let's say 20 and 220. I'll just pick those as reasonable, albeit pretty low and pretty high; we're not going to catch every data entry error near the limits, but now, if you enter 21 it's fine, 219 it's fine, but if you try to enter 19 it won't let you, and 240 is not allowed, while 215 is. So you can set reasonable limits to prevent the inadvertent sticky key or crazy number that would throw off your data set. You have to set the limits reasonably wide so you're not constantly generating queries, but I think you have a pretty good sense, for most of your variables, of what's reasonable, and you can change it if it becomes a problem; if you're studying severe hypertension you might need a higher upper limit. Any questions about that? Feel free to put those in the chat. So numeric validation is a little more straightforward, but it's the same pathway: select the column, Data, then Data Validation, and then what's allowed. Okay, I'm going to head back to the PowerPoint. It's also important to protect your raw data. You should treat it like gold: you spent enormous amounts of time, effort, and probably money to collect it. Always have backups, both on site and off site. Never modify or overwrite your raw data; make copies and modify those, and save the result as a new data file. Your files should be a progression, from study_raw_data to study_clean_data to study_analyzed_data, all the way through. You should always be able to retrace your entire audit trail, from the original raw data through all processing steps to the final products (figures, tables, manuscript, and so on), or your research is not reproducible and is not very trustworthy. So, putting it all together: taking these recommendations and showing you an organized Excel workbook. You can open the Excel file titled "organized.xlsx" and just go through it; I'm going to go through it on the slides, but you can follow along in the example. It's really helpful to use Excel as a workbook, with multiple tabs, and typically at least five tabs or more. It's tempting to just open up an Excel spreadsheet and start entering data, but it's better to start with a plan.
You need worksheets for: your data, that clean vanilla rectangle; your codebook, describing each variable; your overall notes, so notes on the overall data collection and what the study is about; notes on individual values; and your data validation lists. And if you want to, you can do traditional Excel stuff to make a pretty, formatted version for printing or sharing with others, but that should be in a separate tab. Your data tab should be plain vanilla: no colors, bolding, comments, or notes, just a rectangle of data with a single header row of variable names and no empty cells unless they're filled with NA. Your codebook should be very clear: for every variable, its definition, how it's collected, the data type, and the units. You should have one sheet that's just about the overall data collection: what the project was, the dates, who was involved; who, what, when, where, why, and how the data was collected. Then a worksheet where you can add notes, or metadata, about individual data points. One way to do this is a note on the actual problematic cell: if you right-click on a problematic value, you get a drop-down that says New Note, and you can type in an explanatory note about the data collection. It leaves a little colored rectangle, and if you hover over the cell the note appears, so you can see the note, but it won't interfere with reading the data into a statistical program; it's not part of the cell. Don't put this information into the actual cells, but you can put it in these notes; use NA for missing values. That's a workaround that's pretty practical and fairly easy to do. A slightly more structured way to do this, on sheet 4b, is to describe every troublesome data point: the variable, say the SBP for subject seven at visit seven, and then your notes about what happened and what's wrong: platelets clumped, hemolysis of the sample, you name it. And again, don't put this information in the data sheet. A fancier version, and this is sort of the Cadillac, is to copy your structure from the data worksheet to a new tab, keep the identifying header row, reformat all the columns as text, and leave the cells blank unless there's a problem, then fill them in when there is one: oh, this blood pressure was taken standing when it was supposed to be sitting. This allows you to identify recurring issues with particular variables, or recurring problems with particular subjects. Your fifth sheet should be your data validation lists; here are a couple more examples for categorical variables, with ethnicity and race standards. And your optional sixth sheet is where you can go wild: format it, color it, for showing to other people, not for data purposes. Make an extra tab and do that there; just don't pollute your other tabs. So, to summarize best practices: Set up clean and tidy data rectangles, where missing equals NA, with one observation per row, one variable per column, and one unit of data per cell. Use consistent variable names, with no spaces, in snake case, including units. Do not encode data with color or in headers. Be consistent with categorical variable values; use drop-down lists. Be consistent with dates; use the ISO 8601 standard. Every participant gets a consistent, unique participant ID. Be consistent with files and file names: no spaces, limited punctuation. Take control of the formatting of categorical variables with drop-downs. Include a codebook. Include notes on the overall study, and notes on individual values and measurements gone wrong. Do no calculations, including implicit ones, and collect granular data. Use data validation to prevent errors.
Protect your raw data: always keep backups, never overwrite your original raw data, and document any and all changes in your data to create a reproducible audit trail. And use at least five tabs in your Excel workbook, following this model. Now, there are even better research practices: making your research data more secure and your analysis more reproducible, with REDCap for validated, secure data entry, and R for data analysis, visualization, and reproducible reporting, often using R Markdown. If you cannot access your data and reproduce your outputs, whether tables, figures, or the whole manuscript, three to ten years later, your research is not reproducible and is not trustworthy. And if you're in medical research for the long term, you should invest in yourself: invest the time into learning REDCap and R. Further resources: this talk was inspired by two important free papers, both worth reading; the links are here, they're in the downloadable PDF, and I'll drop them in the chat as well. What are the next steps? Feel free to ask questions in the chat. Upcoming in this webinar series: REDCap for a single-site study, and REDCap for multi-site collaboration. And if you want to jump into REDCap, there are great resources online, including video lessons. A couple of questions. What's the difference between a note and a comment? You can attach a note to a specific cell without it being linked to the data in any way, so you can still cleanly read in the data, and Richard actually pretty much answered that in the chat. The PowerPoint file is in the Google Drive, but I know Raymond had trouble with that, so I can email it to you. And yes, an existing spreadsheet can be imported into REDCap if it has appropriate header names and is neat and tidy. Okay, I'm going to jump out; feel free to ask more questions, but I'm going to turn it over to Amanda for the next step, the introduction to REDCap. Thanks. So what I'm going to do in this time period is give you a quick overview of what REDCap can do, and it will be very quick, because Peter did a great rundown of some of what REDCap's capable of and some of the differences between it and R. Then I'm going to take you into a REDCap project itself and let you see how a database gets set up, walk you through ways of collecting your data from your participants, and then show some of the checks that REDCap has built in and some of the export abilities it gives you. So, going ahead and getting started: let me share my screen and actually switch to the PowerPoint to get started here. This is just a quick overview of what REDCap is, for people who haven't worked with it before. REDCap stands for Research Electronic Data Capture. It's developed by Vanderbilt University, and what it really is is a HIPAA-compliant, secure tool for medical research data collection. It collects data through online data entry, surveys, file imports, mobile data collection, and API data imports, and it's a very versatile tool with a very customizable project design. You can use it for e-consent; depending on your institution, you can also use it for Part 11 compliant e-consent, for projects that have that extra standard to reach. Not every institution will have gone through the validation process of making their e-consent Part 11 validated, but REDCap itself is capable of it. You can send out surveys via email or SMS for data collection. It supports multi-site projects, longitudinal projects, piping, branching logic, and default values. And you can create reports, see basic statistics, and export for CSV, Stata, R, SAS, or SPSS.
Starting in version 11, you can also have project dashboards; they're fairly capable, though not nearly as versatile as what you'll do in R, but a good way to get a quick display of your data. And to help protect your data, REDCap projects have multiple phases. Development is for creating, editing, and building your database and forms, and testing everything out. This is where you're going to experiment, where you're going to break things until you find out what works, and it's all before any real data goes into your project. Once you are ready to start collecting data, you move into production. In production, most institutions will have the project locked, so that when you make changes, you're making them in draft mode, a little pocket that won't affect your regular database, and you can see what changes you've made and exactly how they're going to affect your database. At most institutions, if there's any chance of those changes affecting the actual data itself (you've deleted something, or you've changed the coding on a multiple choice field), the change will go through an administrator, who can verify with you that you're aware of all those changes and know what impact they'll have on your data, and that you're okay with that, or who can work with you to prevent them from harming your data. So there's an extra check to keep you from accidentally changing something in a way that inadvertently changes your data. Once data collection is done, there's analysis and cleanup mode: data collection is stopped, and you can choose whether the existing data can still be edited, but this is really intended to be a frozen state; it can be returned to production, and you can still export your data. And when the project's complete, you can remove it from your list of project databases so that it's inaccessible; it's fully archived at this point, but an administrator on your system would be able to return it to production. I'm not going to go through all of this, because we just covered it, but REDCap is a HIPAA-compliant system. The data is encrypted in transmission; whether it's encrypted at rest depends on the institution. There's role-based security and a complete audit trail. It creates a codebook for you as you go. It is a flat database, not a relational one, and we've yet to really run up against a file size limitation. You can import your data through a CSV file or the API, you can export the data into CSV, SPSS, Stata, SAS, or R, and you can use the API to export the data as well. And it is remote-accessible: it's entirely a web-based platform, but there's also a mobile app to allow offline data entry, for people collecting data in areas where they may not have a reliable Wi-Fi connection, and an application called MyCap as well, for a more participant-focused way of managing your project. So, you've got a project, and you know you want to do it in a secure fashion. The first step is to figure out whether you have REDCap available to you. The easiest place to check is the Project REDCap partners page: you can just enter your organization, and it will pull up whether you have REDCap and who your administrators are, so you can reach out directly to them. If nothing comes up there, most REDCap installations are run by some kind of CTSI department or by your IT department, so you can always contact them. It is available internationally, and you can see (this is an old map; REDCap is in many more places at this point) that pretty much wherever you are,
someone probably has a server. And if you're someplace where you find out you don't have REDCap, that doesn't mean you can't use it. If you reach out to the Vanderbilt team, they host REDCap projects for a monthly fee and can help you work with that, and there is also REDCap Cloud, also available for a regular fee depending on the project. Those groups would be happy to work out a hosting mechanism with you for your secure data hosting needs. REDCap training varies by institution, so when you reach out to your local administrator to find out how you get access to REDCap, they'll let you know about any training requirements for your specific institution. Inside of REDCap, Vanderbilt has provided videos on how to use a bunch of the specific features, there's a very detailed FAQ that's kept pretty up to date, and there's always a "contact the administrator" link. And here, for a bit of a personal plug: at the University of Colorado we have a series of really detailed training videos as well; we go through all the basic features of creating a project, and then we have, I think, 15 videos going in depth on key features, like using REDCap for e-consent, using surveys in some different ways, and importing and exporting data. I also recommend the University of Washington training series as a great curriculum you can check out; it's available to anybody. And the Cincinnati REDCap resource center, if you're not looking for videos, has another great list of REDCap resources that you can check out if you don't have formal training or if you want to go beyond what your institution offers. And for questions and follow-up, here is my email, so people can contact me after the presentation. That's the PowerPoint; let's actually get into some REDCap here. Come on... there we go. So this is the heart of a REDCap project. We're going to come in and set things up. It would take me more time than I have today to sit down and create an entire project with you completely from scratch, but I wanted to give a quick example so you can see the different ways that REDCap handles field types. Every field in REDCap is validated as a specific type. There are text fields. There are fields designed to do calculations; we recommend really only doing pretty basic calculations in REDCap for the most part (unless you need it for your project, wait and do the calculations on the back end with a specialized program), but REDCap can handle some fairly complex calculations at this point. There are multiple choice questions, in drop-down and radio format for your pick-one, and checkboxes for your pick-all-that-apply; some hard-coded yes/no and true/false fields, which are coded with 1 for yes and 0 for no; e-signatures; file upload fields, for the person filling out the form to upload things; sliders; and descriptive text, where you can embed images, videos, audio, or file attachments. Every REDCap field has a field label, which is what you see when you're doing data entry, and a variable name, which is how the data is actually stored. The variable name is the important one; it's how everything is linked (REDCap has a MySQL back end, and this is how everything is stored). On text fields, you can then choose your validation type: you can say only accept dates, datetimes, emails, or numbers, and you can specify your decimal points. This validation is hard, so it will completely reject any data that is not in that format. You can also, on date and number fields, include an expected range,
with minimum and maximum dates or numbers. These are soft validations, so it will give you a warning if you're outside of that range, but it will accept the value, to allow for an outlier. You can make a field required, so that people have to answer it to move on. You can mark a field as an identifier, to make it easier to strip from the data. And you can add field notes, which are small reminder texts underneath the field; these are great places for specifying units and other information like that. All of this will automatically be put into the codebook for you as you're creating it. On the multiple choice questions, you'll list your choices out one per line, and each one has a code (I've got 1, 2, 3 here) that, like the variable name, is how things are actually stored in the back end. Once you have put in your coded answer choices, you never want to change them; you want to keep them the same. And to keep yourself consistent throughout the project, you can copy answer choices you've used someplace else. You can see we have slider fields, I've embedded videos, I can use matrices, I can upload a file or add a signature. I can also use branching logic in REDCap; branching logic lives in the child question, so this question will only appear if the person selects yes to optional procedures. And I can use piping: if I want the person's name, as entered in the first and last name fields, to show up later, I can pipe that information into a later field, into a survey invitation, into an alert, pretty much anywhere in the project. This form is also set up with REDCap's e-consent module, a Part 11 compliant framework where you have version control and you select your name fields; I will show you in just a moment how the data entry works, and it will capture a signed PDF and store it in the project along with the regular data. Now, if I want to move a field or a form between REDCap projects, I can download it as a zip and upload it here. And if I'm using a common validated instrument, I can actually try to import it from the REDCap library as well. The search function in the library is not perfect, but I can search for an instrument, and anything with this red star next to it is a validated instrument that they have gotten copyright permission to include, and it has gone through a series of checks by the REDCap library committee to make sure that it has been properly translated into REDCap from its paper format, or however it was developed. I can just click on it and import it into my REDCap project. There we go; it's always a little bit slow when I'm screen sharing. That way I know I have that validated instrument in there, it is in there correctly, and I do not have to go through the hassle of setting it up myself: I've got it exactly the same way anyone else using REDCap has it, I know I have copyright permission, and I know it's been checked. A great way to use standardized forms. So what does this look like for data entry purposes? I have all of this set up as a survey, so we can go through it. You can see a bit of branching logic here. I'm going to have to provide values for all of this: I provide consent, provide my names in the required fields, and you can see how it's piped this information to display elsewhere. If I provide a value outside the range in the field I had specified, I get warned; let's go with that for now. Email address: this is also validated. And then here we're going to see the
It shows me an inline image of the PDF with all that information; it can be downloaded for the participant's records, and they have to electronically sign and submit. If I go to the file repository in REDCap, it has stored the completed PDF there for me, and then it takes me to the next form. This is the field I had validated, so it gives me an error message for being outside the range — "I don't know"? — I need to go back and double-check that. A basic calculation; dropdowns; radio buttons; checkboxes — this one has a "none of the above" feature, so if I click "I don't know" it automatically clears my other options; my slider; videos. So data entry is really very simple and allows a lot of different kinds of data to be stored.

When I go into REDCap, I can see my records — everything's been filled out — and I can see all the data that was entered. I also have the history here, so I can see when this was filled out, who filled it out, and what their answer choice was; and if I go through and change that, it will show every change, every single time. I can also use the field comment log: if there's a potential issue with this data, or I want to make a comment on it, I can log it here; it's visible in the data entry form to anyone who looks at it, and I can come to the field comment log and see all the comments that have been made. There's another version of this that even opens official queries, which have to go through a full resolution process to be closed out.

If I want to get the PHQ-9 out to people on a regular basis, I can also set up automated invitations. This project is longitudinal, meaning I define my events — the different visits — and then say which forms I want to use on which events. So if I add my PHQ-9 to my events and go back to the Online Designer, I can set up automated invitations to go out after the consent is complete, and then send seven days after that event has been triggered. I can put in reminders — send every day, up to three times — and activate it. And I can do that for all of my different events (if I can stop clicking on other buttons) and base them all off the same trigger, or different ones.

So that's an example of getting data into REDCap, and I showed you a few of the different checks going on there to make sure the data is good: I validated my fields, I put soft validations on them, and I used some special features called action tags — you can find the entire list in many places in REDCap, but here's an example using things like character limits, default values, and hiding fields — to ensure the best data quality in each individual field.

Some other things you can do to ensure high quality data in REDCap are data quality rules. REDCap sets up basic ones for you — search for blank values on required fields, look for validation errors or outliers, update all your calculated fields — and you can also add your own. For this project I only want people who are over 18, so I set up a rule on that age field: if it is less than 18, execute in real time — as I'm doing data entry, throw up a pop-up and let me know this is a violation of the rules I've set up in the project. And I can set up as many of these as I want, to constantly monitor my data set and make sure the information I'm putting in is as clean as possible.
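To make those features concrete, here are hedged examples of the REDCap syntax involved — branching logic, piping, action tags, and a data quality rule. The field names (yes_no_procedures, first_name, last_name, age) are invented for illustration; your project's variable names will differ.

```
Branching logic (entered in the child question's settings):
  show this field only when the parent is answered "Yes" (coded 1)
    [yes_no_procedures] = '1'

Piping a stored value into a later label, survey invitation, or alert:
    Thank you, [first_name] [last_name], for participating.

Action tags (added in a field's Action Tags / Field Annotation box):
    @CHARLIMIT=500   @DEFAULT="Unknown"   @HIDDEN-SURVEY

A data quality rule with real-time execution, like the one in the demo:
    [age] < 18
```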
You can also set up missing data codes. A consistent problem with data sets is people just leaving things blank when something's uncertain or there is no data. Missing data codes are a series of codes you specify in one place, where each value means a different type of missing data, so you know exactly why there is no real data in a field. You can define them however you want for the project, with whatever codes you want, but the codes will be the same throughout the project, so you don't have to worry about forgetting what you used in one place and using a different set of codes someplace else. Honestly, even if the codes you're using are not fabulous, as long as you're consistent with them it will be so much easier to work with them on the back end, for you or for an analyst. The number one rule of creating a good database is just to be consistent in how you handle issues throughout.

We also talked about codebooks in terms of reproducible data, and REDCap automatically creates a codebook for you — a human-readable version of the data dictionary. For every item I get the variable name, the field label we saw during data entry, the field type, and any extra attributes: these fields are read-only; all the coding for my multiple choice questions is available; it marks things that are required and things that have been flagged as identifiers. Here I've hidden the "Today" button so people actually have to enter the date — they can't just hit Today. For text fields that have been validated a certain way, you can see how, and it pulls out calculations. So everything you would need to recreate this database from scratch is written in this codebook, which is automatically created for you and can be printed as a PDF — REDCap will do that for you. You can also create a snapshot of your database: this is the data dictionary itself — I took a snapshot of it right then, and REDCap stores that for me in the project revision history. Once I move this to production, every time I make a change to the data dictionary REDCap stores the previous version, so I have full version control of every change made to my project.

When it comes time to get your data out of REDCap, your first decision is whether you want all your data or just a subset. If you just want a subset, you'll create a report: choose who has access, choose what fields to include — let's grab some names, consent, and their feelings about the cat. If I need to filter some of this out, I can — I only want rows where age is (come on... there we go) greater than or equal to 20 — and I could select by event. But all this is really doing is giving me a subset of my data. REDCap is an EAV (entity-attribute-value) database under the hood, but you're always going to get this returned as a flat data set. It runs some very basic statistics for you, but honestly, REDCap isn't what you use if you want to play with your data and see what the stats are saying; that's when you move to another program. So I would choose to export my data: here I can choose a CSV with the raw data — the variable names and coded values for multiple choice questions — or the labels. I can also export for any of the major stats programs, and it will give me the data set plus a series of files to run that format and upload the data set directly into the program, so that none of that has to be done by hand.
You also have de-identification options for the security of your data: I can choose to remove all tagged identifier fields, and I can hash the record ID field. In REDCap we always recommend that your record ID not be an identifier — identifiers make terrible record IDs; you want your record ID to be a unique, random value. Making your record ID an identifier means every single person who accesses your project sees that identifier, whether they should or not. Also, if you look in the URL, you can see information being stored there: pid is my project ID, then my report ID, and if I were in a record, the record ID would also be part of the URL. That means on any system, really, I now have all my record IDs — someone's MRN, say — being saved in my browser history, and that's just not something you want to deal with, especially with the way browsers interconnect. You might be on a secured computer right now, but if that browser login is used someplace else on a non-secured machine, it's going to pull up the same history. It can also create a really awkward issue if you're ever screen sharing.

You can choose to remove unvalidated text fields — the notes and essay boxes — because any unvalidated text field has the potential to contain PHI. You can remove your date and date-time fields: a constant thing to remember is that any date more granular than the year that relates to a patient is considered an identifier for that patient. And you can date-shift, so you can still compute date differences, but the dates don't tie back as easily to a calendar. If comma-separated values aren't your default — this comes into play the most internationally — you can choose the delimiter for your export, and you can also choose your decimal format if you're not using the comma-for-thousands, period-for-decimals convention that's so common in the US. Doing a quick export, you can see REDCap really produces a very basic flat data set — very computer-readable and easy to use: no color coding, no flags, one thing in every column, a very straightforward CSV file. You can also export REDCap record data via the API, and I'll let the next presentation deal with that.

Two other features I want to demonstrate in the last couple of minutes. One is the logging feature: we talk about reproducibility, and you want to know that every single thing that has ever been done in this project is logged. I know the when, the who, and the what, and I can see the very specific actions — every piece of data I enter. If I completely delete a record, the original data entry is still there; I can still go back and find it. Everything I do in this project is being logged so I can do a full audit of the actions, even down to the page view level: I can go through and see who accessed which page in this project at what time. And the last thing I want to show you is project dashboards. This is only available if your REDCap is version 11 or higher, but it's REDCap moving toward showing a little more in the way of statistics: the ability to pull bar charts, pie charts, donut charts, and stats tables using a series of smart variables, with a wizard to help walk you through it.
You can see the smart variables here. Smart variables in REDCap are things that change with context, so the username smart variable always displays the current user's username, and if I scroll down, these aggregate sums constantly update with my data — my charts, my tables. The nice thing about these dashboards is that you can export them as a PDF to show off very easily, and I can also make them public. This depends a little on your REDCap — not everybody allows it, and you have to have a certain minimum amount of data before you can do it — but you can create a custom link that is unsearchable on Google and pass it off, kind of like a link to a Google Sheet, so people can come and see this very basic report you've put together on your REDCap project. You can show a quick summary of the data without having done the detailed analysis — a quick update to keep people in the loop without doing full detailed reports.

So that was me presenting a lot of information really quickly. — There are questions in the chat, Amanda, and they also asked if you could drop a few of your links in the chat; that would be really cool. I also wanted to mention, for folks who haven't seen, that our poster session is tomorrow; I dropped the link in the chat, and a lot of the posters are already up, so you can explore them in SpatialChat if you'd like. Next I'm going to turn it over to Will Beasley, who's going to present on, now that you've got all your fabulous data in REDCap, how you reproducibly get it out and pipe it into your statistical analysis in R.

— Hi, thank you. First I wanted to make sure everyone knew Amanda was underselling her videos: she's won a bunch of awards at the national conference, including for her videos last year. All right, so the REDCap API: it facilitates repetitive tasks. If you have a data set that you know you're only going to need to download and analyze once, go ahead and do what Amanda showed — download that CSV and run with it. But at least in my life that never happens; that single shot just doesn't exist. The patient data set, hopefully, is updated every day with relevant fresh information, or you're running on a different computer a few months later, or — and this is the core of reproducibility — you want someone else to be able to unambiguously and effortlessly produce the same results as you. So instead of telling your colleagues or students "first go to REDCap, then go to project 55, then click Data Exports, then click Export Data, then download these files and move these files to the subdirectory in the agreed format" — say it with a few lines of code, say it with something like this, and your life will be easier. And for people running the code on their machines, especially if they're on the same campus, the API is so integrated and responsive they may not even realize they're connecting to the outside world; they may think it's in a package they've already downloaded. For beginners, most of the work is just managing the passwords — the credentials — which is just part of the reality of working with PHI. I'm following a vignette here, at this link; it looks like this, and you can copy the code and plug it into RStudio or any R environment. Okay, so first let's make sure that REDCapR, the package, is installed: if you run this line of code you should get nothing, and if you get an error saying there's no package, install it — this code will install it from CRAN, and this code will install it from GitHub.
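A minimal sketch of that check-and-install step, following the REDCapR vignette he references:

```r
# If this line runs silently, REDCapR is installed; if you get
# "there is no package called 'REDCapR'", install it below.
library(REDCapR)

# Install from CRAN:
install.packages("REDCapR")

# ...or the development version from GitHub:
remotes::install_github("OuhscBbmc/REDCapR")
```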
I'll give you a few seconds to use the link in the chat to pull that up. Okay, before we get to the R side, handle some REDCap tasks: talk with your admin to make sure, one, that you have access to the server to start with; two, that you have access to the specific project; and three, that you have the token. This may be the first time you've been asked for a token — each institution has a different formal process for getting approval to use the API, and your REDCap administrator can help with that. We don't have to worry about those three for the demo today; we'll be using a fake dataset hosted on a demo server, with this token. If you are talking to your REDCap admin, the link I posted has a little note for your admin to nudge them on how to do it a little more quickly.

As Amanda just showed, before you start hitting the REDCap API, go to the codebook — on the project's main page, go into the codebook — and it will have all the variables. If you spend a few minutes doing that, it won't be wasted; in fact, I think most of us recommend you spend more than a few minutes looking at it, and ideally spend something like 30 minutes with the investigator to learn the context of the research, the collection process, and any idiosyncrasies of the dataset. As you're talking it out, I recommend explaining to him or her how you're planning to groom the data, to make sure it's consistent with their understanding. They're probably not going to understand the models you'll run, but they will probably understand how you're planning to, say, convert gender into what you want, or convert some lab value into the analysis variable. Ideally that process starts as early in the investigation's life as possible.

Okay, back to the R part of it: we need to retrieve the token. If you're using a real dataset with PHI, the token should never be accessed from an R file with a line like this — hard-coding it is just not secure enough. You pass the token to the server's URL, and from it the server pretty much instantly knows your password, who you are, and which specific project, so you don't want to make it vulnerable at all. There are two ways we recommend protecting tokens, and we're going to do the easier one today. The recommendation is to have a CSV — REDCapR has a format for this, a CSV with five columns — store all your tokens in that one file, and save that file in a very secure location: not in the GitHub repo, ideally, and ideally somewhere local that's encrypted. Once you've stored your tokens like that, you can retrieve one. In this case we're going to use the fake token file built into the package — so this path is going to be different on your computer — but it's essentially just a string, and you pass that string, plus the specific project you want, to this function. We still haven't talked to REDCap at this point; this just retrieves your protected information, and it returns an object with five elements, the two important ones being the URL and the token.
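A minimal sketch of the credential step, using the fake credential file that ships with REDCapR (the path on your machine will differ, as he says; the project id below is the one named in the talk):

```r
library(REDCapR)

# Path to the demo credential CSV bundled with the package.
# The file has five columns: redcap_uri, username, project_id, token, comment.
path_credential <- system.file("misc/example.credentials", package = "REDCapR")

# Retrieve the row for one specific project:
credential <- retrieve_credential_local(
  path_credential = path_credential,
  project_id      = 53L   # the demo project id mentioned in the talk
)

credential$redcap_uri  # the server URL
credential$token       # the API token (a long hash string)
```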
You specify the 53 because there are three different projects in this token file and you're being really specific: I want this one. If you need to create a credential file, either copy this one and replace the values, or there's a little function here that will start you out. So that's what we're going to do today, because it's the most portable and easiest for everyone. However, there's a method I like a lot better, which is storing the tokens in another database that's accessible only with LDAP: it's easier to maintain, and your institution still manages the password. That's a pretty long process, and it's described in depth in this vignette on how to create that database.

All right, let's start retrieving data. The basics are pretty simple: we retrieve that credential file — just as a reminder, nothing special to it, it's got the URL and the token — and we pass those two pieces of information to the REDCapR function called redcap_read, and it spits these out: it tells you how many records and columns were retrieved, in what format. At this point you can get moving — you have this data set with 25 columns, and you can do all the basics like summarizing and modeling and all the other exciting things that are in the next two days of R/Medicine. So it's kind of tempting just to stop here and say you've accessed the data.

Actually, let me pause for a minute. Give me a sign if you're following along or want some more time: you can click on the Reactions button and use a thumbs up, or give me a crying reaction if you want me to go through that again more slowly. All right, I've got to slow down. So essentially there are two things you need to execute — I'll mimic what you're doing. The first thing is to retrieve the token from the token file, so I'm going to copy this; if you have REDCapR installed — you know what, let me back up: run the code in the chat to install REDCapR — now that REDCapR is installed, you'll have this file built in. We're going to get this file, then pass that path to this function (I just pasted it in the chat), and now that you have the credential retrieved, with the URL and token, pass that information to redcap_read, and let's take a peek at the data. So here we are: you have the full dataset transferred from the REDCap server to your local machine. Give me a thumbs up if you're ready, or a slow-down if you'd like me to slow down.
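For reference, the read step being demonstrated looks roughly like this, reusing the credential retrieved above:

```r
# redcap_read() returns a list of results; the data frame lives in $data.
ds <- REDCapR::redcap_read(
  redcap_uri = credential$redcap_uri,
  token      = credential$token
)$data

head(ds)  # the full demo dataset, now on your local machine
```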
Oh, and I forgot — here's the third snippet, to pass the credentials and get the data from REDCap. — Could you explain a little more in depth what those two functions are doing? path_credential, it looks like, pulls a list, and then we're pulling that with redcap_read. — Sure. path_credential — nothing special here — is just a string that happens to point to the location of the file, and that file comes with the REDCapR package. Not today, but when you start connecting to your own REDCap, you'll need to modify this file so it contains your tokens, not these tokens for the demo site. So path_credential is nothing but a path, and you pass it, along with whatever ID of the project you want, to the retrieve function. — Typically a local administrator will give you an API token, which is this long hash string, and you just have to store it somewhere secure? — Yeah, exactly. And once you have it in a stored location, you just have to tell REDCap "this is my token," and once it's in there you're good to go: you can read your data multiple times a day if you want. We have some automated tasks that run the full read from REDCap every five minutes and populate a dashboard for nurses to monitor. So really, the point of all this was just to get that 32-character hash value; we pass that value, along with the URL, to here. If PHI were not involved, this code would be a lot simpler — you'd just say "here's the token," you might hard-code the URL too, and all of this would go away. That's basically all it is, but because we're taking a few extra steps to protect the token, there's an extra jump. — Great, that makes sense, thank you. — And yes, the tokens could be saved in an Excel file, though I strongly suggest CSVs instead of Excel.

Okay, so we have the data set again — looks like that — and it's kind of tempting just to stop there: you've got everything, now go analyze. However, I strongly suggest you do a few things to be more stingy — stingy about what columns and what rows you pull from the database to your local machine. The first time, I think it makes sense to just grab everything, but as you refine your analysis, I'd start trimming down, for a few reasons. If you're filtering, the server is almost always more efficient than R or Python; if you have a thousand records and you only need 500, the CSV or JSON that REDCap's PHP code assembles is a lot quicker; less is going across your network; and R doesn't have as big of a data set to manage, so it's all a little more responsive. Reason five might be the most compelling to me: we've had REDCap projects with 2,500 variables, and the stats only care about maybe 200 of them — there's no reason to overload your brain with an extra 2,300 variables; keep it simple for your brain. Another compelling one: a lot of things are collected in REDCap to help project management, recruitment, and tracking the participants. You don't need their phone numbers, you don't need their socials — keep those on the server; don't expose them.

So, how to be stingy. The most direct way is to limit the rows. Here you can say, I want — and this isn't row one and row four, it's participant 1 and participant 4 — you put those IDs in a vector and pass it, and now instead of five rows we have just two participants. Realistically, though, you don't know the IDs of the records you want; the more realistic scenario is that you want to limit based on some variable, and here we can pass filtering logic. REDCap's filtering syntax is pretty SQL-like, but read the documentation, because there are some differences. The main one that gets people is that you have to put your variables in square brackets — in a database you'd only do that if the name had a space or some misbehaving character, but here you always have to bracket your variable names. Also, REDCap stores everything as a string, like most EMRs do in their observations table, so pay attention to that sometimes; but you certainly can do inequalities. So let's do this: same data set, same project we're hitting, but now with this filtering we get a different two records — people born in the past 18 years.
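Two hedged sketches of being stingy about rows; `dob` here stands in for the demo project's date-of-birth field:

```r
# 1. Ask for specific participants by record id (ids, not row numbers):
ds_two <- REDCapR::redcap_read(
  redcap_uri = credential$redcap_uri,
  token      = credential$token,
  records    = c(1, 4)
)$data

# 2. More realistically, filter server-side with REDCap's SQL-like logic;
#    variable names always go in square brackets.
ds_recent <- REDCapR::redcap_read(
  redcap_uri   = credential$redcap_uri,
  token        = credential$token,
  filter_logic = "'2003-01-01' < [dob]"   # born in the past 18 years
)$data
```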
Another way to be stingy is to specify the column names, and this is pretty straightforward: just pass the column names in a vector to fields, and now instead of 25 columns we have these three. You know what, I wasn't going to do this, but I'll show another realistic way. If you only have ten or so variables, listing them makes sense; sometimes, though, you'll say "I want the demographics and the health instrument" and you don't want to specify all of them — and even if you don't mind doing it, the instrument may change in the future, and you may want all the variables in those two instruments. So instead of specifying the exact fields, you can specify the forms — forms and instruments are synonyms — so you can do records, fields, forms, or logic. (And no, unfortunately, there are no regexes in the filtering — that'd be nice, though.)

Okay, so that's part four: we started by retrieving everything, kind of as we're learning it, then trimming it down to what we need. Now I encourage you to do an extra step on top of that, especially if you're using it for automation. Here's what's happening: REDCap stores everything as a string; its PHP code assembles it as a JSON or CSV and ships it across the network; REDCap forgets it; and REDCapR passes it to readr — readr, of course, being the tidyverse package that takes big strings and creates data sets. Unless you tell it what to do, readr will make its best guess about what data type each variable should be. That's a good way to start, but if you're doing automation, I don't like to leave it guessing: even if it's guessing right at the time, someone could enter some bad data later, and instead of everything being a number, there are some characters in there — and now R, REDCapR, and readr are all going to behave differently. So be really structured. Now, the real way to do it is what Amanda showed you — use validation, and don't let characters go into numeric fields in the first place — but realistically that's just going to happen sometimes; we can't catch it all. So another layer of defense, to be robust in your automation — or really any of your code — is to specify the structure you want. In this case we're going to say: give us two fields, record_id and race; make record_id an integer, and for these other ones give me booleans. Actually, let me go back one more time: here, before we specify the structure, readr made the very reasonable assumption that these values are numbers — it's kind of stats-oriented rather than database-oriented, maybe — so it gives them the more permissive double-precision floating point type. That may work for age, but you don't want floating point for IDs. So let's lock it down and say I want this ID to be an integer and the rest to be booleans. We do this — that's just a vector of names, and this is the cols object from readr; it looks something like that — and we pass those two things to REDCapR, and notice this is no longer a double-precision floating point: it's a nice tight integer, and these are nice tight booleans.
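Hedged sketches of both steps — stingy columns, then locked-down types. The field and instrument names are taken from the demo project as described in the talk; check your own codebook for yours:

```r
# Specific fields, or whole instruments via `forms`:
ds_slim <- REDCapR::redcap_read(
  redcap_uri = credential$redcap_uri,
  token      = credential$token,
  fields     = c("record_id", "name_first", "age"),
  forms      = "demographics"
)$data

# Tell readr the structure instead of letting it guess.
# `race` is a checkbox, so it comes back as race___1, race___2, ... 0/1 columns.
desired_fields <- c("record_id", "race")
col_types <- readr::cols(
  record_id = readr::col_integer(),   # ids should never be floating point
  .default  = readr::col_logical()    # the remaining 0/1 columns as booleans
)

ds_typed <- REDCapR::redcap_read(
  redcap_uri = credential$redcap_uri,
  token      = credential$token,
  fields     = desired_fields,
  col_types  = col_types
)$data
```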
Every now and then it may make sense not to have readr do it at all — you want full control of the data types. In that case, I suggest something like this: tell readr to just make everything a character by default. If you do that — I'll keep it on the screen to make the comparison easier — notice it doesn't automatically convert the zeros and ones: they're stored as a "0" string and a "1" string in REDCap, and readr (via REDCapR) would normally convert them, which is my preference, but sometimes you just want the straight-up string. One of the times that comes up frequently is with dates: if you forget to put date validation on, then — as Peter talked about with Excel — you're going to get a bunch of different formats, and you'll have to use different algorithms if some rows went day-month-year and others month-day-year. In that case you'll want to pull it in as a string — or pull everything in as strings — and parse it yourself.

Let me stop and look at the questions. Yeah — Ramon — I wasn't going to get into longitudinal designs, but I will. If you can see my screen now (is that big enough?): here's one record, one row per person; a longitudinal design, as Ramon apparently knows, has multiple rows per person — one row per event — and what you get is kind of a sparse matrix, especially if not every variable is collected for every person at every event. There are a few ways you can handle that. Someone mentioned tidyREDCap in the chat; that's one approach. Another approach I like is to call REDCapR multiple times, once per instrument, so it's already kind of normalized into different data frames and I have more control over how I piece it together. And yes, it will null things out: it comes from REDCap as zero-length characters, and depending on which readr options you have — with the defaults — those zero-length characters are converted to NAs in R, so it looks like this block matrix that's kind of sparse.

Caitlin S asks: is there a way to get the label instead of the code? Okay, I'll dig into this for a second. Here is the codebook, and notice the way race is defined: it's a checkbox — multiple answers, so a mixed-race person would have multiple selections — and the designer called the variable race, so REDCap adds three underscores and then the choice ID, and it comes out like this. But as Caitlin points out — sorry, for something like ethnicity, I guess — is there a way to get the data dictionary value, the string "Hispanic or Latino" / "not Hispanic or Latino," instead of zeros and ones? There is. (I'm looking at it now — when I wrote my example I didn't realize race was coded the way you had it coded; sorry about that.) If it's labeled like that: by default the export is raw, and the other option is label, and that should produce these values instead of 0, 1, 2.

Kenny, I was going to have ten minutes of your time, is that right, or do I need to wrap up? — That's what we decided last month. — Cool. Okay, so: we transferred everything over; in part four we selected it down to only what we needed; and then in part five we added structure. If you did that, you've had a great day — go forth and analyze. I would do that a few times.
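Hedged sketches of those two options — everything-as-character, and labels instead of raw codes:

```r
# 1. Full control: import everything as character and parse the strings
#    yourself (handy for unvalidated dates in mixed formats).
ds_chr <- REDCapR::redcap_read(
  redcap_uri = credential$redcap_uri,
  token      = credential$token,
  col_types  = readr::cols(.default = readr::col_character())
)$data

# 2. Return the data-dictionary labels instead of the raw codes,
#    e.g. "Hispanic or Latino" instead of 1:
ds_lab <- REDCapR::redcap_read(
  redcap_uri   = credential$redcap_uri,
  token        = credential$token,
  raw_or_label = "label"
)$data
```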
And once you're comfortable with parts one through five, come back to this part six and think about whether any of these next steps make sense for you. Dig into the REDCapR resources: we've looked a few times at the documentation — I love the pkgdown package, which creates this great website with no extra effort once you've written the documentation — and these vignettes exist for a bunch of different scenarios. I think my favorite is the troubleshooting one; the first third of it isn't even related to REDCapR, or maybe even to R. Many people, not just me, have added to this troubleshooting guide, and the first third is just: can you connect — forgetting R, forgetting REDCapR — can you connect at all? Over some years, with lots of feedback from people using it, this guide can identify where things are breaking in the pipeline pretty quickly. If you find something new, please email me or send a pull request.

Another thing: the code we've described — something like this — makes sense for a small project where you're about to model it. But we've had a five-year longitudinal project with a few thousand people, and I think we've had ten statisticians over its lifetime, three or four at any one time; you can imagine there's a lot of R code analyzing it from different angles, and almost all of it needs race structured well and records structured well. So with thousands of variables, instead of copying and pasting this over and over and over again, think about separating out the analysis part — put that in an R file, in as many R files as you need — but pull the structuring code into its own file, do it once really well, and have all those analysis files call it. In the software-patterns world that's called a gateway; this is a one-way gateway that we call an arch, and we've described it in a few places. Don't start out with this — do things the simple way a few times — but if you find yourself copying and pasting that kind of access-and-structure code a bunch of times, do yourself a favor and pull it out into a single file that's called repeatedly.

All right, and now that we have this data set, the whole R world is open to you. For reproducibility, I'm really excited about Quarto, which is, I think, more or less a successor to R Markdown — that's coming up tomorrow in R/Medicine, and last year's presentations had a lot of good stuff — and of course any book by the developers of knitr and R Markdown is worth reading.

Okay, all right: batching. I'm really conservative here. When you read from REDCap using this function, a lot of times there will be a few thousand records, and under the covers it's calling REDCap repeatedly, just grabbing 100 patients at a time; in my experience that's just the most robust way of doing it, so there's no timeout. If it's working for you, you can increase that batch size — I try to aim for about five seconds per batch — and then REDCapR, behind the covers, takes all those little batches of sub-data-sets, stacks them together, and converts them. It won't affect your results at all; messing with the batch size may just affect performance.
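A hedged sketch of tuning the batching; the default already reads in batches under the covers and stacks the pieces back together:

```r
ds_all <- REDCapR::redcap_read(
  redcap_uri = credential$redcap_uri,
  token      = credential$token,
  batch_size = 100   # conservative; raise it until a batch takes ~5 seconds
)$data
```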
So far, everything we've talked about is consuming REDCap, and that's mostly what we do as statisticians, but there are a lot of times when you want to write to the server. Writing and reading are only two of at least 20 API functions; writing has its own things to watch for, and it's a big part of Kenny's talk in a second, after a five-minute break. And let's see — if you're doing R and thinking about hitting REDCap, you may be tempted to just use RCurl or httr, the tidyverse package. I suggest not — and I'm biased, because I'm one of the main developers of REDCapR — but while you could probably do it in five or six lines with one of those, REDCapR has a lot more to handle different situations, a lot more validation, and tests for everything. And if REDCap adds a new feature or changes something, you don't have to change anything; you just hit "update packages." That's a huge advantage — I'm going to pat ourselves on the back as the R world here — that's a huge advantage we have in R over something like SAS with macros. I did a little search just a second ago in my group's code repository: there are at least 124 places where we read from REDCap — and that's not counting the arches, which have a lot more — and another 60 where we write. I don't want to have to figure out which versions are going to break, go update and change all of those, and make sure nothing else broke downstream. Stick with the package, and if you see something that's not working, please tell one of us in a GitHub issue or something.

And finally: Amanda showed you the flat file, and the data absolutely can be represented as a two-dimensional flat file, but I didn't want it to be undersold how much hierarchy REDCap can do. It can't do hierarchy the way a traditional relational database can, where you can have many, many tables and arbitrary foreign keys — it can't do that — but REDCap hits the sweet spot of patient research really well. It's really common for us to create, say, a little batch of an investigator's data coming from the EMR: one record per patient, where each patient can have multiple events — hospitalizations, say — and each hospitalization can have multiple blood pressures, multiple labs. So it does have some capability for hierarchy that fits most patient-centered research really well, in my experience. Okay, any questions for now? Oh, and Judy mentioned — great, yeah, that looks cool — she's got something that hooks up Shiny and REDCap, tomorrow... no, in two days, two afternoons from now. Thank you.

— All right, thank you, Peter. Okay, I think at this point we're all going to take about a five-minute break — stretch, get a snack, use the bathroom, whatever — and reconvene at 14 minutes after the hour. Then we'll step it up to multi-center studies using the collaborator package, which Kenny is going to present. We'll see you back at 1:14, or 14 minutes past the hour in whatever time zone you may be in.

Okay, I think everyone's had a break, and it's time to scale up: we started low-level with Excel, moved on to REDCap and how to get data out of REDCap, and now Kenny is going to talk about taking on a large multi-center study — bringing all that data together and managing REDCap across centers.

— Fantastic, thanks very much everyone. My name's Kenneth, and I'm a clinical research fellow based in Edinburgh.
I'm going to talk about some of the experience I've had in using REDCap not just at one center, but working efficiently when you start getting more places and more sites involved. I imagine everyone's aware there's been a lot of legitimate movement away from single-center studies, because there's a recognition that there's potentially more value — studies get done quicker when they're run multi-center, and that kind of thing — so this can be a really useful application of REDCap, which has some really good features built in to help deal with it. Just for some context — and this is all aspirational stuff, not what you'll probably be doing right off the bat — REDCap can be used for really massive projects. Some that I've personally had experience with: projects with thousands of patients across 150 centers and 1,800 collaborators, right up to hundreds of thousands of patients and 10,000–15,000 people, all using REDCap to collect data from sites not just in the UK and America but across the world, on a global basis. It's a fantastic resource for bringing lots of data from different places together in a way that's all contributing to answering the same question in the same way — a fantastic platform for standardizing like that. But as you can probably imagine, with tens of thousands of collaborators and hundreds of thousands of patients, it's very, very easy to get overwhelmed by how you manage all of this in the background: all those people need accounts, and all that data needs to be checked and cleaned, which is a huge job for one central team.

So, some of the potential issues with multi-center research that are worth considering: how do we make sure the data remains protected among different teams, and that only the people who should be accessing the data from a particular site can access it? The way that's managed within REDCap is with what are called data access groups, or DAGs, which are basically ways to link users and patients together under one umbrella — a subsection of the data set that only those users can access. They upload patients within that DAG, so everything stays nicely together, and from a confidentiality point of view, that ensures there's no inappropriate data access from other teams for whatever reason.
As I've mentioned, the issue comes in when you've got lots of people who need access to their very specific team and their DAG: how do we manage that? REDCap is, as we said, a fantastic platform — the best platform for this purpose — but it does have limitations that become more apparent as studies get bigger. As I've indicated, the administration around these projects — particularly managing missing data and monitoring data going in — can be really difficult to do live, which is exactly the time when you can still do something about potential problems. Without ways to make this more efficient, it can be really difficult to do in real time, and that can affect the quality of projects: data completeness is something projects live and die on, and if you haven't got much data, or you're missing some key outcomes, that can really hamper what you can achieve.

That's where this package I've developed comes in. It's called collaborator, and it's basically a host of functions that I — and a few other people involved in these large-scale projects — have developed over the years to make these things more efficient and make them work at scale, meaning it's no more effort to do something for 1,000 users than for 100, or across 10,000 patients rather than 100. And it can be used for whatever REDCap projects you do — it doesn't need to be multi-center research; it works for single-center stuff just as well. As Will was talking about earlier, REDCap has this fantastic API that you can pull data from and push data back into, and collaborator makes use of that to do a lot of this in the background.

When you think about managing projects like these, particularly multi-center ones, there are different aspects that cause the most administrative burden — typically the places where you're doing repetitive things at scale, which are fantastic opportunities for automation. Obviously, designing REDCap projects needs a lot of that human element to do properly, so we can't really automate that; but making sure users get allocated to the project and to the right team is something that can definitely be automated. Data collection obviously needs to be done by people, but monitoring the quality of the data going in, the amount of missing data, and so on is easy to automate, and the more data-cleaning aspects can be automated as well. These bits in blue are the areas where collaborator has functions to help with these aspects and make things a bit easier.

I'm going to switch screens to a markdown document I prepared earlier — everyone should hopefully be able to see that — and I'll take you through a few examples of how this can be used in practice, and a bit more about how these different aspects of REDCap work. If you're using these functions yourself, a lot of it is based around the tidyverse, so you'll want that package installed, and you can find the collaborator package on my GitHub — getting it onto CRAN is still on my to-do list.
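A hedged setup sketch; the GitHub handle below is assumed from the speaker's name — check the workshop links for the canonical repo:

```r
install.packages("tidyverse")   # collaborator leans on the tidyverse

# install.packages("remotes")   # if you don't already have it
remotes::install_github("kamclean/collaborator")  # assumed repo location

library(collaborator)
```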
Throughout this document we'll be using a ready-made REDCap project — it's all made-up data, but it's based on an example of outcomes of patients after a surgical procedure — and we're going to assume that all the data access groups (the DAGs) have been made, and that all the users who need them have an overall REDCap account. For those who haven't used REDCap before: users who are going to be doing data collection, in whatever role, need an overall REDCap account, which lets them log into REDCap, and then you need to give them specific access to the particular project — two steps that both need done before people can log in and properly access the data. (Yes, I'm very happy to share this document around afterwards for anyone who wants it — sorry, we should have done that earlier.)

One of the most helpful aspects is adding new users to REDCap. Let's say we've got maybe 50 users we want to add, from the various different hospitals involved in this project, and we want them to be able to enter data. These new users can be added to the project manually — adding each user one by one — but that can be very time-consuming and very error-prone, particularly if it's just one person doing it, so any way we can automate it is incredibly helpful. The first step is to create a single user manually in REDCap who has the user rights you want: the rights to create new records, add new data, and that kind of thing. That's via this side panel here — click User Rights, and you get the ability to add new users to that particular REDCap project. Say we've made one called Abe Archer, who we want to be a data collector: she can create and edit records, but we don't necessarily want her having access to the API, because we don't want her pulling all the data (maybe we don't have permission for her to do that), and we don't want her modifying her own user rights — again, for data protection purposes. So you first create this user — the person who has the rights you want all the other users to have — and then you can create a data frame of all the new users you want to have her exact role. You need a minimum amount of information to add users to REDCap: a username, a first name, a last name, and an email address, plus, if you want them allocated to a specific site, the particular data access group. So these are various hospitals — DEH, wherever — and these are all randomly generated names and emails, so no one's data is being shared inappropriately. Once you've created that template user and the data frame of all the other users you want added, we can move to the second part, which is using the code: a function called user_assign, which works with the same information you used for the API in the first place — the URL of the project and the API token you made before — plus the data frame of new users (the new data collectors) and the template person acting as the role you want everyone else to match.
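A hedged sketch of that bulk upload. The function name (user_assign) comes from the talk; the argument names, the server URL, and the user details below are invented for illustration — check the collaborator vignette for the exact signature:

```r
library(collaborator)
library(tibble)

# Minimum information per new user, plus the DAG they belong to
# (all names, emails, and DAGs here are made up):
df_users_new <- tribble(
  ~username,  ~firstname, ~lastname, ~email,                ~data_access_group,
  "b_baker1", "Bea",      "Baker",   "b.baker@example.org", "hosp_deh",
  "c_clark2", "Cal",      "Clark",   "c.clark@example.org", "hosp_rie"
)

user_assign(
  redcap_project_uri   = "https://redcap.example.org/api/",  # your server
  redcap_project_token = Sys.getenv("REDCAP_TOKEN"),         # never hard-code it
  users                = df_users_new,
  role                 = "a_archer1"  # template user whose rights are copied
)
```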
By the end of it, these users are all pushed in one by one, and you should have your 33 new users added to the project. It turns a task that could take maybe 15 minutes to half an hour if it's just 30 users — but potentially days if it's 10,000 — into something that runs in the background automatically, getting all these people added to the project much more efficiently and much more accurately than a human trying to do it. There's an online vignette that goes into more detail about how to use it and the different aspects, but that's a basic outline of how you can very easily add users to a REDCap project automatically without having to handle a lot of it yourself.

Another thing you might want to do — within a single-center project, but particularly with multi-center data — is data monitoring, which can be really, really difficult at scale. Particularly when you've got multiple teams collecting data, you want to make sure the data is coming back accurately, so you can contact individual teams if it's not, or if there are issues. There's also the aspect of social media and updating progress: when you've got lots of centers from lots of different places contributing data, you want to keep them updated, engaged, and enthused about how the project's going, and this can be a way of easily getting summary data out to share progress — for social media, or even just internal emails, and that kind of thing.
The trouble is that the summary information you automatically get from REDCap tends to be at the overall level, so it's difficult to get accurate information on how each individual center (DAG) is doing, and it could be quite challenging and time-consuming if you weren't using R. So here are a couple of functions I've put together. The first is about summarizing the multi-center data that's already present, and that's redcap_sum: it just takes the simple information needed to connect to the API and does all the complicated stuff in the background. Where you might use it is for an overall summary of the number of users, hospitals, and patients that have been uploaded. Here's an example from a previous project: every week or so, there was an update on the number of countries involved, the number of hospitals, the number of patients, and that kind of thing, to show people the progress and keep them enthused. Once you run this function, you get that overall summary automatically — you access it with $sum_overall, and you get exactly this information; this was the function used to make this figure, and we just ran it every week, and it very easily got that information out.

And say you want a more individual, center-level summary: if you're only getting overall information, it's difficult to know which sites need support, which are having issues, or which are doing really well. Another example, from a multi-center study: they wanted each site and the number of patients uploaded shown in a leaderboard — how many patients have been recorded — to keep people motivated and give particular centers recognition for how well they'd been doing. You can get that at the same time as you're running the original redcap_sum function: just use $dag_all, and you get that same information out — the number of records uploaded, the number that are complete versus incomplete, and the number of users at each site. That can be used to make this sort of figure, or to target centers that haven't uploaded any data — you can tell very, very easily — and then give them a bit more support, contact them, make sure the project's running well. A simple thing, but very valuable when you're working at larger scales. Again, there's a vignette on that, and I'm certainly happy to share this document around for anyone who wants to learn more.
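A hedged sketch of redcap_sum. The element names ($sum_overall, $dag_all) follow the talk, and the argument names and URL are assumed — verify against the collaborator vignette:

```r
summary <- redcap_sum(
  redcap_project_uri   = "https://redcap.example.org/api/",
  redcap_project_token = Sys.getenv("REDCAP_TOKEN")
)

summary$sum_overall  # overall totals: users, hospitals/DAGs, records uploaded
summary$dag_all      # per-DAG: records uploaded, records complete, users
```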
As I've spoken about before, there's the aspect of missing data, which is really, really important to minimize within projects: particularly if you're doing regression or similar with pairwise exclusion, you can lose a lot of your data if you have missing data on some very key variables. The difficulty with REDCap comes in particularly when you're using branching logic within your project. You can have questions that appear or don't appear depending on what data has already been entered, and REDCap can't really tell the difference: it provides some summaries of how much data is missing for particular variables, but it doesn't actually account for whether the data should be missing or not. For an example, we've got a variable here for the severity of respiratory complications after surgery. If you look, there are 710 patients this has been filled out for, but over 11,000 patients have it missing. The difficulty is that this is a branching variable — it only appears if they say yes, this person has had a respiratory complication after surgery — so you don't really know: should we be chasing up these 11,000 patients for their missing data? Actually, no — it's only a fraction of those that genuinely haven't filled in the information yet, and you could end up badgering people for something that's completely not their fault and more an issue of how REDCap records things.

So the issue is how to tell the difference, because REDCap can't necessarily do that by itself, and this is where we developed a function called report_miss — again, very simple to use, and a bit more complicated in the background. What it provides, firstly, is a summary of all the missing data at the data access group level, which can be used to identify centers that are struggling with missing data. You get the number of patients that have been uploaded by each center and the number of patients with more missing data than a certain threshold — if they have more than five percent missing data, in this context, they get highlighted. And what it does is account for that branching logic, which REDCap doesn't do in the first place: it works out which variables have missing data because the data is genuinely missing and should be filled in, versus missing because the branching logic that would make the field appear hasn't been fulfilled — the data collectors don't even know it exists, so they shouldn't be penalized for not collecting it. It works that out for you, so you get an understanding of who might be struggling.

That's helpful for the study organizers and the steering group, but it's not always as helpful for the people actually collecting the data, because it's vague information that somewhere you've got missing data. So it also gives patient-level missing data. Obviously, you need to be extremely careful when sharing any sort of patient-level data, so we've made sure the function builds in that everything is anonymized apart from the hospital and the record number — none of the actual details being entered are shown — and it can help data collectors pinpoint exactly where their data is missing and hopefully improve data completeness. This is just a way of formatting it for that purpose, but this is the output from the function: you can do $record, and you get this out — the record ID and the hospital site.
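A hedged sketch of report_miss; $record is named in the talk, while the DAG-level element name below is assumed — check the package vignette:

```r
missing <- report_miss(
  redcap_project_uri   = "https://redcap.example.org/api/",
  redcap_project_token = Sys.getenv("REDCAP_TOKEN")
)

missing$group   # per-DAG summary, e.g. flag sites above 5% missing (assumed name)
missing$record  # anonymised record-level view: "." present, "M" missing,
                # "NA" hidden by branching logic (not counted as missing)
```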
The record-level report gives one row per individual patient: the record ID, the hospital site they belong to, how many missing data points they have, how many fields there are in total, and how much data is missing for that particular patient. It also appends the whole dataset for that patient, all anonymized, using a simple code: a plain full stop means the value is present, so you don't have to worry about it; an "M" means the field is appearing to the collaborator, so it should have been filled in but hasn't been, so it's genuinely missing; and the rest are branching-logic questions that haven't been triggered, so they're not applicable for this patient, not relevant to the missing-data count, and aren't counted by the function. So that's a way to give a bit more support to the people doing the data collection: you get a very easy readout of the overall amount of missing data in your dataset and can target your efforts for improving it. There are also ways to customize this and do other things with it, and I'm very happy to share it around.

So that was a bit of a whirlwind; I hope I didn't go too quickly, and I'm very happy to take questions on any of this. This might be more aspirational for people who are just starting out, but multi-center studies are fantastic ways to do studies very efficiently, get lots of people involved, get data quickly, and get outcomes for patients quickly, so if you do want to do one, it's worth thinking about how to do it efficiently. For folks watching along, it may be a bit of a stretch to imagine having hundreds of thousands of patients, but it is doable as you scale up.

Any questions overall? Or just questions and feedback about the workshop: was it what you expected, are there other topics it would have been nice for us to cover, or would other resources be helpful? We'll drop a few links in the chat. Great; well, it sounds like it's been very helpful. I would encourage all of you: our very first ever poster session is tomorrow, and I dropped a link in the chat; the posters are actually up if you want to explore and experience SpatialChat. During the main meeting on Thursday and Friday, SpatialChat is where we'll have our breaks; it's a virtual setting, but it is remarkably good at mimicking break time at conferences, so give it a try. Caitlin asked whether she can use this talk: yes, absolutely, and feel free to download the slides from the Google folder. Caitlin, if you'd like me to email you the PowerPoint, let me know and I'd be happy to do that. And anybody who's had trouble accessing the resources on the Google Drive, let me know; I'm sure we can find a way to work around it.

Okay, well, thank you everyone. If you have no more questions, I think we've covered everything for today; parts of it were a whirlwind, and we covered a lot of ground and a lot of range. We'll drop a few more links. Amanda, I don't know if you could drop the links to your videos at Colorado, since they are award-winning and fantastic. Of course, just a second. Sure, awesome. So here's a link to our info page that has everything on it, and you can also find us
at UC Denver REDCap on Vimeo or YouTube. Okay, awesome. There's also a website that's unfortunately probably accessible to none of you, because it's open only to REDCap admins. But if you have a question and you're feeling isolated at your institution, talk to your REDCap admin: they can go to that site and get more resources. It's called Community, the REDCap Community site, in case they don't know what the REDCap forums are; that's another resource too. Great.

Okay, well, there's also Kenny's R Markdown. Kenny, I think you have everyone's emails from the messages we circulated right before the session this morning, so if you could send that out, that would be awesome; I think everybody would like to see it. Definitely. Okay, and hopefully we will see all of you virtually in the next couple of days, in SpatialChat and in the sessions at the main part of R/Medicine. Feel free to ask more questions, but this is all of our content. Thanks for coming, and we'll end it here.