and ways to make data entry in Excel and spreadsheets better, to use the best features of these tools, and to avoid the pitfalls that unfortunately a lot of people fall into. For this workshop we're going to assume that you have access to Microsoft Excel or to a free tool that will let you work with Excel files, like LibreOffice or OpenOffice, that you've downloaded the three Excel files for this workshop (they're pretty small), and that you can use Zoom to post questions in the chat window or post reactions. Or, since it's such a small group today, feel free to unmute and jump in and ask questions. No problem with that. The three files you should have are Exercise 1, Exercise 2, and the Organized Excel file. These should be at this link, which hopefully you all got in your email. Just to test reaction sharing, can you post a thumbs up reaction if you actually have the Excel files? There we go. There are some. Super. And we have a few more people hooking up right now, so that's encouraging. Dawit and Javier and Kenji and Inakashi just joined in. We're just getting organized. Just make sure you have the three spreadsheets for the exercises; we'll get to those probably in the last third of this. The Google Drive link is also in the chat for anybody who can't get in. We are recording, so if you missed the beginning, we will have that. We're going to start a little bit slowly because a couple of people are still joining in. There's Troy as well. Feel free to join in and ask questions through the chat or post reactions.

So our goals for today are to learn about the pitfalls of Excel, particularly for medical data entry; to learn how to structure tidy data, which is a particular data structure that's very helpful for data analysis; to learn how to store your metadata along with, but separate from, your data, so it doesn't interfere with loading or analyzing the data; and to learn some common data errors and how to avoid them.

So we're going to start by talking about the pitfalls of Excel. There are alternatives that are superior on many fronts, but they require a substantial investment of time over your research career to learn. One of the temptations and pitfalls of Excel is that it's easy. It's readily available in a lot of places; you can just open it up and type in your data. Unfortunately, that doesn't necessarily produce good outcomes. And much as you, in the Far Side cartoon, have to organize your posse, you have to organize even the worksheets in Excel to produce the best outcomes. So Excel is everywhere, but it was designed to help small businesses: to help with data entry, act as a sort of database (but not really), perform calculations, format tables, visualize your data, and be a sort of all-in-one. But Excel is not great at any of these. So Excel might not be the right tool for your job, even though it's often the tool at hand. Using Excel is like falling into a pit of not-great that's hard to escape, and your medical research deserves better. So what tools do you actually need for reproducible medical research data management? Ideally, you'd have a data entry tool with data validation, which checks the data type and the range when you enter the data and tells you if there's a problem. Tools that make it hard to overwrite data by mistake, so that you have to be aware of it when you do, especially with protected health information. A format that's good for collaboration. And an audit trail.
So each step, any edit to the data, is recorded. Data that are easy to analyze. Data analysis that is reproducible. And any errors that occur are obvious and easy to detect, not silent. But in reality, Excel isn't great at hitting those benchmarks. Excel is great at making things easy for small businesses, and unfortunately, it doesn't scale well; it tends to fail with big data. Excel is filled with compromises to make it more user friendly. It's really easy to overwrite data: you just click on a cell and start typing. You can type anything into a cell, even if it's the wrong data type or an obvious typo like the 15th month of the year. It's very easy to overwrite formulas, so if you had a calculation, it will still look like you calculated the value of 47 even after someone has typed over it. Formulas are also hidden, which is great for printing but bad for checking that your formulas are correct. And it's common to use point-and-click or copy-and-paste to manipulate your data and move it around, which, especially if you do multiple steps, leads to undetected errors. The structure of Excel, our habits, and wide computer screens encourage very wide data, left to right. Excel is actually designed for very tall data: the limits of Excel are 16,384 columns and just over a million rows. So it's designed to have a lot more rows than columns, but that's often not the way it's actually used, and that's somewhat a human interface problem. And there are many tools on the web to open password-protected Excel files, because spreadsheets were never really intended for private healthcare data and are not really secure. And Excel fails silently, and this is really a big problem. You can run into problems in Excel and never know it, and that's really one of the scary things in medical research.

So I'm going to tell you a few horror stories of Excel, and these are things that actually happened. In 2020, the UK COVID tracing program lost tens of thousands of COVID cases when Excel ran out of rows. They used data in CSV format (comma-separated values) imported into XLS templates. But XLS templates, which are a slightly older format for Excel, are limited to 65,536 rows, so any extra rows were silently discarded. Someone eventually noticed they were getting exactly 65,536 rows every day for eight days in a row. So whenever a region went over that limit, they lost cases. Just over 15,000 additional cases of COVID-19 were not traced during those eight days, and contacts were not tested or notified. And a quote from the University of Cambridge: "Excel was always meant for people mucking around with a bunch of data for their small company to see what it looked like. When you need to do something more serious, you build something that works at scale, but you wouldn't use XLS. Nobody would start with that." Unfortunately, they did. A few other examples: 88% of forensically examined business spreadsheets (and you wouldn't do a forensic examination unless there was a problem) contain errors. If you really look, there are a lot of errors hidden in spreadsheets, because the formulas are not obvious and not surfaced. More than 20% of gene data sets submitted with scientific manuscripts contain errors. And perhaps most famously, in the Potti-Nevins scandal, a Duke microarray study of precision cancer therapy, due to multiple Excel errors and a lot of off-by-one errors, labeled less effective cancer therapy as more effective for a particular cancer, leading to bad outcomes in gene-expression-driven cancer clinical trials.
And we're going to send you the slides, and I have links and details on this particularly bad episode in science. They tried to do precision medicine with Excel files, made a number of errors in the files themselves, in how they entered the data and how they calculated the data, leading them to get very excited about the idea of precision cancer therapy and actually starting clinical trials based on these data. The trials were eventually shut down, and millions and millions of dollars in damages were paid, because they actually selected worse therapies for patients and may have done them significant harm. And on the business side, a silent calculation error in a spreadsheet in 2012 led to the loss of over 6 billion US dollars by JP Morgan. This story is known as the London Whale trade. It was an error caused by manual point-and-click manipulation of spreadsheets, where they brought data from several spreadsheets together manually. And in general, it's important to remember that any kind of point-and-click manipulation of data where you do multiple steps is prone to errors and is not a good practice for reproducibility. Ideally, for any kind of data merging, you would do that on the programmatic side, with a script written to do it the same way every time.

So there are several spreadsheet pain points, and I'm going to talk about a few specific ones where things go wrong. Data entry is particularly problematic; there are lots of errors. The typical error rate in studies is between 1 and 6.5% of cells where the entered values are just wrong. In the FBI National Crime Information Center database of open arrest warrants, where they take data from all over the country and try to merge it, 15% have errors and identify the wrong person, which is really kind of frightening if you have someone searching a database for your name, especially a common name. Spreadsheets often auto-format data to be "helpful," with unintended consequences. They interpret a lot of things, like gene names (one example is called MARCH1), as dates. Not completely inexplicably, but it's a real problem. And there were 27 gene names based on the international standards that had to be changed in 2020 by an international consortium, because Excel would not change this behavior. There were stories about this in 2020, and it's just one of those dangers: Excel is looking for dates, trying to be helpful, and will change things that you don't necessarily want changed. Dates are also funny in that they're actually stored as whole numbers, not in a date format. On Windows, Excel uses the number of days since January 1, 1900, and that's also true of Excel on the Mac since 2011. But Mac Excel versions before 2011, depending on what version you have on your computer, are actually based on 1904 dates. So you could move a file from one computer to another and it would actually change the underlying dates. It's just one of those weird things that you wouldn't know about until you ran into it, and hopefully post-2011 there aren't that many copies of the old version of Mac Excel around, but you can run into really surprising silent problems that can cause real issues in your data analysis. Another common one is what's called the off-by-one error in data entry, and this happened several times in the Potti-Nevins case. You may think, you know, 1 to 6%, that's not so terrible for data entry, but you can get much larger error rates than that. It's not specific to Excel, but you can get what's called an off-by-one error.
Imagine someone entering a column of systolic blood pressures: 147, 133, 121, 127, and then 136, but they somehow just missed the return key. So you end up with 127136 as one value for the systolic blood pressure, then a bunch more entries, and then what looks like a piece of missing data at the bottom. That's a real problem, because all of the values after this off-by-one error are going to be wrong and associated with the wrong patient. And ideally your data analyst, or you, if you're doing the analysis, will see this and not just go to the source document, say "oh, that was supposed to be 127," and stop there. Hopefully you'll recognize that this is an off-by-one error, and put the 136 with patient five, 144 with patient six, and keep going so that all the blood pressures end up in the right place. You can get the reverse with the diastolic blood pressure: if you hit return twice, what was supposed to be patient four's diastolic blood pressure ends up with patient five. And ideally your data entry person would get to the end and realize, huh, there's an extra value, there's a problem. But sometimes they're, you know, a paid undergrad or whoever, and they'll just say, ah, whatever, next column. And you could look at this, go to the source document, ask what the blood pressure was for patient four, see that it was supposed to be 66 and wasn't missing, and put in 66, not realizing that all of the other values are also wrong. So you can get a lot more errors than you might have thought, very quickly, with an off-by-one error. It's something to know about and watch for.

So, better data entry for medical research data: you want to have data entry forms with labels, so you know which patient and which field, rather than just open sheets where you can easily get off-by-one errors. It's best to use a tool that's secure, protects protected health information, and is HIPAA compliant in the U.S. And it's best to use a tool that can validate data on entry: look at something and say, is this a valid date? This day is certainly off, and the year is most likely off; it's probably not valid for your current study. Be able to check if a value is out of range, like a systolic blood pressure over a thousand, or detect when there's a problem. When someone enters a potassium of 6.7 and adds a text note like "hemolyzed," that's a problem, because that field should be numeric, and it should be rejected, asking the user to just put in a number. Or if there's a typo, or your N key sticks when you're typing "Hispanic," that looks like a new category to a computer, and it should be rejected and you should be asked for the right value.

Another problem is that, in general, Excel files were not designed for security and privacy. Excel, by design (this is a fairly old program), assumed that all data would be removed from the computer and locked in an office, usually on a removable floppy disk that could be locked in a safe at the end of the day. This was before hard drives were a thing. So security was kind of backward-hacked onto it. You can password protect spreadsheets, but it's so common, unfortunately, for people to lock spreadsheets and then need to get into them, that there are lots of programs on the internet to open locked spreadsheets without a password. And that's really unfortunate, because it means even your password-protected spreadsheets aren't that secure. The other big issue is formulas: for the purposes of printing, formulas are hidden in cells, sort of behind the displayed data. And this is great for printing, but bad for checking.
And you can easily have errors missed, and you can easily, just by clicking and typing, type over and inadvertently edit your formula. Nothing is there to stop you; there are no warnings and there's no audit trail. If you're not familiar with the term, you might wonder what an audit trail is. An audit trail is tracking every step in the data analysis. It means that you know every step, you can reproduce every step, and for every edit to the data you know who made the edit, when the edit was made, and why. So you can take your raw data, repeat every step, and end up with the same figures and the same tables for whatever the publication was, essentially ending up with the same results if you start with your raw data. That's really hard to do if you don't have the data, or you don't have the code, the scripts, that produced the final result. And just to give you an idea of the state of the field: in 2021, I contacted four well-known authors of papers published in the last five years in my field, gastroenterology, to extract data for a meta-analysis. None of these folks had access to the raw data. There was zero reproducibility. And this is a real problem for the field. It's been more publicized in oncology and psychology, but I think it's true of a lot of medical research, and it's kind of the dirty secret we don't want to talk about very much.

The other big problem is that Excel generally fails silently. A problem that occurs, like too many rows or a change in formulas, generally does not trigger a warning or error. This is a feature to keep customers from being annoyed, but it just lets you do whatever you want, and it's a little bit dangerous, because when something goes wrong, there's no way to tell. And when problems occur in spreadsheets, like the Enron scandal a few years back, it requires what's called forensic accounting, which is a whole field of people who know how to go in and check every cell and every formula to find out what really happened.

Excel encourages wide data. Now, technically, Excel officially wants tall data: the formats for Excel, .XLS and .XLSX, allow a lot more rows than columns. It wants tall data. But humans just generally don't do that, unfortunately. Most users record data over time from left to right. We read from left to right, we like to avoid scrolling, and our usual screens are wide, not tall. So we tend to enter data in a spreadsheet in a wide fashion, especially longitudinal data over time. This is a typical example, where you have patient IDs, dates, and blood pressures at the screening visit, the randomization visit, treatment one, and treatment two. This creates unwieldy data files, and one of the biggest problems from the computer's point of view is that it sees, hey, there are four columns named date, there are four columns named SBP; what do I do with those? What this person really needed was a visit name field with values for screening, randomization, treatment one, and treatment two. And we're going to go into that as we talk about building tidy data. So why do good research people do bad data things? Some of it is just natural. The screen is wide; we want data to fit on the screen with less scrolling, so we make wide data. We also read from left to right, and when we add new data, we tend to add it to the right side, which is problematic later for analysis. And we don't like to type more.
We want to avoid repetition, which leads us to do things like encoding variables in the headers, like the visit names in the last slide. Excel also encourages us to encode data as colors, what it calls cell styles, which is really a problem for computers to read, because that's encoded data. Excel is silently "helpful" with certain data types: it changes genetic data, gene names, to dates, and it changes a whole column of numeric data to text if even one character is present. If only one cell has a little bit of text in it, it assumes that the whole column is supposed to be a text column. And because Excel enforces no rules, we can do anything, which in some ways is great, but in some ways is really bad. So, some of the temptations and limitations of spreadsheets: Excel is easy to use and very accessible. It's being used for purposes for which it was never designed, at a scale it was never designed for. Excel frequently fails, and fails silently. The use of manual point-and-click and copy-and-paste frequently leads to errors, because it's not reproducible. Separating data entry, calculations, and reports is actually a good thing, and the all-in-one is convenient but problematic, because it's not necessarily the best tool for the job. Ideally you want to use tools fit for purpose.

And I'm going to make a recommendation. We're not going to talk about this a lot today, but think seriously about using REDCap if you have access to it, and a lot of people do. It's Research Electronic Data Capture, an open source, free program you can download, and it does all the things I'm going to talk about today for data capture. It's available for free at almost 5,000 institutions worldwide in 141 countries; it hosts over a million projects and has over a million and a half users. As a future state, if you can, learn REDCap. It's a great tool dedicated to medical data entry, and it addresses a lot of the issues I'm going to talk about today. But most of you are here because you need to do this now, whether it's data collection or data analysis, and so for today we will focus on how to best use spreadsheets to minimize these data problems. In the long term, though, if you're committed to medical research, it's worth the time to invest in yourself to learn REDCap, and it's worth investing the time to learn some R, or to work with a statistician who uses R. That way you can download directly from REDCap into your statistical analysis program.

So now we're going to move on to tidy data and talk about data structure to make your life easier, and best practices for spreadsheets. Tidy data is essentially a data standard that people can use to share data or share data analysis tools, because if you have a certain data structure, all the tools that work on tidy data can work on a variety of data sets, as long as they're in the tidy format. In general, why do we need data standards? Data sets can really be in all shapes and sizes, partly because you can do almost anything in Excel. But data manipulation tools, whether to summarize some data or plot some data, need data standards. And if you can build your data set in a standard tidy format, there will be a lot of tools that work with your data. Tidy data has essentially become the standard data format since about 2014, and there is an increasing number of statistical tools that work best with tidy data.
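To give a flavor of what that standard buys you downstream, here's a minimal R sketch, added for illustration and not one of the workshop files, that reshapes a wide layout like the blood pressure example above into the tall, tidy form with tidyr. The column names are hypothetical.

```r
# A minimal sketch (hypothetical column names) of reshaping a wide layout,
# one blood pressure column per visit, into the tall tidy format.
library(tidyr)
library(dplyr)

wide <- tibble::tribble(
  ~pat_id, ~sbp_screening, ~sbp_randomization, ~sbp_treatment1,
  1,       147,            142,                135,
  2,       133,            131,                128
)

tidy <- wide %>%
  pivot_longer(
    cols = starts_with("sbp_"),   # the look-alike columns
    names_to = "visit",           # the hidden variable encoded in the headers
    names_prefix = "sbp_",
    values_to = "sbp"
  )

tidy   # one row per patient per visit: pat_id, visit, sbp
```

Once the data are in this shape, the same general-purpose tools (dplyr, ggplot2, modeling functions) can treat the visit as an ordinary variable instead of four look-alike columns.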
So tidy data is a standard way of mapping the meaning of a data set to its structure. In tidy data, each variable forms a column, and each observation forms a row. So each time you have a visit, or gather data on a patient, or have a patient encounter, that's a row. If you see one patient six times during a study, there should be six rows; each observation, each visit, should be a row. And each cell should be a single measurement. The standard structure of tidy data means that all tidy data sets are essentially alike, but every messy data set, unfortunately, is messy in its own way. And you can see a lot of creative and interesting examples of data sets if you work with groups that are doing data collection in Excel on their own; you can see all kinds of things going on. When you're working with tidy data, one of the advantages is that you can use the same tools in similar ways for different data sets: to read in the data, visualize it, rearrange it, and develop models. But with untidy data, you have to create custom tools and work with the data to get it into a model, or into the format you need to plot or analyze it. So having a data standard promotes collaboration. Because you have a consistent data structure, it makes it easier to combine data from different sources or share data with colleagues for collaborative research. And having a tidy data standard makes building data pipelines easier, with standard input and output: every input is a tidy data set, every output is a tidy data set. You can do multiple steps in sequence, from wrangling to visualizing to modeling, and you can do it much more efficiently with a data standard.

So what is tidy data, and what is untidy data? Here's an example of a very tiny untidy data set. Given the rules I've given you, where each column is one variable, each row is one observation, and each cell should contain one unit of data, why is this one untidy? Feel free to chime in on the chat, and I am going to pull up the chat. Think about that for a moment. There are actually two problems. We commonly refer to blood pressure as two distinct numbers and bunch them together, so you have a systolic and a diastolic blood pressure, plus a little bit of punctuation, in one cell, when these should be separated as two distinct variables. There's also a kind of hidden variable, this "screening," which is the name of a visit. You can imagine that if you've got a screening visit, you might also have a randomization or a treatment or a follow-up visit. So you probably also need a column for the visit name, or a visit number if that's what you use. So this is a tidy version of the same data, and it's noticeable that the tidy data format often has more cells, in this case nine versus six. It is a little bit more work, and we're not inherently going to start with tidy data; we have to think about what we need to do to make the data tidy. Here's another example of untidy data, a really tiny example with a pre-test glucose and a post-test glucose. Maybe this is a version that somebody put together for a glucose tolerance test. Again, think about this: we want one observation per row. So if you think about a visit, pre or post, or a measurement, pre and post, the problem is that we want to be able to measure one variable, glucose, and have one observation per row.
So for patients one, two, and three, their pre-test values and their post-test values should be in distinct observations, distinct rows. Sometimes the unit of observation takes a little bit of thinking and a little bit of decision making: what is my unit of observation? But generally, if you're measuring the same variable twice, those measurements should be in different rows. Another thing to notice is that tidy data is often taller as well as having more cells. In general, if you're making data more tidy, or structuring it to be more tidy, it's going to be taller, which is the format Excel actually likes best. It's just not natural for humans to make data tidy and tall, partly because we don't like to scroll down. Here's another example. Why is this small data set, looking at systolic blood pressures, untidy? Again, you want each variable to form a column, and each variable should be a single measurement. And of course, we've got systolic blood pressure twice, which is a problem. So this is another version, where the tidy version has systolic blood pressure in one column. The untidy version had a hidden variable encoded in the header names, which you don't want; you want that as a clear, distinct variable with the visit number, one or two, or as many visits as you have. And again, it's usually a little bit taller. So that's the idea behind making your data tidy. When you bring untidy data to your statistician, sometimes that's where you get a little bit of the statistician glare of death. You mean well, but they would really prefer you start with tidy data. And if you know the principles and can structure your data that way from the get-go, ideally in consultation with your statistician before or shortly after you start collecting data, you can prevent a lot of problems later.

So now I'm going to talk specifically about good data practices, not just data structure, for building data sets. We've got 14 good data practices today to make clean and tidy spreadsheets. Number one is to make clean rectangles of data, no more, no less, and no extras, no metadata that you've added. Your worksheet should have one row, what's called the header row, of variable names, not multiple rows, just one; rows of data with no blank lines or columns and no empty cells; and nothing else, no extra comments, titles, et cetera. Here's an example of some extraneous stuff. This is not a clean data rectangle for several reasons, but the extra stuff is the title, "Data from my study," and some global notes on the data. This is useful information, but it's metadata. It's not a measurement you've taken; it's extra to your data set, and it should be stored on a separate worksheet. This is a really important component of using Excel or other spreadsheets well: use multiple worksheets. Use those tabs at the bottom of the Excel page to create multiple worksheets, to store your data separately from your metadata, your code book, and so on, and we'll go through all of that. One of the problems in this example was an extra header row outside of the data rectangle. We've got "visit one" and "visit two," which is actually encoding a separate variable. You don't want visit one and visit two as headers; you want a visit number, a visit_num column in this case, so you need to add a column for visit number. And you don't want the same measurements appearing twice in the columns: you have SBP here and here, and diastolic here and here.
This is really confusing from the software point of view, because it has two variables with the same name; how does it even differentiate them? So you want to clean this up. You also don't want any empty cells, and truly empty is a problem, because a truly empty cell looks the same to you as a cell with a space, or even three spaces, but to the computer they're completely different: an empty cell is one thing, a cell with one space is a different thing, and a cell with three spaces is different again. So it's better to fill empty cells with a clear marker, and capital N, capital A, NA for "not available," is the standard. In the past, people have sometimes used code numbers like negative 99. That can cause havoc if people don't know about it, and your data analyst or somebody later tries to take an average and is really puzzled as to why they got a negative number, or why their numbers are a lot lower than expected. It's better to use capital NA. And if you want to store details on why the value is missing, store that separately in notes. You know, if the sample was hemolyzed or the blood clotted, that's metadata; you don't want to store that in your data rectangle. So a clean data rectangle should have, again, that single header row of variables, with none of them repeated. Every row should be one observation, and every column a single variable, again none repeated. Every cell has one unit of data, with no blank lines or columns, and there are no empty cells: anything missing is clearly marked with NA. And there are no stray notes or titles or metadata; we're going to store those elsewhere, on a separate worksheet.

Principle number two: use consistent variable names, because consistency makes computing much easier for computers. To humans, capital BP and lowercase bp are both blood pressure; that's obvious, and you can look at it and say, sure, that's the same. But upper case and lower case are different to a computer; those are completely different names. So avoid capitalization in variable names. Computers also struggle with most punctuation; a lot of it has meaning to a computer that it doesn't have to us. So avoid punctuation other than underscores or dashes. Those two are allowed and are actually pretty helpful. And humans have no trouble understanding something like "BP standing," but computers really struggle with spaces; they wonder if BP and standing are two different variables. So in general, avoid spaces and replace them with underscores or dashes in your variable names. Generally, lower case snake case is recommended. Here are a few examples of the different cases people use, and typically we use lower snake case, like systolic blood pressure in millimeters of mercury, heart rate in beats per minute, AST in international units, albumin in grams per deciliter. You'll notice we're separating chunks, like the variable name and the units, with an underscore, whereas words within a chunk that goes together, like the units, we're separating with dashes. That helps us keep track of those. It's important, and this is principle three, to include units in variable names. Just wonder if there's any possible way someone could guess wrong on the units, or whether the units might change, so that someone new opening the data set in 10 years might assume you used the new units and not realize you used that really old version that nobody uses anymore.
So it's helpful to use variable names where the name and the units are documented right in the variable itself, you know, per microliter or millimoles per liter, and again, chunks separated by underscores and words within chunks separated by dashes. Principle four: do not encode data with color. Avoid using font color or highlighting to differentiate one group of data from another. Your data should be plain vanilla, even though Excel encourages you to use all of this. Highlighting and font colors are easy for humans to read; computers don't get it. Computers need ones and zeros. Here, somebody encoded the two different arms of a study: treatment in darker green, control in lighter green. Computers have no idea what's going on here. Ideally you would have a separate variable for study arm, with values for intervention and control. And as Stefan pointed out in the chat, in general a clean data rectangle will look a lot more bland than what we're used to with pretty formatted data. You want bland, you want tall, and you want everything to be a distinct variable. Principle five: be consistent with categorical variables. Always use the same spelling and capitalization. Humans know that "Hispanic" capitalized, "hispanic" lowercase, "Hispannic" with an extra N, and so on are likely all the same value in a categorical ethnicity variable, but computers don't get that; they think these are four different categories in the same variable. And it's really hard to be consistent and avoid typos, extra spaces, and changes in capitalization; even inadvertent spaces at the end of a word, which are invisible to humans, look like a completely different category to computers. So it's important to set yourself up and structure for success: use consistent capitalization, or no capitalization if that's easier, and set up dropdowns in Excel for each categorical variable. We're going to go through how to prevent problems with categorical variables a little later.

This is where we're going to jump to one of our exercises. We're going to open Exercise 1, the XLSX file, and I am going to change my sharing to our first exercise. In this file, your first sheet is called data, and there's a second sheet called cat_var_lists. If you want to add more sheets, and you've never done this before, you can click on the plus sign and add a new sheet. If you want to change the name of a sheet, you can click on it and call it, say, data1; I'll change it back. So you can have multiple sheets, and this is a good way to separate your data from your metadata or other things, like these lists that make your data entry more consistent. We're going to create a column for a categorical variable called study_site, and this goes in cell A1. We're then going to go to our second worksheet, cat_var_lists, and call its first column study_site as well. Doing that helps us create consistent categorical variables. So go ahead and enter along with me: we're going to enter four cities, Ann Arbor, where I live, Boston, Chicago, and Detroit. So you've got four cities, conveniently in alphabetical order, in cells A2 to A5, and this is going to be the list of acceptable values for the variable study_site. I'll give people a moment to get that sorted. Once you have all four of those entered, let's go back to the data tab. We're going to select this study_site column, and then, opening this up a little bit more, go to the Data tab and Data Tools.
We're going to choose Data Validation, and this opens the data validation window. Under Allow, I want you to select List; we're going to give it a list of values to choose from. Just in case anybody's gotten stuck here, I'll back up a little: it's the Data tab, Data Tools, and you want to select Data Validation. When you get to Settings, you want to select List. Then you want your source, and our source, of course, is going to be on this second sheet, cat_var_lists. If you click on the source field, you can go to cat_var_lists and select A2 to A5, then press return, and it puts in the reference. Basically, it tells Excel: look in the sheet cat_var_lists and take the values from cells A2 to A5; those are going to be the allowed values. Let me repeat that, because that was a little bit fast. You're going to go to the Data tab, Data Tools, Data Validation, and select List. Then you're going to click on this little icon here, which lets you go find the range: click on cat_var_lists and select all of the values. It's important not to select the study_site header, just these four values. While the selection is blinking like that, hit return, and that will pull the range into the source. When you're happy with it, and it's showing A2 to A5, click OK. The result should be that when you click on your data sheet, there's a nice little dropdown arrow here. If you click on the dropdown arrow, you can pick Ann Arbor, or you can pick Boston, Chicago, or Detroit. You can enter as many rows as you want, and you will always get the same spelling and the same capitalization; you will not make any errors. But if you try to type in, let me pick a city starting with E, El Paso, it will say: this value doesn't match the data validation restrictions defined for this cell. It's not in the list. So then I can either cancel, or retry and go back and put in one of the allowed values, like Boston. If I type in Chicago, it likes that, or if I type D it pulls up suggestions. If I start typing A, it'll suggest Ann Arbor and I can hit return, and that's usually faster; I can just type B or C. And you can avoid problems that way.

So let's switch back, and I'm going to switch the sharing back to my slides. Let me know how you're doing; feel free to put comments or questions in the chat. If you got through that fine, it worked for you, and you were able to set up a list and make that data entry work, please send a thumbs up reaction in Zoom. The reactions icon is the smiley face with a plus at the bottom of the Zoom screen. So hopefully you can all send some thumbs up in the next 30 seconds or so. We've got a few; Seb and Evan and Kena are there. Okay, that's encouraging. If anybody's stuck, please put a question in the chat, but if that looks good, I think we're okay, and I'm going to charge on. There's also a link here with step-by-step instructions on how to do this, but hopefully that was pretty helpful.

Item six is to be consistent with dates. Dates, unfortunately, are a messy, poorly standardized data field, and that's not necessarily a spreadsheet problem; that is a world problem. There are so many ways to mess up dates, so many formats, and we have a bunch of different ways to do it wrong.
So, for example, we have dates with hyphens, slashes, and many more separators. It's common in the U.S. to start dates with the month; it's common in Europe to start with the day. You can use numeric months or text months. You can use two- or four-digit years. You can use dashes, slashes, or underscores as separators, which leads to a lot of problems with any international collaboration, and it's very common to get into trouble with dates. An international standard, ISO 8601, was established in 1988 by, no surprise, the International Organization for Standardization. So the standard for dates is now a four-digit year, dash, two-digit month, dash, two-digit day. Computers get this, and another nice advantage is that if you use this format, computers can sort the dates into chronological order. There are also parts of ISO 8601 that include standards for hour, minute, second, and time zones, if you need to record data in that small an increment. It's all usable, and there's a nice link here going through the really gory details of how this was created and what the ISO 8601 format is.

You can actually format your Excel date columns to ISO 8601, and I'm going to jump back to Excel now. If you go to your Excel spreadsheet, add a date column, and enter a couple of dates, say Jan 5, 2019 and March 7, 2020, you can also do all kinds of crazy things like 3/7/2018; there are a variety of formats that are not that helpful. You can select the whole column and say, I want a consistent format. Here's where you type either Control-1 on Windows or Command-1 on the Mac, and that should open up Format Cells. Under Number, you want Date, and it's the third entry down, the one with the 2012 example: a four-digit year, two-digit month, and two-digit day. Select that. These entries were a bit messy, but if they're reasonably formatted and Excel can figure them out, it will put them in this consistent format. You can type in dates and it will add leading zeros even if you didn't type those in. It's not perfect; it won't prevent you from doing completely crazy things, like inventing a 15th month. But if you have these lined up, an unrecognized date will align left, showing that it wasn't formatted correctly, and at least you'll recognize from the alignment that there's a problem with that date. Unfortunately, there are not 15 months in the year, or I'd be rounding a lot more. So that gives you an idea of one way to make dates a little better, and I'm going to switch back to my slides.

Number seven is being consistent with your subject IDs. Frequently you'll have distinct participants, and you want a clear, unique subject ID for each participant. This can get a little messy, but each study participant should have a unique ID, usually numeric. It should be particular to your study; it has to be unique across your study; and it cannot be their name or some part of their name. It shouldn't use or be derived from a medical record number, a birth date, a name, or any other item of protected health information. It should be a generated, distinct number, and ideally you assign a unique number at first contact and keep that through screening and enrollment. You may have a multi-site study, and a lot of times people use a unique prefix for each study site to allow ID assignment across sites. So 15-00041 would be the 41st participant at site 15. And in your data frame, you want to keep the site number and the participant number as separate variables.
You can put them together later for analysis as a distinct and unique ID. Principle eight: be consistent with files and file names. If you use multiple files, you may have distinct files from each site or from different data sources, for example if you're pulling data from a data warehouse, and you want those files to have the same data structure. If they're supposed to go together and be merged later, then you want exactly the same variable names in the same order, with the same categorical dropdowns and the same data validation rules for each variable. Ideally, you would set up one data template, test it in a pilot at multiple sites, finalize it, and make sure every site is using the same template. And ideally, as the data come in, you'll check and validate the data from each site and each file frequently, especially early on, to make sure you're getting consistent data. You also want to use consistent file names, especially if you have multiple files of different data types or from different sites. Decide what name chunks you need, in what order, to identify a particular data file, and then use the same chunks in the same order with underscore separators. For example, you might have serum_potassium_2019.csv: the fluid source, the analyte, and the year. Similarly, in the same set, you might have urine_magnesium_2020. Or, if you organize things by study site and data type, bp-study_boston_labs and bp-study_chicago_vitals, again separating distinct chunks with underscores and separating words within a continuous chunk, like the name of the study, with a dash.

Principle nine: be careful with text strings, especially things like gene names, which are notorious for having problems. Excel tries to be helpful and guess the data type in each column. Any column that looks a bit like a date gets turned into a date; any column with text in it gets turned into a text column. This is why you can't add text notes to your data, especially in numeric columns; you want to store those separately as metadata. You can and should set the format of a column in Excel: again, select the column, then Control-1 or Command-1, and format the cells as Text to turn off some of the helpfulness for that column. Principle 10: include a code book, especially if other people are going to look at this data, or future you is going to look at this data. Future you is your closest collaborator; future you collaborates with past you, but unfortunately past you never returns their email. So what is a code book? A code book is a detailed description of every variable in your data set. You want a separate sheet in your Excel workbook, labeled codebook, with columns for the variable, its definition, how it was collected, the type of variable, and the units or the range, if that's important for that particular data type. Principle 11: on the data sheet itself, do no calculations, just the data in a rectangle. One sheet in your data workbook should be labeled data and should hold only your raw, bland, tidy data in a rectangle, with no colors, no calculations, and no formulas, because formulas can be altered, typed over, or deleted silently. Let your statistician or your statistics program do the calculations, reproducibly.
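As a small illustration of that last point, here's a minimal R sketch, added for illustration, of doing a derived calculation in a script instead of in a spreadsheet formula; the file, sheet, and column names are hypothetical.

```r
# A minimal sketch (hypothetical file, sheet, and column names) of reading the
# raw data sheet and computing a derived variable in code, so the step is
# recorded and reproducible rather than hidden in a cell formula.
library(readxl)
library(dplyr)

raw <- read_excel("bp-study_data.xlsx", sheet = "data", na = "NA")

derived <- raw %>%
  mutate(
    # derived value computed in the script, never typed into the spreadsheet
    bmi = weight_kg / (height_m ^ 2)
  )

# The raw workbook is never modified; the derived data go to a new file.
write.csv(derived, "bp-study_data_derived.csv", row.names = FALSE)
```

Because the calculation lives in a script, rerunning it on the untouched raw workbook always reproduces the same result, which is exactly the audit trail we talked about earlier.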
Principle 12: record the closest thing to raw, granular data, within reason. In general, collect the raw data: the visit date and the birth date. Don't ask people their age; you can calculate that. Get the date of diagnosis, not the years of disease; don't ask the patients to try to do that math, do the subtraction yourself. Don't ask for any implicit calculations; get the raw data in granular form. Within reason, of course: maybe if the date of diagnosis is hard to remember, you just collect the month and year, and for birth date you're certainly not going to collect the hour and minute. But try to get the most granular data you can.

Principle 13, and this is a big one: use data validation. You want to protect yourself from data entry errors and off-by-one errors. So what is data validation? It's a set of rules for each variable, each column, that restricts the data type, whether it's numeric, character, date, or categorical, and the range of allowed values. You can set up data validation for each variable, and when you do, it dramatically reduces errors. The data entry error rate is about 3 to 5% per cell without data validation, and probably less than 1% with data validation. We still have issues with dates, but that's probably the best we can get. Data validation is not the default for Excel, but it is available. And only double data entry, which is when you have two separate people enter data from the source documents, combined with data validation, is better. So how do you set up data validation in Excel? You're going to plan this out and, for each column, assign the data type, whether it's numeric, character, categorical, or date. And for each column, think about what your restrictions should be. For numeric, decide on reasonable allowed minimum and maximum values; think about the most extreme values you've seen, particularly if you have a set of patients who may be outliers. For categorical, make the list of allowed values, and you may have to include a value of "not available"; that's okay. For your categorical lists, you set up an extra sheet in your Excel workbook, which we call cat_var_lists and which we just walked through, just for your lists of allowed values, with one column per categorical variable.

So now we're going to do an exercise with numeric variables. Take a moment to open Exercise 2, and I will switch screens to look at Exercise 2. Okay, hopefully you can all see this; let me get the headers on there. You can see that in your data tab you've got the variable heart rate. We're going to select the entire column for this heart rate variable, and we're going to again select Data, Data Tools, Data Validation. In this case, rather than letting you type any value, we're going to say no, we want whole numbers for heart rate, between a minimum and maximum that we'll pick. I'm going to be generous here and say 20 and 300. A heart rate of 20 may not actually be compatible with life, but maybe if you're really, really cold, and 300 is probably a reasonable limit for most populations. If we hit OK, that data validation is now in effect for this heart rate variable. So you should be able to enter and try values like 34 and 299, but also try values like 18, for which you'll get a message that it's out of range. 301 should produce the same result, but 21, or probably even 20, should be just fine. And this is going to prevent some of those just wacky mistypings, you know, typing two values next to each other.
So if somebody types a heart rate of 87, misses the return key, and then types in 66, it's not going to allow a value of 8766. This is the kind of thing you can set up in advance to prevent a lot of problems later, as long as you take a moment to think about what a reasonable range is. Okay, we can also do this for categorical variables. Let's add another variable; in this case we'll do ethnicity, if I can spell it right, and we're going to create this as a categorical variable. From the NIH and the US point of view, the ethnicity values can be Hispanic and not Hispanic; that's all you get. Since you've done this once before, this should be pretty straightforward. We go back to the data sheet, select this column, and go to Data Tools, Data Validation, and this time, rather than whole number, we're going to make it a list. Then it's going to ask us for the source, and we're going to click the icon to go back to the spreadsheet, click on cat_var_lists, select the values, not the name of the column, just the two allowed values, and hit return or enter. It should say that B2 and B3 are the only allowed values. We hit OK, and now we have our dropdown. If I type a capital N, I can drop down to "not Hispanic," and those are the only two allowed values. I'm not allowed to put in "other"; for the NIH, I've only got two values to choose from. So now you know how to set up data validation for numeric as well as categorical variables. Okay, I'm now going to switch back to my slides.

Now for principle 14, the last of our 14, and that is to protect your raw data and treat it like gold. You should always have backups, both onsite and offsite, just in case a hurricane, like at Tulane or at NYU, completely floods your data center and destroys your data. And never modify or overwrite your raw data. It's very tempting, if you see something like a blood pressure of 8,000, to type over it. But then there's no audit trail; there's no record of how that got changed or why. Always make a copy, modify the copy, and save it as a new data file. So you might go from study_raw_data to study_clean_data. You want to document every single time you change a piece of data or a measurement, so that you can track exactly how, when, where, and by whom that data point was changed. You should always be able to retrace your entire audit trail from the original raw data through all processing steps to the final products, whether that's figures, tables, or even the manuscript itself. If you can't do that, your research is not reproducible and is really not that trustworthy.

So, to put it all together in terms of starting a new data project: best practice is to start with an organized Excel workbook, and we're going to explore one. This is the Excel file organized.xlsx, and I'm going to switch sharing screens. Let's go to the organized workbook. In a perfect world, this is what an organized, very small study would look like. Your first sheet is your data sheet, and this is raw, boring, and rectangular, with no colors and no excitement. Second is the code book: each variable, its definition, how it was collected, the type, the units. Then some overall notes about your study: what the study was, who ran it, the dates, where the centers were. You can also have notes on particular values if there were issues; you know, it was supposed to be a seated blood pressure and this person refused to sit.
Or the patient ate a lot of salt this day, or this value is missing. Your validation lists for all your categorical variables are here. And then, if you really want to, you can make a pretty version, where you can do calculations and all the things you're not supposed to do, with color and whatnot, if you want to show it to someone. Okay, so that gives you an idea of what that can look like. I'm going to switch back now to the slides. It's tempting to just open up Excel and start entering data, and honestly, that's what most people do when they're getting started, but it's much better to start with a plan. You want to have a data sheet; a code book; notes on the overall study and notes on individual cell values, if there are issues with a measurement; data validation lists, to make sure your categorical variables don't change; and a pretty or formatted version, if you really want one, for printing or for sharing with other people.

The first tab, data, should be plain vanilla: no colors, bolding, comments, or notes, no metadata, just a rectangle with a single header row of variable names and no empty cells; any empties should be filled in with NA. The code book should be a standard code book with the name of the variable, its definition, how it was collected, the data type, and the units, to document all of this for future you, because if they have a question later, you're never going to email them back. Sheet three should be notes on the overall data collection, basically like a newspaper story: who, what, when, where, why, and how. If somebody opens this up 10 years from now and doesn't know where the heck it came from, this documents all of it, your exotic study of Mount Olive brand pickle juice and all the centers and the dates. Sheet four is for notes on individual cell values, and this is one where there are several ways to do it; I've seen people do it different ways. You want to attach the metadata about individual values: was there something odd about the measurement, was it missing for a particular reason? One simple way to do this is to write a note, and Excel can do this, on each problematic cell. You right-click the problematic value, and one of the right-click dropdown options is New Note, and you can type in an explanatory note about the data collection for that specific value if there's a problem. The nice thing is that the resulting cell will have a little red triangle to tell you, hey, there's something funny about this, and if you hover over it, the note will pop up. The other nice thing is that you can still read the data into a statistical program; the note won't get in the way, whereas if you actually typed the explanation into the cell, that would be a big problem. So you don't want to put this kind of metadata into the data cells, but you can put it into these cell notes. For missing values, you want to put NA, which makes it easier for statistical programs, and then you can put a note about why it's missing in the notes. Just remember that you'll never get that metadata about why a value is missing out into a statistical analysis program from cell notes. This is pretty quick and easy and visually appealing, but there are fancier ways to do it.
The next version is to have a separate sheet of notes on values, where you basically have a column identifying the data point, which explains which data point it is: the systolic blood pressure for subject seven at visit seven, the serum potassium for subject 114 at visit five, and then the notes on what happened, what went wrong, why there's no result: can't count BMs because they have an ileostomy, you name it. You have enough detail to understand what happened and document it, with NA in the data sheet if the value is missing. An even fancier version, and this is fairly convenient because it helps you see patterns in data problems, is to copy the structure from the data worksheet, keep the identifying information for each observation, the patient ID and the visit, but reformat all the data columns as text and leave them blank unless there's a problem. Then, when there is a problem value, you type in, for that value, "the patient was prone," or whatever the problem was. That lets you see, in the columns and the rows, whether there were recurring issues with particular variables or particular subjects, and helps you see patterns a little bit. It may not be worth going to all that trouble, but it can be helpful if there are certain recurring issues. Sheet five should be your data validation lists for your categorical variables, listing all the acceptable values so you can do the dropdowns on your data sheet. And sheet six is, of course, the optional pretty version; go wild, do whatever you want. As long as your original data sheet stays clean, you can do whatever you want on this tab.

So to summarize our best practices: set up clean and tidy data rectangles, where missing is always denoted as capital NA, with one observation per row, one variable per column, and one unit of data per cell. Use consistent variable names, with no spaces, in snake case, including the units. Do not encode data with color or in headers; if something is in a header, it probably needs to be a separate variable with its own column. Be consistent with categorical variable values and use dropdown lists to help you do that. Be consistent with dates and use the ISO 8601 standard. Everyone gets a consistent, unique participant ID. Be consistent with files and file names: don't use spaces, and use very limited punctuation, just underscores and dashes. Take control of the formatting of your categorical variables with dropdowns. Include a code book, include notes on the overall study, and include notes on individual values and measurements gone wrong; both of these are metadata that should be on separate worksheets. Do no calculations, including implicit calculations; get the granular data. Use data validation to prevent errors. Protect your raw data: always keep backups, never overwrite your raw data, and write changes to a new data file. Document any and all changes to your data to create a reproducible audit trail, and use at least five tabs in your Excel workbook to separate the different functions of your data and your metadata. For even better research practices, you can make your research data more secure and your analysis more reproducible with tools like REDCap for validated, secure data entry, and with open source tools like R for data analysis, visualization, and reproducible reporting. If you can't access your data and reproduce your outputs, the tables and figures for your manuscript, three to ten years later, your research is not reproducible and is less trustworthy.
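To close that loop, here's a short R sketch, added for illustration, of how a script can re-check a workbook organized this way after the fact. The workbook, sheet, and column names are hypothetical, modeled on the layout just described (a data sheet plus a cat_var_lists sheet).

```r
# A minimal sanity-check sketch (hypothetical workbook and column names)
# that re-checks ranges, categorical lists, and stray blanks in a script.
library(readxl)

dat   <- read_excel("organized.xlsx", sheet = "data", na = "NA")
sites <- read_excel("organized.xlsx", sheet = "cat_var_lists")$study_site

# Numeric values outside the validated range (entries that slipped through)
dat[!is.na(dat$heart_rate_bpm) &
      (dat$heart_rate_bpm < 20 | dat$heart_rate_bpm > 300), ]

# Categorical values not in the allowed list (typos, stray capitalization)
setdiff(unique(dat$study_site), c(sites, NA))

# "Empty-looking" cells that are spaces rather than true NA
which(trimws(as.character(dat$study_site)) == "")
```

Checks like these don't replace validation at the moment of data entry, but they give you a repeatable, scriptable audit of what actually landed in the workbook.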
And if you're in medical research for the long term, go ahead and invest in yourself. It's going to take time to learn things like REDCap and R, but it will let you do better, more reproducible research. This talk was inspired by two important and free papers, both worth reading, and the links are here; feel free to open those up and take a look. Hopefully I've summarized them well, and I'd be happy to take any questions. We do have some upcoming R/Medicine webinars that are follow-ups to this: one on REDCap for a single-site study and one on REDCap for multi-site collaboration. If you want to take a peek at REDCap on your own, the REDCap site has videos to help you learn about it and learn how to use it. Thank you, and feel free to put questions in the chat.

So there's a question from Arthur: would you also record the acceptable values for numerical variables in sheet five? Generally, if it's a continuous variable, I would not. I would just decide on the low end and the high end of the range and use numeric validation. But if it's a categorical or ordinal variable, like a zero-to-three scale, it would be reasonable to make that truly a categorical variable and use a dropdown. Any other questions, or general feedback? We'd love to have feedback on this format. Was this what you expected? Anything that was helpful, or other topics you would like to hear about, we'd love to hear about it. Feel free to put questions or comments in the chat; it'd be very helpful. And that's all we've got. So feel free to give us feedback. We will have a recording up for folks who missed this, and we will be sending out and sharing the slides. So if you missed a little bit at the beginning, we should have that in the recording and in the slides. Thanks, everybody. And Daniela, I think we can stop recording at that point. Excellent.