So repeat those links for everyone. And since we're at 11:02 I'll go ahead and start the introductions for those of us still joining. This is the Cleaning Medical Data with R workshop, and we'll be going for the next three hours. I'm Peter Higgins, and I'm co-hosting with Crystal Lewis and Shannon Pileggi. Go ahead, you can go ahead to the next couple of slides. These materials are freely available under a Creative Commons Zero v1.0 Universal license. And maybe I should have Crystal first and then Shannon introduce themselves.

Hi everyone, I'm Crystal Lewis. I'm a freelance research data management consultant, and that basically means I help people learn how to wrangle, document, and share their research study data. So I'm happy to be here.

I'm Shannon Pileggi. I'm a data scientist at the Prostate Cancer Clinical Trials Consortium, and I'm also on the R-Ladies global leadership team. I'll say that all of our contact information is on the main page of the website, LinkedIn, Twitter, Mastodon, whatever else, so if you want to connect with us, definitely feel free.

And I'm Peter Higgins. I'm a clinician at the University of Michigan. I do gastroenterology in my day job, but I spend a lot of time worrying about reproducibility of data analysis. We are essentially addressing the challenge of the data on the left, which people often receive in the format of Excel files, and trying to turn it into tidy data on the right, as in Allison Horst's illustration. We're going to address some common problems and some common solutions, and hopefully give you a workflow that you can use in attacking messy data like this.

After this introduction, Crystal will be talking about principles of data management until a short break at noon. Shannon will do stage one data cleaning, we'll have a short break at 12:55, and I'll finish out with some stage two data cleaning, hopefully before two o'clock so we have time for questions. Please feel free to put questions in the Zoom chat; while one of us is talking, the other two of us will do our best to address questions in the chat, and at the end we'll have all the speakers available to address them as well. The exercises are all available on Posit Cloud; in the exercise instructions and in the links being put in the chat, you can get access to the Posit Cloud workspace so you can code along. It's important to get that login settled before we get to the point of actually doing the exercises, so if you can take a minute to do that now, it's probably worth it. Otherwise you can just watch along.

Shannon's going to handle some of the next parts on syntax, and we'll have a little bit of hands-on, interactive coding. One thing that you might see is something called the pipe operator. There was a pipe operator introduced in 2014, that's the percent-greater-than-percent (%>%), and then more recently, in 2021, there was a native R pipe introduced, which is the vertical bar with a greater-than sign (|>). You're going to be seeing that native R pipe in our code, just for your information in case you've never seen it. Isabella Velásquez has a great blog post on it if you're not familiar with it.
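Just as a quick reference, here's a minimal sketch of the two pipes side by side (the magrittr pipe needs the magrittr or tidyverse package loaded; the native pipe needs R 4.1 or later):

```r
library(magrittr)  # provides %>%; also loaded with the tidyverse

mean(1:10)       # the non-piped version
1:10 %>% mean()  # the 2014 magrittr pipe
1:10 |> mean()   # the 2021 native R pipe; read |> as "and then"
```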
Basically, what the pipe does is it allows you to pass a first argument into a function. So instead of calling some function and specifying the three arguments in order, that first argument can be moved up front, and you read that vertical-bar greater-than pipe as "and then." So you'd read it as: take argument one, and then do something. For example, on the right-hand side, instead of taking the mean of the vector 1 through 10, we take the vector 1 through 10 and then compute the mean on it. Another example of this is in Chapter 18 of R for Data Science, which goes into a little more detail about how the pipe operator can make your code easier to follow sequentially, step by step. So instead of having the nested statements that you see on the top, you have the more in-order statements that you see on the bottom.

Another syntax thing that you'll be seeing in our slides that you may or may not be familiar with is the concept of namespacing. Namespacing is when you use the double colon to indicate a specific function within a specific package. So it's package::function; for example, you could write dplyr::select. What that does is it tells R to explicitly use that function from that package. It can help you avoid name conflicts; for example, there's a select function in the MASS package and a select function in the dplyr package. And when you namespace like that, you don't actually need a library() statement to access that function. In these slides, what we have attempted to do is namespace functions that are not from tidyverse packages, so anything that's a general tidyverse function will just be written out as the function name, but if it requires some other package installation, we've tried to namespace it. So what that would look like, for example, for the select function: on the left-hand side we're calling library(dplyr) and selecting the columns mpg and cyl from the mtcars data set. Instead of calling library(dplyr), we could just say dplyr::select to achieve the exact same result.

Okay, thanks Shannon. Yeah, thank you. Just switching over real quick. So welcome to the first section of this workshop. We're going to spend the next 30 minutes or so reviewing some foundational principles of data management before we dive into actually writing R code. I imagine that many of you at some point in your career have either created or received a spreadsheet that looks like this, or maybe one that looks like this, or one that looks like this. A spreadsheet formatted like this, with color coding and various headers and notes throughout, can be really helpful when you just need to eyeball and review some information. But at some point you're probably going to need to actually analyze this data in some sort of statistical program, such as R. And if you've ever tried to read a spreadsheet like this into a program like R, then you've probably felt a little like this, maybe a little defeated. That's because all those colorful spreadsheets are formatted to be human readable; they're not necessarily formatted to be machine readable. And now you're going to have to spend hours and hours of time cleaning those spreadsheets before they can be analyzed in R. And that's what today is about. So we are going to first provide you a foundational understanding of how data should be organized for analysis purposes.
We will also briefly discuss how, if at all possible, you want to correct messy data at the source. And then last, the meat of this presentation will be a review of various R functions that will help you quickly and efficiently turn a very messy data set into a tidy and usable one. So first we're going to talk about data organizing principles associated with four ideas, and by reviewing these principles, my hope is that we will all have a shared understanding of how data should be organized. That understanding will help you strategically plan how to wrangle those messy data sets. We're going to talk about data structure, variable values, variable types, and missing data.

So, data structure first. Data should make a rectangle of rows and columns. You should have the expected number of rows, or cases, in your data, and the expected number of columns, or variables, and the intersections of those rows and columns are filled with values. You should have no more and no less than you expect as far as rows and columns. And when I say that, presumably you have some idea of what should exist in your data: either you collected the data yourself, so you have an idea, or you've been given a codebook or a data dictionary that tells you what should exist in your data. So if you have extra columns in your data, that can mean you have empty columns or unexpected variables. If you have extra rows, it can mean you have duplicate cases or empty rows. If you have fewer columns, it could mean you're missing variables, or maybe you dropped variables when you imported your data. And if you have fewer rows, that could mean you're missing cases. All of these things need to be remedied in a data cleaning process.

The second principle regarding data structure is that variable names should be the first, and only the first, row of your data. They should also adhere to best practices. Variable names should be unique, not duplicated. They should be meaningful: if you have a variable that represents gender, instead of calling it X1, call it gender. Don't include spaces in your variable names, and don't include special characters except for underscores. So no backslashes, no dashes, no exclamation points, no quotation marks. Also, don't start a variable name with a number or a special character. And these aren't arbitrary practices, they all serve a purpose. First, they make your variable names more interpretable and easier to work with. And second, they make your variables more compatible with languages like R. So for instance, R does not allow variable names to start with a number; it will actually give you an error if you do this. Another example is that R doesn't allow you to use dashes or hyphens in your variable names; they're considered subtraction or negation operators. So again, you'll get an error if you include those characters in your variable names.

So let's do our first exercise. Take one minute to review this messy data and look for any structure issues going on here. And remember, by structure I mean review how both rows and columns are laid out, as well as the variable names. If you find any errors, please type them in the chat; I'll try to monitor the chat over here. So I'm going to start this timer. Just drop whatever you see in the chat and we'll gather back in a minute. Good ones. Yep.
Oh, did your screen freeze again? Oh no, no, we haven't done anything. Oh, did you click the timer to start it? Yeah. You don't see it? I don't see it going. Sorry, everyone, that was my fault; it's showing on the other screen over here. Yeah, it just ran out of time. But it looks like everybody got similar things to what I was thinking about, so let's go ahead and review that together.

Okay. So I think a lot of people saw the same things that I saw. One was that the variable names are not the first row of the data; they're actually the seventh line down. The other thing is that our data does not make a rectangle: we have empty columns and we have empty rows. And then our variable names don't adhere to best practices, right? We have backslashes in some of our variable names, we have spaces in some of our variable names. I think we all saw pretty similar things, so those are all things we would need to deal with when we're cleaning up our data.

Okay, so the second thing is variable values. Variable values should be explicit, not implicit. We don't ever want anyone to have to guess what a cell value means. So if a blank cell is implied to be zero, fill that cell with an actual zero. No color coding should be used to indicate information; make a new variable instead. So if you're color coding in order to indicate treatment, for instance, instead make a treatment variable and add the values to that variable. Values should also be analyzable. This means no more than one measure should be captured in a variable. For instance, we don't want both weight and height in the same variable; it would be very difficult to analyze a variable with combined information like that. We want to split that information into two columns. Values should be captured consistently within a column: pick a format and stick to it. Dates, for instance, should be captured consistently. You can make a decision to always capture dates in the international standard format shown here, which is a really nice format to work with. Honestly, whatever format you choose, make sure that all dates are captured using the same format. You can click on this date in the slides to learn more about this specific format if you want. Categories should be captured consistently, both spelling and capitalization. So if you're capturing something like gender, you always want to spell male the same way, female the same way, and so forth. This allows your data to be easily categorized. And last, if a variable is numeric, the values should fall within your expected range. So if the range for a variable is 1 to 50, you shouldn't see values outside of that range.

So let's do another quick exercise; I will do the timer correctly this time. What variable value issues do you notice in our sample data? Please drop whatever you see in the chat. Yeah, that's a good one. Yeah. Yeah. These are great. You're catching it. And to Peter's question, yep, you are right on. Okay, so let's look at it together. I think everyone found all the issues that I found. So color coding is used to indicate information here, and actually the color coding perfectly aligns with the treatment variable already, so we can just ignore the color coding, which is nice; otherwise we'd have to add that information to the treatment variable. Then we have two things measured in one column, for both start BP and the pre-post weight.
So we would want to split those out into multiple columns. And then last, categorical values are being captured inconsistently in both our ethnic and our race variables, so we would need to recategorize those.

Okay, variable types. Variables should be stored as your expected type, or in the R world we may refer to this as the variable class. Let's review a few of the variable classes you might work with. The first is numeric, and this contains numeric values; it could be any of these numeric values here. Numeric variables cannot contain special characters, spaces, or letters. If your variable contains non-numeric values, the class will be character, and you'll no longer be able to perform calculations on your numeric variables: no means, ranges, and so forth. The next one is date, or date-time, represented in R as one of these formats you see here. Because of the way dates are stored in R, they allow you to perform calculations using your dates, which is cool: you can add dates, you can subtract dates, and so forth. So as long as your dates are stored as dates, you are good. However, if your dates are stored as character values, you're not going to be able to perform those calculations. So it's important to check your date types when you read your data in; sometimes they're read in as dates, other times they could be read in as character values or numeric values, and it's important to be aware of this. Another one is character. These contain character values, or strings; it could be any of these kinds of examples. You can even store numbers as characters, but remember, if you want to analyze those values as numeric, you will need to change them to a numeric format. And the last one we're going to talk about today is the factor. This is a special class of variable that's helpful when working with categorical or ordinal variables. Factors assign an order to your variable groups, so they're really useful for ordering groups in tables or graphs or models. But you must assign this class to your variables: when you read in your data, your character variables will not automatically be assigned as factors, you'll need to assign that class yourself. And lastly, because factors can be a little tricky to work with and understand, you can learn more about working with factors in this article I've linked here from Amelia McNamara and Nicholas Horton.

Okay, so one quick exercise: what is the R class for the following variables? So this variable right here, if somebody just wants to drop it in the chat and tell me what this variable is. I'm not stumping anybody today. Yeah, it's character. And I don't know if you can see why, but it's because there's a space right before the 7.5 here. I've worked with data where this has happened and it stumped me for quite a while; I couldn't figure out why it was a character, and it was one little space before 7.5. So watch out for that. The next one, what about this variable? I think most people are getting it. Yeah, so if we ran class() on this in R, it's a factor, because it's got these levels associated with it, and that's what makes it a factor. And then our last one here, anybody want to tell me what this one is? Yeah, character, right? Because these letters are messing up my number values, so it's currently a character variable. Great. One more exercise: what variable type issues do you notice in our sample data? Again, we'll do one more minute.
I see you dropping things in the chat. Yeah, you're seeing the things that I would see; I think everyone caught the same issues I was seeing, so let's go ahead and review this together. A lot of people saw that one of our date variables, the start date, is being stored as a number instead of a date. And then we have a lot of text being stored in our numeric variables, and that's going to make all of these character variables instead of numeric, so we're going to need to deal with all of those issues. Even this one variable that has a single character value is going to make the whole column character now. So we will need to deal with those.

The last thing to talk about is missing data. Missing data should appear as you expect it to, both in the amount of missingness as well as the variables and cases that data is missing for. There are varying opinions on how missing data should be assigned. Some people think that missing data should be explicitly assigned an extreme value like -99 or -999; that way you know the cell wasn't just skipped over by accident when data was being entered. Some people prefer to just leave the cell blank, to not cause confusion by adding these extreme values to a variable. I imagine within this audience we have varying opinions on missing data, and all I want to say is I have no preference for which method is used. I do think it's important to be aware of the problems that can be caused by adding extreme values to your data: if you do add -999 as a missing value, be aware that someone could accidentally interpret that as an actual value, leading to bad results. Ultimately, what I think is important is that you use consistent values to represent missing data, so choose one and stick with it. Then make sure your choices are documented so that future users know how to interpret the values in your data. And last, you want to make sure your missing values match your variable type: if you use text to define missing values in a numeric variable, that variable will no longer be considered numeric. So be aware of those kinds of issues.

So, what missing data issues do you notice in our sample data? We'll do one more minute. It looks like everyone found the same issues that I found as well. So the first thing I noticed is that I have some unexpected missing data: row 23, for example, maybe that's a case I wasn't expecting to be missing data for entirely, so I would need to look into that. Then I have inconsistent missing values: sometimes "not done" and sometimes -99 is being used for numeric variables, so it's inconsistent. And then the missing values don't match the variable type for a few of the variables; we have character values being used for missing data in numeric variables.

Okay, so the number one way to reduce data errors is to make a plan before you collect data. If you have the luxury of being able to collect your own data, you want to make sure you spend time planning so you can correct data at the source. You want to plan the variables you want to collect, you want to build your data collection or entry tools in a way that follows your plan, you want to test your data tools before you collect or enter data, and you want to check your data often during data collection or entry. So let's walk through each of those. First, plan the variables you want to collect.
You want to plan ahead of time how you're going to name your variables. You want to plan what the labels for those variables will be, or the item wording. You want to plan the allowable variable type, numeric, character, and so on. You want to plan the allowable values or ranges for each variable. And you want to plan how you're going to assign missing values. The items on the right here can also be really helpful to plan for, if you're collecting something like survey data in particular; these items will help you better understand when and why you might have missing data for items. The first is skip patterns: are there skip patterns for any items, and what is the logic for those items? The second is required items: are items required, or are people allowed to skip them? That helps you decide whether missing data is planned or not. And then variable universe: what is the variable universe for each item? Will the whole sample get each item, or are some items only shown to a subsample of your group?

Then you want to add those variables to a data dictionary. A data dictionary is a rectangular, formatted collection of names, definitions, and attributes for the variables in a data set. You set it up similar to a data set, with a row and column layout and variable names in the first row. So in this data dictionary here, I have my variable names, the labels associated with those items, the types that are allowed for each item, the allowable values, and the missing values I plan to use.

Then you want to build your tool based on your data dictionary. By tool I mean whatever program you use to collect or enter data; that could be Excel, REDCap, Qualtrics, or something else. Be aware that not all data entry and collection tools are created equal, so make sure to consider the limitations and the strengths of your tools. You want to consider things like how your tool handles data security and privacy, versioning, and data validation. We're not going to explore different data collection and entry tools today, but if you want to learn more about the strengths and weaknesses of various tools, specifically the limitations of your tools, you can watch the video that's linked down here at the bottom from a previous R/Medicine workshop.

Once you've chosen your tool, you want to build your data collection tool based on your data dictionary. You want to name your variables correctly in your tool: instead of Q1, Q2, Q3, name those variables id, start_date, treatment. This reduces confusion during data entry and it also means fewer data cleaning steps when you start. You want to build items to only accept allowable values. If you're working with numeric items, only allow values in a specified range, for example 0 to 50; you can set these validation rules in your tool so that if someone tries to enter 51, it will say, hey, this value is not allowed. If you're working with categorical items, only allow values in specified categories; here it can be really helpful to use something like a dropdown menu instead of open text boxes to make sure you're only collecting allowable values. And then you want to build items to only accept specified variable types, so only numeric, or only dates in a specified format. Again, you can set these validation rules in your tool so that a warning pops up when unexpected formats or types are entered.
And then you want to test your tool, so collect or enter some sample data, and check that sample data. Are any items missing? Are you getting unexpected values for items? Are any values out of range? Do you have incorrect formats or inconsistent entries? Is the skip logic working as expected? Are people able to skip items that they shouldn't be able to skip? If you find anything wrong, fix it in your tool before you begin to collect or enter data.

And then, last, you want to review your data often during data collection or entry. One option for reviewing your data is to write code in a program like R to validate your specified criteria. So you can write code to validate that variables are your expected types, fall in expected ranges, that IDs are not duplicated, and so forth. Here are a couple of R packages that have functions specifically for validation purposes, and they export really helpful reports for you to review. There will actually be a presentation during this conference on the last package on this list, this one right here, I believe on Thursday at 3 Eastern. And at the very bottom is a link to a great talk on validation that reviews all of these packages and more; I highly recommend watching it. But here I'm showing a very brief example of how I might set up some validation criteria using the pointblank package, and I could run this on a recurring schedule during data collection or entry to make sure everything is being collected as expected. The things I'm checking in this function are: do I have distinct IDs? Do I have missing IDs? Are my columns the expected types? Are they falling within my expected ranges down here? And when I run this code, I receive this report that assures me that everything is being collected as expected, except that there are two variables that are failing: my date variable is not being collected in a date format, and my ethnicity variable has collected some unexpected values. This is something that, if I caught it early enough, I could go and fix in my tool, because if I don't fix it I could end up with a lot of missing data, or I might even end up with data that's completely unusable if I've collected values I can't interpret at all. And you can see that they're failing in this report based on the failing numbers here.
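To give you a flavor of what that kind of check could look like, here's a minimal sketch using the pointblank package; the column names (pat_id, start_date, ethnic) and the allowed values are just placeholders for whatever your own data dictionary specifies:

```r
library(pointblank)

# df_raw is assumed to be the data frame you want to check
agent <- create_agent(tbl = df_raw, tbl_name = "study data") |>
  col_vals_not_null(columns = vars(pat_id)) |>  # no missing IDs
  rows_distinct(columns = vars(pat_id)) |>      # no duplicated IDs
  col_is_date(columns = vars(start_date)) |>    # expected type
  col_vals_in_set(
    columns = vars(ethnic),
    set = c("Hispanic", "Not Hispanic")         # allowable categories
  ) |>
  interrogate()

agent  # printing the agent shows the pass/fail report
```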
A second option for reviewing your data during collection is to create a codebook. Codebooks provide descriptive variable-level information as well as univariate summary statistics like means, ranges, and counts. There are several R packages that automate the creation of codebooks; I'm showing four here, but there are many more. Unlike validation, where we write code based on individualized criteria, for the most part these packages provide similar out-of-the-box summaries. Both the validation and the codebook methods give you great information to help you better understand whether your data looks like you expected. So here is one example using the codebook R package, and if I run this function you can see it gives me a codebook that looks like this. At the top it provides some overarching data set summary information, and then it quickly jumps into that variable-level information, including summary statistics. And I can once again see here that I'm having issues with my start date variable: it's being collected as numeric, and my ethnicity variable has some unexpected values, and I would want to go and correct that at the source so I can fix the issue sooner rather than later.

The other nice thing to know about codebooks is that they are even more useful when you're working with data that contains embedded metadata, like variable and value labels. You often see this with data you get from, say, SPSS or Stata; when you're working with labeled data, those labels are displayed in your codebook. So for instance, if this data were labeled, you would see variable descriptions under each variable section that describe what each variable represents. Under our pat_id, for example, you might see a label that says "patient unique identifier," and that descriptive information would help me better interpret the data. But as you can see, these codebooks work fine without labels too.

Okay. So all of these practices we just covered obviously happen in an ideal world where we have autonomy over how data is collected, but there are still going to be situations where you are handed data and had no control over the data collection or entry process. Or maybe, even if you did collect your own data, despite your best efforts to collect or enter clean data, you still ended up with data that contains errors. So for the remainder of this workshop we will be working through a sample data set to both identify and resolve issues, leaving us with a usable, tidy data set that's ready for analysis.

Here's our scenario. We have data that originate from an observational study comparing three treatments for ulcerative colitis, and we have this analysis question: are there differences in the change in Mayo and QOL scores between start and finish, and are the decreases in scores greater for any of the medications? In order to answer this question, we've asked a medical student to extract data from the medical record into a spreadsheet. Along with our spreadsheet of data we're provided a data dictionary, which is great, and as we start to review the data we find a number of errors that need correction. So we have this scenario, and believe it or not, we are not going to jump right into loading the data into RStudio. The first thing we really should do is open the Excel file and review both the data and the data dictionary. While it seems kind of low tech, it's actually really important to learn what you're getting into before you read the file into R. Both the data and the data dictionary are in the same file. So let's take five minutes to log into Posit Cloud, navigate to our project and then to our data file, and open and review both the data dictionary and the data to see what's actually going on in our file. And I also just want to say, throughout the workshop, when you finish with an exercise go ahead and give us a thumbs up; if you're having trouble with an exercise, give us a wave, or drop something in the chat, or even just unmute yourself and chat with us. Hopefully most of you have made accounts with Posit Cloud and gotten into our project, but you can navigate to all of that through our website here, so click on this and I'll give it up to five minutes. So I'm going to go ahead and start this timer, and then just let us know what questions you have. I think this medical student was not given the guidance that we just covered.
So mine looks a little different than yours, but let me get to ours. Yeah, this is what I wanted to show you. If you can get into a copy of our project: if you're on our website, you can click on this link here to get to Posit Cloud, and then once you get into the assignment it should look something like this. You can see the environment over here and the files. Maybe you want to do Ctrl+L to clear your console and restart R to clear your environment. Yeah, let's just clear all this. So it should look something like this. And then just click on that data folder; that's where you're going to find your Excel data. If you click on this, this is where it's at, and I think you can say View File and it should just open over here; it won't be named quite the same, but that's okay. Someone's mic keeps carrying; it could be me, I don't know, I hear it too, I'm not sure where it's coming from. Don't hesitate to let us know if you have questions; that's a lot, kind of getting logged in and getting into the project and all that stuff. Oh, I think Nancy's done. Yeah, we're doing fine here. We can always give a little more time; I'm seeing some thumbs up. Okay. Well, I saw some thumbs up, so feel free to keep working on getting logged in and all that, and we will move forward. Shannon and Peter, just let me know if I need to stop for any reason.

Okay, so hopefully those of you who were able to get into the data file noticed a few things upon reviewing our data, which I think Peter alluded to in the chat. One, our variable names are not the first row of our data; there are six rows of irrelevant information before we finally get to the variable names, and that's really important to know before we try to import our data. The other thing to know is that our data is not in the first sheet of our Excel file; it's actually the third sheet over, and the name of the sheet is "data." That's also important to know, and that's why we wanted to get in there and start reviewing things.

So we are now ready to import our data, and the first thing we would want to do is open an R script, or an R Markdown file, or a Quarto file, and we are going to use the read_excel() function from the readxl package in this example to import our data. There are several arguments to consider when using this function. The first one is path: this is where we list the path to our Excel file. The next argument is the sheet argument: we can add the name or the position of the sheet to read in. The next argument is col_names, with a default of TRUE, and this just says: should R grab the column names from the first row of your data, yes or no, true or false? The next argument is the na argument, and here it's saying: are there any values that R should read in as NA? For instance, if we had those -99s as our missing value indicator, we could tell R to convert all of our -99s to NA when we read in the data if we wanted to. And then the skip argument is asking: what's the minimum number of rows R should skip before reading anything in? There are actually a lot more arguments; you can type ?read_excel in your console to see more arguments for this function.
So as I mentioned before, the very first thing we would do in the real world is open our R script, or R Markdown, or Quarto file and save our code as we write it, so your RStudio pane would look something like this. And then we would use the read_excel() function to read in our data. Notice that I'm giving you a hint here that in this situation we would want to use both the sheet and the skip arguments. I'm not showing you the full code here because we're going to do this in an exercise in just a moment, but do pay attention to the fact that I'm putting quotation marks around the sheet name and that they aren't needed for the skip argument. And once we read in our data you should get something that looks like this; this is just a small portion of the data, but you'll get something that looks like this.

So it's your turn. Hopefully you're now in Posit Cloud in our project. Take three minutes to actually import the data. You'll want to find and open that preloaded Quarto file called exercises.qmd, navigate to the first exercise, update the code in that chunk, and then you can run the code chunk using the green arrow on the chunk, for anybody who's new to R Markdown or Quarto. So let's do three minutes for that, and I'll try to pull it up so you can see what I'm talking about as well. It should look something like this, and you should be able to navigate to cl1 here, and this is the green button I'm talking about to run the code chunk once you update this section. And then again, just give us a thumbs up when you're done, or a hand wave if you need help, and I will try to see what's going on. I'm seeing some thumbs up, that's promising. Please don't hesitate to ask if you need help or if you're stuck. And if you do go to our website, the solution to the exercise is there too if you are stuck. You're right, Peter, yeah, there are two different ways to do the sheet argument, absolutely. On the website the solutions have a code fold option on them, so if you want to see the code you'll actually have to click the little triangle beside the solution; that's to challenge you. You've got a little less than a minute left to wrap up.

Okay, awesome. To keep moving forward we will move on, but feel free to keep communicating with Shannon and Peter in the chat. Can you just show them the solution real quick? Yeah, of course, just that one so we can move forward in the workshop. Good call. Here we go. I'm sorry, the exercises.qmd is not opening for me. I'm not sure why it would not open; are you clicking on it in Posit Cloud? Yes. Of course, I'm already disconnected. So you're in this area and you're clicking this one and it doesn't open? That stumps me, I'm not sure. Okay, that's fine. Yeah, I apologize, I'm not sure about that. But all the exercises are on the website as well, so if you can't get it to work in Posit Cloud you can still review this. So here's the code chunk, and if you review the code fold here, this is the solution. You can put in the name of the sheet, and as Peter put in the chat, you could also use the sheet number; it was one, two, three over, so you could put in sheet = 3 instead of the name of the sheet in quotation marks, and either would work. And we skip the first six rows to get to the variable names, and remember that col_names defaults to TRUE to pull the column names from the file, so we didn't even need to write that argument in here.
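So that the full call is in one place, here's roughly what the import looks like; the file name below is a placeholder, so swap in the actual name of the Excel file in your data folder:

```r
library(readxl)

df_raw <- read_excel(
  path  = "data/messy_uc.xlsx", # placeholder file name; use your own
  sheet = "data",               # the data live on the third sheet, named "data"
  skip  = 6                     # skip the six rows above the variable names
)
```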
Okay, so now that we've imported our data, it's time to start reviewing that data. To quote the authors of R for Data Science, EDA, or exploratory data analysis, "is not a formal process with a strict set of rules; more than anything, it is a state of mind," and during the initial phases of data exploration you should feel free to investigate every idea that occurs to you. So what does this mean? It just means there's no one way to review your data; there are endless ways to figure out if there are errors in your data, so try any ideas you can think of. With that said, there are still some common steps you can at least start with. I would say after you read your data into R, you're going to once again use that old-fashioned method of opening up your data and looking at it, to see if everything imported as you expected. After that you can start to run some functions to review your data for the following things: how many rows do I have, how many columns, what are the variable types, what are the variable values, are my values falling within expected ranges or do I see outliers, is there a lack of variation, how much missing data do I have, and how are variables related? For that last one, do things like bivariate plots: if one variable is high, is the other low, and is that expected? There are several functions you can use to explore your data; these are seven examples here, and there are many more, but this should get you started.

So here's an example of exploring our data using the dfSummary() function. If you run it, this function provides some overall summary information, including the number of rows and columns, as well as variable-level summary information like variable type, value frequencies, and histograms. Here's another example, exploring the data using the skim() function. This function provides similar information to dfSummary(), just formatted a little differently. It also provides that overall data summary information at the top, the number of rows and columns, as well as variable-level summary information such as variable type, completion rate, values, percentiles, and histograms. A quick clarification about this function: you'll notice it provides a min and max value for some character variables, right here, and you might be like, what the heck is that? This is actually the min and max character count for each variable. While this may be a little confusing at first, it can actually be really helpful information.

Okay, so our next exercise is to use one or more of these exploratory packages to review your data, and then start to jot down what fixes you see that need to happen. Let's take five minutes to do that before we move on. That will be the cl2 exercise. My connection keeps disconnecting, but if you just scroll down a little bit you get to this blank chunk, and you can just type in one or more of those functions and try them out. I'll leave this up here so you can see the different functions. If you have questions, please let us know. I see somebody has a question. "Hi, in the last exercise there was a 'new names' message that showed up in the output; what did that mean?" Oh, it said "new names" underneath when you ran it? Yeah, I think that's because there were spaces in some of the variable names, or character values that aren't allowed, special characters that aren't allowed, and those get renamed in the import process. Since it doesn't allow spaces, or whatever the special characters are, like a dash, it will replace those with its own kind of allowable name, if that makes sense.
So it's saying, "I gave you new names," not "please give me new names"? Yeah, exactly. It said, hey, check your names, because they're not as expected; they're no longer what you thought you were reading in, they're a little different. Awesome, thank you so much. You're welcome. And Shannon is mentioning that she's going to talk about renaming too, because even though R renamed them for you, you still probably aren't going to like the names and you'll want to rename them yourself. Right, thank you.

Shannon, should we go ahead and take a break and then people can keep working? That seems reasonable; we're at 11 anyway, so if people are done and want to take a break, this might be a good time for the bathroom, coffee, or whatever. Yeah, I'll put up a break timer so that people know when we will resume, so feel free to use this time to either take your break or continue with this exercise. Okay, I'll stop sharing so we can swap. All right, our timer is up, so feel free to use this time to continue with the exercise or take your break, and then we will resume the normal content at 12:05.

All right everyone, I hope you had a good quick break, and it is time to get started. I am going to be talking for the next 15 minutes or so about stage 1 data cleaning, and then you will get another 5 minute break before Peter Higgins' section on stage 2 data cleaning. Just to illustrate how we would have done the last exercise, where the objective was to use an exploratory function: you can grab all of these code solutions right here, use this little icon to copy to clipboard, and then you should be able to copy and paste them directly into Posit Cloud. You will notice that Posit Cloud does time out after a little bit of non-use, but you can certainly get it back up and running quickly. So I am just going to paste all of those code solutions into my R chunks. There was a question in the chat about whether we put our code into these brackets; yes, this signifies the beginning of an R chunk, this signifies the end of an R chunk, and then we can just click this green arrow to run all of the code, and it is going to produce the results for all of those different exploratory analyses, which unfortunately we don't have time to review in detail, but they are there for you to review, and the Posit Cloud instance will stay open if you want to use this as a little playground.

I am going to go back to full screen and navigate to my section on stage 1 data cleaning. I noticed a great question in the chat: someone noticed that there was messaging when we initially imported the data set, and that was a great catch. When we saw our original variable names in Excel, this is what they looked like, and then when we imported them we saw some slightly different variable names. One of the differences we see here is that instead of a blank column, this is imported as a column named ...7 in the data set, and I think that's the only change readxl made for you. One of the reasons it does that is just to ensure uniqueness of variable names. You can see here that our variable names still aren't entirely consistent, and they're also kind of difficult to work with in programming: variable names with white space in them are challenging to work with, and variable names with special characters can be challenging too.
So our variable names are still not ideal; this is how they were imported with readxl. One of the very first things that many of us do when we get started with a new data set is use the function from the janitor package called clean_names(). What clean_names() is going to do is remove special characters and implement snake case by default. So again, we're just taking our raw data frame and piping it into janitor::clean_names(), and you'll see we have a few differences: instead of that variable name made of periods (...7), it's translated to x7; instead of white space we have underscores; and everything is converted to lowercase and separated by underscores instead of special characters. Underscores are the only characters, in addition to numbers and letters, that are allowed with janitor::clean_names(). So it gives you a better base for starting your programming.

Another thing that we saw in our data frame already is that we have empty rows and columns. For the next few slides of stage 1 data cleaning, I'm always going to set this up in terms of the problem we have, how we can address it with a solution, and then how we confirm that our solution worked. So: problem, solution, confirm; problem, solution, confirm. Just to show you the problem, what I'm doing is using some tidyverse functions to only show you a few columns in the data set: I'm selecting the patient ID column as well as everything between race and start_bp, and I'm slicing the rows so I'm only showing you rows 13 through 18. What you can see here is we have our blank column here and our blank row here, which is not ideal, but honestly, I received a data set at work this week that has data just like this; it is normal. A solution to this is to use another janitor function called remove_empty(). With remove_empty() you can specify what needs to be removed; in this case we're going to remove both empty rows and empty columns. You can see that when I do the same selecting and the same slicing, we no longer have that blank column and we no longer have that blank row. Another way you can confirm this is with the glimpse() function. The glimpse() function allows you to get a quick overview of your data set: for our original data set you can see we had 31 rows and 38 columns, and now that we've removed our empty rows and empty columns, we're down to 30 rows and 37 columns.
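Put together, those first two cleaning steps might look something like this; a sketch, assuming the imported data frame is called df_raw:

```r
library(janitor)

df_clean <- df_raw |>
  clean_names() |>                          # snake_case names, no special characters
  remove_empty(which = c("rows", "cols"))   # drop fully empty rows and columns

dplyr::glimpse(df_clean)                    # confirm the new dimensions
```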
Another problem we have already identified in this data set is the need to do some recoding. When we count the different values of our ethnic variable, you can see we have inconsistent coding: Hispanic is coded three different ways and not Hispanic is coded two different ways. A solution to address this problem is to use the mutate() function from the tidyverse, which allows you to create a new variable. In this case I'm creating a new variable called ethnic_clean, a clean version of the ethnic variable, by using a case_when() statement. What we're doing is laying out specific conditions on each row for recoding values. We're going to read this as: when the original variable ethnic takes on one of these three values (%in% is the operator that says it has to be in one of these three values exactly), so if it's lowercase hispanic, uppercase Hispanic, or this misspelled version of Hispanic, we're going to recode that to be "Hispanic"; if the ethnic variable is in either of the two ways not Hispanic was written, we're going to recode that as "Not Hispanic"; and then another option we're going to use in our case_when() is the default value. If condition one is not satisfied and condition two is not satisfied, then whatever rows are left over should take on the default value, the original ethnic variable. So when we count our ethnic_clean variable, we should only see two values, Hispanic and not Hispanic; we should not see the original mixed values. And I believe it's very important to maintain both your original variable and your new cleaned variable as separate things, that is, not to overwrite your original variable, because if you overwrite your original variable you have no way of confirming whether your recoding worked correctly. The way you're going to confirm that your recoding worked correctly is to count the two variables side by side. So I'm counting my new ethnic_clean variable against my original ethnic variable: there are three ways you could be coded as Hispanic and two ways you could be coded as not Hispanic, and that confirms that my recoding worked correctly.
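As a sketch, with the raw spellings here standing in for whatever mixed values are actually in the ethnic column (and assuming dplyr 1.1 or later for .default), that recode looks something like this:

```r
library(dplyr)

df_clean <- df_clean |>
  mutate(
    ethnic_clean = case_when(
      ethnic %in% c("hispanic", "Hispanic", "Hispnic") ~ "Hispanic",     # assumed raw spellings
      ethnic %in% c("not hispanic", "Not Hispanic")    ~ "Not Hispanic", # assumed raw spellings
      .default = ethnic   # leave anything else as the original value
    )
  )

# Confirm the recode by counting old and new values side by side
df_clean |> count(ethnic_clean, ethnic)
```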
So now we're going to go to exercise sp1, in data cleaning fundamentals; sorry, that should say stage 1 data cleaning. I'm going to go over to Posit Cloud, and I just want to highlight that you will need to run some things to get started: you want to make sure that your df_raw has been imported, so you're going to specify your data file, the sheet, and skip = 6, and run this to make sure you have df_raw in your global environment. I'm going to go ahead and clear this out so that it looks a little cleaner. You also need to create your df_clean; that's our base to get started with, so I'm going to go ahead and submit this chunk that's going to remove the empty rows. Now we have our df_raw and our df_clean, and your objectives for exercise one are to explore the values of race and identify the problems, to clean the values of race by creating a new variable called race_clean, and to confirm that race_clean is coded correctly. So I'm going to hop back over to the slides and start the timer.

Right, that is time, and we had some great questions about what the .default argument actually does. To make this quick and easy, I'm going to pop over here, grab the code solutions, and copy and paste them into Posit Cloud. First of all, I'm going to count the values of race to find the problem and explore what's going on. As you can see, we have some mixed coding of certain values, and we want to combine African-American and Afro-American, and some other values as well. To illustrate more concretely what the .default does, I'll show it to you with and without the .default option. I'm just going to copy and paste a couple of things over here in the interest of time, and I'll get out my confirm code and put it in here. What I'm going to do, just to show you what's happening, is comment out the .default so it doesn't actually execute. Let's run this. So our solution here is to use a case_when() statement: we're going to create race_clean, we're going to assign these two values to African-American and these three values to other, but without a default. It might require something after the comma... it doesn't, okay. And now we're going to see what happened without a default. All we did in race_clean is evaluate two specific conditions: African-American and Afro-American got recoded to African-American, and these three values got recoded to other. But you see, because we only evaluated those two conditions, the remaining values were not recoded in our race_clean variable, and basically we just want to transfer those values over from the race variable. So the race values themselves should be the default for race_clean when these conditions are not satisfied. When we put .default = race back in our case_when() statement, run this, and then rerun our cleaning, now we can see that all of the values are assigned in race_clean and they are assigned correctly: there are two ways you can be coded as African-American, there are three ways you can be coded as other, and everything else is left as is. Hope that clarifies; thanks to the people that had questions about the .default. In case you're watching along later, there was some great discussion in the chat; people didn't know about the .default argument, and it's newer, from a recent release of dplyr.

So another issue that we are going to be facing in our data is that we need to replace some values with missing. For example, we already identified in Crystal's portion that there were some -99s in our data which actually should be coded as NA. We can see these again by just counting the values of that variable. In this case the variable itself is named end_na, which actually has nothing to do with missing values; I believe it's the ending value of sodium for this particular subject. But here we have a -99 value, and you can also see, if you do some visual exploration, that this is really inconsistent with the other values in the data set, and that should be a flag for you that something weird might be happening. So hopefully you can go back to your source documentation and find that, oh yeah, -99 represented missing values. A solution here is to use the na_if() function. What it's going to do is take any values in the end_na variable that are taking on a value of -99 and convert them to a proper R NA (thank you for clarifying that in the chat, Peter). Notice again that I'm still using that strategy of creating a clean variable and leaving the original variable as is, and that's so we can explicitly confirm that our code worked; it's a really important strategy. We're going to again use a counting technique to confirm that things worked, so when we count these values side by side, -99 is explicitly coded to NA. This would be even clearer if you had more than one value of -99; you would see a count of two there or three there, and that's why counting works so well. And then if we look at our histogram and our ranges of values, you can see there's nothing greatly out of range, and our visual of this variable looks much better. So we're going to use na_if() to replace values with missing.
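In code, that step is a one-liner; a sketch, assuming the ending sodium column really is named end_na:

```r
df_clean <- df_clean |>
  dplyr::mutate(end_na_clean = dplyr::na_if(end_na, -99))  # -99 becomes a proper NA

# Confirm: -99 shows up as NA in the cleaned column
df_clean |> dplyr::count(end_na_clean, end_na)
```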
Another thing that you'll see here is an incorrect variable type. Crystal talked about how variable types, or classes, are super important when you're working in R, to make sure you can do what you need to do with the data. Here we have end_mayo, which, correct me if I'm wrong again, Peter, represents the end Mayo score for these subjects. You can see that for some reason it looks like a number, but we have these quotes around it, so it's clearly read in as a character, and our glimpse() function is telling us that the end_mayo variable is indeed a character. And there are consequences to this: a consequence is that we actually can't take the mean of that variable, because it's a character. Even though it looks like numbers, you can't take a mean of something that's stored as a character. And if we look at all of the values of this end Mayo score and hunt through to figure out why these are stored as characters, this is all of them just printed out, and you can see that one of them was coded as "not done," which is why it imported as a character even though the other values look numeric. (One second, I'm going to mute one person.) So that's the problem; that's what happened in our data. Let's look at our solution. Again, I'm going to create a clean version of this variable: I'm going to say the value should be NA if it takes on a value of "not done," and then I'm going to pipe that result into another function. All the first step does is replace that "not done" value with NA, but we still have the problem that the variable is character, so then I pipe it into the as.numeric() function, and that's going to coerce the variable to now be numeric. When we look at our confirmation, comparing the original variable with its cleaned version, originally it was character and now it says it's double, which is a type of numeric storage. We are actually able to take the mean of the cleaned variable because it's now stored as numeric, and we can also again count all of the values to confirm to ourselves that the coding was done correctly; even though you can't see it here, if you were to scroll through all of the values you would see that "not done" was coded correctly to NA.

We also need to correct dates. Dates in the wild can be wild; when you get them into R, weird things can happen, especially when you're going from Excel to R. We've already looked at the start date variable, and we noticed it does not look like a date at all: it is read into R as something called double, which is a numeric representation, and when we look at all the values we can see we still have these large integer values representing our dates. So let's look at our solution. There is a very handy function in the janitor package to handle these date conversions, specifically for going from Excel numeric representations to actual proper date variables. We're going to create a new variable called start_date_clean and convert it to a date using the janitor convert_to_date() function. When we confirm, if we look at start_date and start_date_clean, we can see that this 44,208 is converted to January 12, and when we count our two variables side by side you can see how the conversions lined up. If you're being super careful, I would look into the numeric representation of dates in Excel and actually do one by hand to confirm to yourself that the conversion is happening correctly.
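Here's a sketch of those two fixes together, assuming the end Mayo score column is named end_mayo and the start date column is named start_date:

```r
library(janitor)

df_clean <- df_clean |>
  dplyr::mutate(
    # "not done" becomes NA, then the remaining text numbers are coerced to numeric
    end_mayo_clean   = as.numeric(dplyr::na_if(end_mayo, "not done")),
    # Excel's numeric date representation becomes a proper Date
    start_date_clean = convert_to_date(start_date)
  )

# Confirm the conversions side by side
df_clean |> dplyr::count(start_date_clean, start_date)
```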
Another thing that we might want to do is extract numbers from text. So we're back to our start_na variable, and when we do a glimpse of this column we can see it's stored as a character, because it has character text in it; again, the consequence is that you cannot take the mean of a character value. If we scroll through all of the values, you can see there's no missing data problem — there's a proper NA in there — but all of these text strings are throwing it off. The solution, again, is to create a new clean version of that variable, and you can use the parse_number function. What parse_number does is extract only the numeric values from that field. Then when we go to confirm, we can see that our start_na_clean variable has a type of double — a numeric assignment — and that it only has numeric values; you are indeed able to take the mean of it now, and you can confirm that the coding worked correctly, again with your friendly count function, so 133 is coded as 133.

All right, so now it is your turn again to complete an exercise. The exercise in this case is to explore the type and values of start_plt, to create a new variable called start_plt_clean that corrects any unusual values and assigns the correct variable type, and then of course to confirm that coding. You can drop down to SP2 to attempt that exercise.

All right, that is time. We see a lot of excitement over the parse_number function, so let's go over to our start_plt variable. Let's explore the values — there are of course different ways you can explore them. You might want to count the different values, and we can see here that it takes on some mixed string values, and unfortunately, in addition to those, it also has this one assignment of "Clumped", which should be represented by an NA as well, so we have two different problems going on. Another thing you might want to use is the glimpse function, which is always handy, so we'll take a — I'm not sure what I did wrong there; if you can see my error, please let me know. I'm going to move on in the interest of time. Is it using the wrong select? Somebody had an issue with that earlier. Shannon, maybe just go ahead — yes, it's using the wrong select, also unfortunate; it might have been the order we loaded the packages, thank you for that. So again there's a namespace conflict, and that's why it wasn't working appropriately. We can see that it's a character value there, and then of course we also want to take a look at all of the values; we can pull them out using bracket notation and see everything we have to work with.

And so our solution — someone posted one in the chat, but it might not be the right solution; I'm not sure what that comment was, but let's go ahead and see. We're going to create a new variable, and I just started off with parse_number to see what would happen. I know we have that character string of "Clumped", and parse_number might make the correct assumption that "Clumped" should be NA, since there's no number to parse there, so let's see what happens. We've created our new df_clean with start_plt_clean, and it says we had a warning: there was one parsing failure where we expected a number but saw a text string, "Clumped". It's not an error — it doesn't stop the R code — but it is a warning, a nice little message. Oh, the chat says we need to add na equals "Clumped"; let's open up the help file for parse_number and see if that helps. (There were also questions about what package parse_number is in — parse_number is in the readr package.) And look, it has that handy na argument: it says we can specify NA values, and that would solve the problem. So we say na equals "Clumped", we run this, and now the warning disappears because we've explicitly specified our na option. So thank you for that suggestion, Hannah.
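Put together, the fix Shannon lands on might look roughly like this (df_clean and start_plt follow the names used in the walkthrough; the namespaced dplyr::select is there because of the MASS/dplyr conflict just mentioned):

library(dplyr)
library(readr)

df_clean <- df_clean |>
  mutate(
    # parse_number() keeps only the numeric part of each string;
    # na = "Clumped" tells it to treat that text value as missing
    # instead of raising a parsing warning
    start_plt_clean = parse_number(start_plt, na = "Clumped")
  )

# confirm: count old versus new values, and check the new column type
df_clean |> count(start_plt, start_plt_clean)
df_clean |>
  dplyr::select(start_plt, start_plt_clean) |>  # namespaced to dodge MASS::select
  glimpse()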
So now we are just going to confirm that start_plt coded correctly. Again, we're going to grab this code solution and paste it in: we count the old and the new values. It's only going to show you the first 10 rows, so you can paginate through the rows here to confirm that "Clumped" worked correctly. One thing that I often do — I'm not sure what the default is in this workspace — is View the result; you don't want that to render in your Quarto document if you're rendering the whole document, but it lets you skim through all of the values. We can see here that when the original value was coded as "Clumped", the clean value is NA, and all of our other values look like they were assigned correctly. Again, another really important check: we started with a character variable and we're ending up with a numeric variable, and we want to confirm that it is indeed numeric, so we can take df_clean, select these two variables — using dplyr::select, since we have a namespace conflict — and then do a glimpse. Here you can see that our clean value is indeed coded numerically instead of as a character. (Someone asked about the code, so I will drop this version with the na option into the chat.) We have about 12 more minutes in this session before we take another break.

Right, so another problem we might face is a character variable that should be a factor, and that really just depends on how you're using the variable and what you're doing with it. In this case we have a treatment variable that takes on three values — OSA, UPA, and UST — and Peter can tell you about those all day long; I'm not familiar with these three treatments. The consequence of this character variable being left as a character is that in any kind of presentation it is going to be shown in alphabetical order: even in our simple count here it's listed alphabetically, and it's the same with our ethnic_clean variable — again it's character, so it's listed alphabetically. You can see that these two variables, treatment and ethnic_clean, are both stored as character values, and when you go to present things more formally, like in a gtsummary table or a ggplot2 figure, those tables or figures will be arranged alphabetically. I love gtsummary for creating summary tables; here is a simple table summary, and again the treatments are listed in alphabetical order, and our ethnicity values are also listed in alphabetical order. You might want those treatments or ethnicity values rearranged — for example, maybe one of the treatments is a standard of care and should come first in your table, or maybe you want to present the most common ethnicity value first instead of in alphabetical order — and the way you approach this problem is by converting these variables from characters to factors. Here I'm using two different functions. I'm going back to my ethnic_clean variable, and I'm piping the result of the case_when statement into a function called fct_infreq. What it does is convert that character variable into a factor whose levels are ordered by the frequency of the values observed, so the most frequent value becomes the first level. A separate function, which I'll describe next, is fct_relevel for the treatment variable; both are shown in the sketch below.
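A minimal sketch of both factor conversions (df_clean, ethnic_clean, and treatment follow the names used in the talk; the "OSA" spelling and the UPA/UST/OSA ordering are taken from the spoken walkthrough):

library(dplyr)
library(forcats)

df_clean <- df_clean |>
  mutate(
    # order levels by how often each value occurs (most frequent first)
    ethnic_clean = fct_infreq(ethnic_clean),
    # set an explicit level order, e.g. the standard of care first
    treatment = fct_relevel(treatment, "UPA", "UST", "OSA")
  )

# confirm both are now factors, with the level orders we expect
df_clean |>
  dplyr::select(treatment, ethnic_clean) |>
  glimpse()
df_clean |> count(treatment)
df_clean |> count(ethnic_clean)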
For the treatment variable I'm using the fct_relevel function. In this case, if you recall, the treatments all have the same frequency — 10, 10, and 10 — so the objective here isn't to order by frequency; maybe there's a standard of care that needs to go first, or there's a logical ordering to these treatments. So here I'm explicitly stating the level ordering of the treatment variable: I want UPA to be the first level, then UST, then OSA. These two functions, fct_infreq and fct_relevel, are from the forcats package, and there's a whole host of factor functions in forcats you can use to handle factors; I'm just showing you two.

Now we want to confirm what happened. Here is our new treatment variable: we have recoded it as a factor — you can see that assignment right here where it says "fct" — and it is now in the order we specified, UPA, UST, OSA. For ethnicity, we again have a factor variable, and it presents in order from most frequent to least frequent because we used fct_infreq, so we have not Hispanic followed by Hispanic. If we take a brief look at those two variables, just to confirm, we can see that they are now coded as factors instead of characters, and when we make our formal summary table using gtsummary they present in the order we specified: UPA, UST, and OSA, and the more frequent ethnicity followed by the less frequent ethnicity.

That takes us to data cleaning exercise 3; I'll give you a quick overview. It's a prerequisite that you completed data cleaning exercise 1, or SP1, so I recommend that you copy and paste the code from SP1 before you get started if you haven't already run it. You want to look at the values of race_clean, convert race_clean to a factor such that it presents in order of the most common values, and then confirm the new coding of race_clean. I'm going to start the timer for you there. Let me illustrate this exercise quickly: I'm going to go to SP3, we explore the values of race_clean, we see that it is a character, and we want to change its ordering by converting it to a factor. So let's clean that — here I'm just piping the result of that case_when statement into fct_infreq to put it in frequency order. I've got that warning again, which was from the parse_number we corrected earlier, so I'm not too worried about it. Then we confirm the new coding of race and race_clean: it should still have the same assignments, but now you'll notice that race_clean is stored as a factor instead of a character, and if we count the values of race_clean you will see that it is now in order of the frequency of those race values.

It is time for our last break, unfortunately, so just two quick things. First, there is the problem of separating values: one way you can separate values that are contained in a single variable, separated by a slash, is using separate_wider_delim — there's a small sketch of that below.
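A tiny, hypothetical illustration of separate_wider_delim on a combined blood-pressure column (the bp tibble and the sbp/dbp names here are made up for illustration; the workshop's own example data may differ):

library(tidyr)

# hypothetical combined blood pressure column, "systolic/diastolic" in one cell
bp <- tibble::tibble(
  pat_id = 1:3,
  bp     = c("120/80", "135/85", "110/70")
)

bp |>
  separate_wider_delim(bp, delim = "/", names = c("sbp", "dbp")) |>
  dplyr::mutate(dplyr::across(c(sbp, dbp), as.numeric))  # the split pieces start as character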
Second, another feature that we're not going to have time to discuss in detail — I can move to the next slide — is assigning labels. You saw me twist my tongue over some of these variables, because I don't know the exact meaning of all of this clinical data: that context is contained in the Excel sheet. One thing you can do is actually create labels for your variables. It's a little more complicated, but you know how, when we view our data here, all we see is the variable names? You can instead convert your data to carry labels, so that underneath the variable names you see a brief description of each variable, and you have that context of what the data means right at hand. Again, apologies, we don't have time to go over that today, but it's there for review if you want to learn more.

So we're going to go ahead and take our five-minute break before Peter takes over for stage two cleaning. Crystal is going to take over the sharing and put on a timer — do I have a timer? Oh sorry, okay, then I can do it. Let me take the sharing back, I'll put on the timer, and I'm going to share again; we'll put the schedule back up here so you know exactly where we're at. We're taking our quick break right now, and then Peter's going to pick it up with stage two cleaning. While you're on break, of course, we're happy to answer any questions in the chat.

Okay, our break is done, and welcome back; hopefully everybody had a moment to get a break. There were a couple of notes in the chat about the difficulty of wrangling medication data. There is a package that was presented at R/Medicine a couple of years ago called the dope package, and there are some links in the chat about it; it may be helpful, particularly with drug names and putting drugs into categories.

So, I'm Peter Higgins, and I'm going to be talking about wrangling your data, or stage two data cleaning. Stage one was critical for doing the key things you need to do in nearly every data set: data exploration and validation. Data validation is a continuous process, right up to the time that you and your co-investigators agree on data lock. We discussed the importance of preventing data errors at the source — being directly involved in data collection and data entry, preferably using a tool with data validation at entry like REDCap — and we did the stage one cleaning to clean each column and row: cleaning the variable names, removing empty columns and rows, fixing variable types and classes, addressing missingness, fixing violations of tidy data principles by separating combined data pieces like the two pieces of blood pressure, and adding meaningful variable labels. All of that creates a clean rectangle of data, which is generally required in nearly every data analysis. And while stage one is required in nearly all projects, the stage two data cleaning I'm going to talk about is frequently needed, but not in all projects. So in this section we'll cover the frequent but optional topics of restructuring data in long or wide format, thickening or padding longitudinal data, and joining multiple data sets.

Typically we think of data as structured in wide or long format, and here's an example of wide format, where we have one row — "we don't see your slide" — oh, that's bad; fortunately a lot of it was text, which I read through, so hopefully that all made sense. So, here are the three topics in this section, and here's an example of wide format, where we have one row per patient; or we can have long format, which is one row per measurement. You can see there is a row for start MES and one for end MES, another for start bowel symptom score and end bowel symptom score, and so on, all for patient one, so we end up with six rows for patient one, whereas in the wide format those six measurements were listed in a single row.
So wide format and long format are the same data, just structured in a different way. We may want to do an analysis by patient, as each patient may or may not have the outcome; if that is our goal, and we have multiple observations or data points on each patient, this leads to wide data, with one row per patient and one column per measurement. But we are often interested in the change in an outcome over time; to make that work we need one row per measurement of the outcome, and this leads to long, sometimes called tall, data, with multiple visits and measurements for each patient.

The unit of analysis usually depends on the question we want to ask. If the unit of analysis is the patient — which is usually true for dichotomous outcomes: did the patient have the outcome of colectomy, or reach disease remission — then wide data is usually appropriate. If the unit of analysis is the encounter or a particular measurement — often these are within-patient outcomes: did the C-reactive protein improve from week 0 to week 8, did the number of sickle cell crises per year decrease after CRISPR gene therapy, did the endoscopic score decrease on treatment versus placebo — those are all questions where you use a structure of tall data, or pivot longer.

So, long data and deciding on the unit of analysis: it's actually most common to use long data, and this structure allows you to look at multiple predictors and outcomes at multiple measurements, like blood pressure, a PHQ-9 depression questionnaire, or a hemoglobin measurement. Depending on the analysis question, you may want to use wide data, analyzed by patient, more often with dichotomous outcomes. For inpatient data this gets a little complicated, because you'll have to use the encounter or the day, so you need to decide up front how you're going to handle multiple observations of the same type — like vitals every six hours in the same day or the same hospitalization. You could use the 0600 observation each day for blood pressure, or the daily average for SBP and DBP, or the max values each day, if that's what you're concerned about. Sometimes you may want to analyze both long and wide data, but for different questions, so it's often helpful to be able to pivot your data between long and wide structures.

We often enter data by patient — spreadsheets encourage us to enter longitudinal data as long rows per patient — so we end up with wide rather than tall data. Here's a typical example from our messy UC data set, now cleaned, of six measurements taken on each patient at the start and end of a six-month retrospective evaluation. You can rotate this to a tall version, with one row per measurement; that is the same data, just rotated, or pivoted, with pivot_longer — this is the long form of the data. Most R functions are vectorized to handle tall or long data, with one small observation per row, and most analyses in R are easier with tall or long data, so it's very common to pivot data from wide to long on the regular. This lengthens the data, increasing the number of rows and decreasing the number of columns, and we'll be looking at visit dates, start versus end, and measures. The arguments for pivot_longer: of course you have to reference the data you're pivoting, the columns to pivot, where the names of those columns will go (into a new column), and where the values should go; there are many optional arguments, and the details are in the help pages here — somebody should be able to click through and see this link, or share it in the chat.
So: data is your data frame or table, and you can pipe this in; cols are the columns to pivot, as a vector of names, by number, or selected with tidyselect functions; names_to is a character vector specifying the new column (or columns) to create from the information stored in the column names; and values_to is a string specifying the name of the column where you put the data stored in the cell values.

You start with the wide version. We're going to use selected columns from messy_uc; currently there are 30 rows, one per patient, with six measured quantities for each patient. So you would predict that if we pivot all the measures there will be six times 30, or 180, rows — and that's part of the confirmation Shannon was talking about: when you make a data wrangling change, confirm it; check that it turned out the way you expected. The tall version we want to end up with — you can see there are now 180 rows and four columns, 30 times six, so there's one row per observation per measure. That works, that makes sense. The question is how we're going to do it, and the key is defining, for the cols argument, which columns we're going to pivot, where the names for each variable are going to go, and where the values are going to go. So we start with the wide data; the code basically says take the columns from start_mes to end_emo — our six measurement columns — put the names into a measure column, because it's the name of the measure, and put the values into a column I'll just call score (I could call it values). The result looks like this: we have patient ID and treatment, which did not pivot at all — they identify each patient and their treatment — then a column for the measure and a column for the score, and we now have 180 rows. I just want to check that this makes sense to everybody; if it does, go ahead and give us a thumbs-up emoji in the chat, or a raised hand if you're puzzled or have questions. This is a fundamental concept, so it's worth making sure everybody's on board. Okay, that looks promising, folks are coming along. There we go.

Okay, so let's see if we can make this happen. Now, there's one issue with our measure column: if I go back and look at it, you can see it is actually two pieces of data — the time point, start or end, and the actual measure. Shannon just walked quickly through separate, but essentially you tell it the column you want to separate, what the separating character is, and what you want to separate it into. So we can take measure, separate on the underscore character, and separate it into timepoint and measure, and that gives you a tidier data set where the time point and the measure are in distinct columns. That's what I did initially; it turns out, if you read the pivot_longer documentation a little more closely, you can do this within pivot_longer with just one more argument, one I don't commonly use: you put in names_sep, set it to the underscore, give names_to two names, and it's smart enough to do the split right within pivot_longer.
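A sketch of what that one-step pivot might look like (the data frame and column names are approximations of the workshop's messy_uc object as described in the talk):

library(tidyr)

measures_long <- messy_uc_clean |>
  pivot_longer(
    cols      = start_mes:end_emo,            # the six measurement columns
    names_to  = c("timepoint", "measure"),    # split "start_mes" into "start" + "mes"
    names_sep = "_",
    values_to = "score"
  )

# 30 patients x 6 measures = 180 rows, one per observation
nrow(measures_long)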
So now it's your turn. We're going to use a data set called endo_data. These are measurements of the transepithelial electrical resistance, or TEER — which is sort of the inverse of leakiness — in biopsies of three segments of intestine, in patients with different levels of portal hypertension with liver cirrhosis; this could affect the leakiness of the gut, or the TEER, so let's find out what the data show us. If you can get to the exercises, you can load the data on your home computer, but we're going to jump to Posit Cloud to try this out; I'll still be sharing. Some good questions about spread and gather: those have been soft-deprecated — I think that's the official term — they are still functional but they're not being updated, so pivot_longer and pivot_wider are where things are at now.

This is what endo_data looks like; I'll just run this to get it into the endo_data object. These are the people who have portal hypertension (one) or do not (zero), and these are the measurements of TEER for the duodenal biopsy and the colon biopsy. Just for folks who don't know, the duodenum is the beginning of the small intestine, the ileum is the end of the small intestine, and of course the colon is at the end. We're going to work here at the part that says "code": you have to tell it which columns you want to pivot; it will use the names_pattern argument to select the thing within the parentheses — basically all the characters before "_teer" — so it will tell you whether it's duodenum, ileum, or colon; and then tell it where you want the names and the values to go. You should be able to do that pivot and get it to 180 rows rather than 30 rows. I'm going to jump back — here's the data, and the arguments: cols, names_pattern, names_to, and values_to; you're basically filling those in. I don't want to spoil anyone with a solution, but you can peek at the solution on the exercises page if you hit the gray code arrow. The result should look like this, where you have the location — duodenum, ileum, or colon — and then the TEER value for each, and then you can look at who has portal hypertension and who does not, and see if you think portal hypertension has an effect on TEER, the inverse of epithelial leakiness. Feel free to put a thumbs-up in the chat if this is working, or comments on whether you think portal hypertension affects epithelial resistance, or TEER, and in which intestinal segment. I should have started the timer a while ago; I'll count it down to one minute. Feel free to let us know in the chat if you're struggling or figuring it out, and if you want, go ahead and peek at the solution on the exercises page. Someone — who I think is from Poland — with a thumbs up, thank you; Luke's got it done. I've got the code here; put a hand up if you're having trouble or need a hint, and definitely phone a friend. The result should look like this. Going back to the exercise, the code is here, and in the solution we select the columns to pivot, from duod_teer to colon_teer; names_pattern pulls in the organ — I just put "location", because it was duodenum or colon — and values_to is "teer". So we get our intestinal location here and our values here. Overall, from this very small sample, it looks like portal hypertension affects the duodenal TEER: it's lower, so there's more leakiness. It doesn't seem to affect the other segments as much, but in the folks with portal hypertension the duodenal TEER seems noticeably lower. So, a little liver factoid for you there.
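A sketch of the solution just described (the column names duod_teer, ileum_teer, and colon_teer are reconstructed from the spoken walkthrough, so they may not match the exercise file exactly):

library(tidyr)

endo_long <- endo_data |>
  pivot_longer(
    cols          = duod_teer:colon_teer,   # the TEER measurement columns
    names_pattern = "(.*)_teer",            # keep everything before "_teer"
    names_to      = "location",
    values_to     = "teer"
  )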
You can also pivot wider. Wide data is a little less common, but sometimes needed for per-patient analysis, so you may have data in tall format that you want to convert back; this is what the tall data look like, as we just made them. To pivot wider, you give it the id_cols — the ones that are not going to be pivoted, in this case patient ID and treatment — the names come from the measure column, and the values come from the score column. Basically we're doing the reverse, three different arguments, and we end up back where we started. That can be helpful, particularly if you get data out of something like REDCap, where it's organized by measurement, and you want a patient-level analysis; this may be an important tool for you.

Turning now to longitudinal data: you may run into data issues when one of two things happens. Either you want to analyze data by day, month, or year, and most of the time data in your electronic medical record — like when a medication was administered — is collected and time-stamped to the second, not the most convenient time interval; or you realize that some observations, particularly on weekends, are missing, you need to fill in those dates as missing, and you really don't want to do it by hand. The padr package can help with both issues. Here's an example data set: the emergency data set in the padr package contains over 120,000 emergency calls from Montgomery County, Pennsylvania over a period of about 11 months; each call has a title describing what's going on and a time stamp with date and time down to the second. We want to know in which months the emergency department should order an extra case of Narcan; it's been suggested that there are more overdoses, and you need more Narcan, in the summer months, and we want to test this. Our starting point: the padr package contains a thicken function, which adds a column to a data frame at a higher (coarser) time interval than the original variable. The interval options for padr are year, quarter, month, week, day, hour, min, and sec — that last one wouldn't help much here, but you could use it if you had data in milliseconds, I suppose. The variable time_stamp has an interval of seconds; that's the one we start with. We can thicken the data to the time interval we need, then count events by a usable unit of time — we want months for overdoses. So the original data has these date-time stamps down to the second, and we're going to thicken them to month; basically the command is pipe it in, thicken "month". To make it easier to show, I just took the head — six rows — and selected out a few variables so it would fit on the screen. Here's your original time stamp, and time_stamp_month essentially takes the date and assigns it to the first day of the month. We can also thicken to week — same idea, thicken "week", remove some extra stuff — and it assigns the first day of that week, so you can graph data by week; or you could do day, or whatever thickening you want.

We want to do this by month and look at overdoses, so we thicken to month, then select and count the overdoses, and set up the monthly overdose plot we want. The code is: thicken by month, group by time_stamp_month, summarize overdoses — basically detect the word "overdose" in the title, assuming people coded these correctly — and select time_stamp_month and overdoses. We get, from December 2015 to October 2016, the number of overdoses per month, and if we plot it we can see this rise, fall, and rise again. So when would you order extra cases of Narcan? (And a good question in the chat: what's the difference between thickening and just extracting the month? It's really not that different; it's easy to do it multiple ways.) In this case, sadly enough, I think we're looking at March and September.
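A minimal sketch of that overdose count, assuming the title and time_stamp columns of padr's built-in emergency data; the case-insensitive regex is my assumption about how the call titles are written:

library(dplyr)
library(stringr)
library(padr)

overdoses_by_month <- emergency |>
  thicken("month") |>                 # adds time_stamp_month: first day of each month
  group_by(time_stamp_month) |>
  summarize(
    overdoses = sum(str_detect(title, regex("overdose", ignore_case = TRUE)))
  )

overdoses_by_month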
Unfortunately this is a back-to-school effect, where folks who are narcotic-addicted drop their kids off at school and then use drugs; it's a sad statement on humanity, but it is a reality.

You can also pad unobserved dates. The pad function allows you to fill in missing intervals. For example, my hospital only runs fecal calprotectin tests on weekdays, so you get these weird discontinuities: you have a fecal calprotectin on Thursday, then Friday, and then the patient magically gets much better by Monday, because there are two missing days in between that we don't analyze — generally no observations on weekends or holidays. Now, you could go in by hand and insert NAs for those days, but this is something you can do with the pad function: essentially you pad by day, and you group by patient ID so it's not padding the days between one patient and the next. I'll show you 12 rows; essentially it goes in and says, okay, this is the patient, I'm going to put in the missing days and give you NAs, filling in the missing data in a systematic way that makes your life a little easier.

Turning to our third topic, joining data. This is another data issue that often faces you: you have two or more interesting data sets, and they would be more interesting if you could link the data in one to the data in the other — much like Reese's tells us about chocolate and peanut butter, in this case with data sets. We often collect data from different sources that we later want to join together for analysis, maybe from your electronic medical record and the CDC or the U.S. Census; external data can illuminate our understanding of local patient data and why patients who seem similar have different outcomes. So we're going to set up a problem with two data sets: one local, with demographics and census tract, and one from the CDC with values of the social vulnerability index (SVI) by census tract. We want to know whether the SVI for each patient's neighborhood influences, or is associated with, their health outcomes, so we need to left join these data sets by matching on census tract. The data we have: the demography over here, from our local EMR, and the CDC data we downloaded over here. What we're looking for is something to match them on that's identical in both, and in this case it's census tract, which fortunately not only matches by the name of the variable — exactly the same, lower snake case — but is also a double in both. Every once in a while you get foiled if they're different data types, because then they won't join and they won't match. The typical code is a left join of the demographic data, which is organized by patient, adding the new data — the social vulnerability index from the CDC data set — on the right-hand side. Sometimes these are called the left-hand side and the right-hand side, or X and Y, but the general idea is you left join your base data set and your new data set by whatever the key, unique ID, or matching variable is: in this case, left join demo and the CDC data by census tract. We get these linked, and we can now look at whether there is an association between social vulnerability, in this column, and hypertension. In this very quick look at a grand total of nine patients, it looks like there is — people with higher social vulnerability are more likely to have hypertension. This is tiny, tiny data, but it gives you an idea of the kinds of things you can do by joining data sets.
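As a sketch of that join (demo, cdc_svi, and the svi/hypertension column names are stand-ins for the slide's objects; the only detail taken directly from the talk is that both tables share a census_tract column of the same type):

library(dplyr)

linked <- demo |>
  left_join(cdc_svi, by = "census_tract")

# now SVI sits next to local outcomes such as hypertension
linked |>
  dplyr::select(census_tract, svi, hypertension)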
So here's an example — patient demographics with lab results — and this will be your turn to join. We have a group of patients and their ages from data queries from our electronic medical record, a set of potassium values for these folks, and a set of creatinine values for these folks; for whatever reason, they were pulled separately, so we're going to join them, one at a time, to these patients. You can load the data on your own computer with this code here, but we're going to do most of this in Posit Cloud. You want to join the correct labs — nine rows from each of the potassium and creatinine data sets — to the correct patients. The unique identifier, called a unique ID or key or record ID, is pat_id, for patient ID: it occurs only once per patient or row, it appears in each table we want to join, and it's of the same type in each data set; you can sometimes get foiled if they look the same but don't match. We want to start with demographics — that will be our left-hand side, or X, data set — and then add matching data sets to the right. So we use demo as our base data set on the left-hand side and first join the potassium results on the right-hand side. This is what's called a mutating join: new variables from the Y data set, the new data set, are created and added to the base data set where they match; the rows of Y that don't match are eliminated, and for row three, which has no match, an NA is inserted. We can do this on Posit Cloud in exercise PH2; I will actually start the timer on time, and let's take a peek at Posit Cloud. Let's see — it's a bit further down, you'll have to scroll pretty far down to get to PH2, sorry about that. You can load the data by running this first chunk, which loads all three data sets, and we want to join the labs: the code should be left join, data X, data Y, by your unique ID — in this case demo to potassium — and we want this to result in the data frame new_data. When you've got this sorted, give us a thumbs up in the chat; hopefully it's not too hard to figure out — your two data sets and the unique ID — and we're just adding the potassium data set at this point. If you want to jump ahead, you can add creatinine as a secondary part down here. Aatu — I hope I'm pronouncing that right — has a thumbs up. And remember, you want quotes around the by variable, your unique ID; that's a little unusual, and it's one I get foiled by fairly often, to be fair, and it's always annoying, but it's by equals and then the ID in quotes. Other folks having trouble getting it done, or moving ahead to creatinine, just let us know in the chat. The result should link all these patients and their ages to the potassium column, and as we're at 1:32 I'll show you the code so you can jump in: it's left join, demography, potassium, by — in this case the patient ID in quotes, "pat_id" — and that should give you the patients linked to their potassium levels. Our next step, and you can continue in Posit Cloud, is taking new_data and adding creatinine to produce new_data2; the key is to pick the right unique ID and end up with age, potassium, and creatinine all linked. Anybody struggling, please raise a hand; if not, I will continue to charge forward. We got a thumbs up, so I'll show you the code, and it's not too different: we left join new_data, which we just created, with creatinine, again by patient ID, and we get new_data2. Now we have a complete data set, and we can look at which patients have high potassium and questionable creatinine levels, hypertension, and diabetes.
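A sketch of that two-step join, following the object names used in the walkthrough (demo, potassium, creatinine, new_data, new_data2, and the quoted "pat_id" key):

library(dplyr)

new_data <- demo |>
  left_join(potassium, by = "pat_id")

new_data2 <- new_data |>
  left_join(creatinine, by = "pat_id")

# or, equivalently, as one pipeline
all_labs <- demo |>
  left_join(potassium,  by = "pat_id") |>
  left_join(creatinine, by = "pat_id")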
This one has early renal failure at age 44, and one that is surprising is only 27 but has bad-looking potassium and creatinine; he unfortunately has early-stage focal segmental glomerulosclerosis. So we can identify folks with problems by linking data from different places, and start looking at interesting things like outcomes.

Left join is generally your workhorse join. You generally start with the patients you want to study and add data to the right side — that's your Y data set — so where there are matches you'll be adding new variables, in this case the Y variable, and where there are no matches you'll get NAs, like with subject 3. Occasionally you'll bring in data that you need to wrangle and process and then pipe in, so it comes in as the base data set; then you might need a right join, where the right-hand side data set is your base and the new variables come from the piped-in data set. Most of the time I use left join, but if I have to do a lot of data wrangling and then pipe it in, I might use a right join. There are also fancier kinds of joins — semi joins, anti joins, inner and full joins, unions, intersects — which do come in handy once in a while; all of these are explained well at this link, so feel free to click on it in the slides. The two I probably use most often are semi-join and anti-join. For a semi-join, you're subsetting your data, keeping only a particular group: if I have the patients I'm studying on the left, and the Y data set on the right is patients who've been hospitalized, I only want to keep the ones who've been hospitalized, but I don't need a 1/0 column to say they're hospitalized, so I can do a semi-join and say take all the ones here and keep only the ones that have a match in the Y data set. The opposite is an anti-join, which occasionally comes in handy. Here's an example: you have a data set on the left, and on the right a data set of patients who have died, or are listed in the medical record as dead, and you don't want to study the dead people. You can say, okay, take my original data set and eliminate anyone who is dead — anyone present in the other data set — without adding a new 1/0 variable for death; just drop those rows and keep the people who are not dead. That's where you use anti-join. There are even newer, fancier joins using the join_by argument, which gives you a lot more options, not just equality matching. You can have inequality — only match if a value is greater than another value, so maybe you're looking at readmissions and only want admissions that happen after a previous or index admission. You can select the closest value — the closest date, or the closest date after the comparison date but within 30 days, important for readmissions — or only match if an interval or date overlaps, or match a value if it's within a specific range. There's a lot more you can do with join specifications, and the inequality and range joins are really helpful.
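A hedged sketch of those three join flavors; every table and column name here (patients, hospitalized, deceased, readmissions, admissions, the date columns) is a hypothetical illustration, not the workshop's data:

library(dplyr)

# semi_join: keep only study patients who also appear in the hospitalized table,
# without adding any new columns
hospitalized_only <- patients |>
  semi_join(hospitalized, by = "pat_id")

# anti_join: drop study patients who appear in the deceased table
alive_only <- patients |>
  anti_join(deceased, by = "pat_id")

# join_by with an inequality plus closest(): match each readmission to the
# nearest prior index admission for the same patient
readmits_linked <- readmissions |>
  left_join(
    admissions,
    by = join_by(pat_id, closest(readmit_date >= admit_date))
  )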
I also want to mention what happens when you encounter something incredibly messy, something that constitutes a data crime: there is not a single variable in a column, there is not a single patient in a row; in fact there are subsets and counts, but they're not actually observations. So what should you do when you're confronted with data of this sort, or even worse? There are two main approaches. One: if you can find the person who collected and entered the data, this is an opportunity — if you don't correct this behavior now, it could torture many future data analysts. It's a teachable moment to educate them about tidy data and send them to watch the tidy spreadsheets video on YouTube, here, to prevent this kind of data crime in the future, and you can improve the world one new tidy data collector at a time. Alternatively, and unfortunately, sometimes there's no way to find the person who collected or entered the data — it was just sent to you, found, or downloaded. This is where you can pull out some advanced data cleaning packages made for this particular kind of mess, like unpivotr, tidyxl (which is great for color-coded data), and unheadr. There's a nice video at this link about unpivotr and tidyxl and how to use them, and a free ebook called Spreadsheet Munging Strategies, for when you face really hideous data and have no way to get the collector to redo the data entry in a tidy fashion.

So we've gone through step-by-step cleaning. Stage one: importing your data cleanly, skipping rows and getting headers; exploring your data; removing empty rows and columns; fixing the variable classes and types; checking dates; cleaning variable names; creating a codebook; validating for missingness and outliers; fixing bad category values so the values make sense and are not misspelled or capitalized in strange ways; identifying and consistently labeling missing values; separating mixed values, like systolic and diastolic blood pressure; and assigning variable labels. We've also talked about letting your question define your unit of analysis — pivot longer or pivot wider if needed — thickening time or padding missing times if needed, joining to add new data if needed, and knowing which packages to use if you're faced with data horrors; I'm referring you to unpivotr, tidyxl, and unheadr. Hopefully you'll never actually have to use them, but if you do, now you've heard of them, you can easily Google them, they're on CRAN, and they have good vignettes that walk you through dealing with really difficult data messes. So thank you, and you can click here on the slides to submit workshop feedback, which we would really appreciate, and we're happy to take other questions. Yes — Beth mentions that these new joins really help with the not-exactly-matching, near-matching inequalities; things that are close are super helpful.

I'll also say that if you enjoyed this workshop, please give Peter a huge thank you and a round of applause, because this was Peter's presentation idea and he organized it and made it happen — so thank you very much, Peter. Thank you; hopefully we've given you a checklist, or a pathway, to think about stage one cleaning to a clean rectangle — checking each variable, checking each row, checking the data values — and then thinking about ways to add data or restructure your data, which are optional, not needed in every situation, but can be really helpful. Of course we have a few minutes, so we're happy to take any questions. I got a question earlier: will Posit Cloud be available forever? I'm sure it will be available through the end of June, this particular account, but all the exercises and the solutions are available on the web page for the course, and I can put that in the chat as well. The feedback link I actually took out of the slides; if you refresh the slides it shouldn't be there. Oh — did you create one, Peter?
No, no — yeah, that's an old link from a different workshop that never got replaced, so just ignore that feedback link; I thought I had taken it out of the slides, maybe it didn't go through. Yes, Aatu points out the cloud project — yeah, I forget where that download button is, but there is a way to download that repository. I think you can give feedback on the R/Medicine website; they'll have a link for you to give feedback there, on the schedule, I think.

Are there any other questions? I know there was quite a bit of discussion about superseded functions, and we're glad to show some newer ones. There's a question in the chat about how you would deal with data sets providing different statistical measures for the same variable — mean and standard deviation for some rows, median and IQR for others. In my mind, I always separate the raw data from the summary data, and in my field specifically, the pharmaceutical world, there's a new concept called "analysis results as data": basically saving analysis results in a tidy format, which would be one row per summary statistic — one row per mean, one row per median, one row per IQR, one row per standard deviation — so that you can then take that summary data and transform it into any other tabling style you want. I know that's specific to my field; in terms of what's happening in the clinical world, or with medical data, do you have any thoughts, Peter? I was just noticing there actually is an official feedback page, which is now in the chat, so that's an option. And Shannon, can you restate that? I don't think I understood the question. The question was about how you deal with data sets providing different statistical measures for the same variable — mean and standard deviation for some rows, median and IQR for others — and I would just encourage you to think about the structure of your data set, and what should be a row and what should be a column, in order to keep that data moving to where it needs to go in terms of the summary tables you use. Yeah — I think most people are very comfortable with means and standard deviations, but if you have wildly non-normal data, like hospital stays or medical expenditures, which tend to be very low with very long tails at the extremes, it really makes a lot of sense to focus on the medians and the ranges. Hopefully that addresses the question; I'm pretty sure it does. The other question — I'm trying to get the YouTube link right; I think that's right, yes, we did a tidy spreadsheets workshop a couple of years ago, that is right. That link is mostly for teaching people — especially if you have a med student who needs a little guidance, it can at least get them started in the right direction.

Is it possible to access the R Markdown for the slides? Well, believe it or not, I learned Quarto for this at Shannon's behest, so these are actually Quarto slides. Yes, it is possible; the repository is public, so if you are on our course website there is a — I will share the screen. Yeah, you can see the website now, there we go. Okay, so the website — I keep getting notifications; I think you're good now — okay, I think I'm not. Peter, when I go to the link you shared, I'm still seeing the dope package link. Really? Yeah. Well, you know what, the link to the Excel talk is in the slides, but I did share that early on too; I think my copying may be stuck. Yeah, I was just having some pretty big problems with Zoom — but are you seeing it now? Okay.
So, for the person who was asking about the source code for the slides: if you are at our website and you click on this little GitHub icon, it's going to take you to the source code for the website, and the source code for the slides is contained in there alongside the slides, so you'll dig into the .qmd files to see it — and again, they're Quarto slides. Beth, you're right, I've got the proper link, which I got from the website; I think Crystal just shared it too, so we should have it double-shared now. How do you do the timer on the slides? That's a great question, it always comes up — what was that called — it's from the countdown package, countdown, and you can pull up the slide source in Quarto to find it.

We had another request for how to download the Posit Cloud repository, so let's try to demonstrate that. Actually, I've got it, thanks — okay, awesome. I'm assuming you can see that: if you click here to go to the cleaning medical data workspace — this is my version right over here — there's an export/download button which allows you to download the workspace. I don't see it on mine. Yeah, I think there's a different way you can do it if you're in it, and I always forget. If you click on that workspace — just go into your cleaning medical data project, and this is my version, Shannon's version is up here — for my version I have multiple buttons: I can throw it away, I can move it, or I can export it, and you just click here to export. Can a participant confirm whether they have that button? No, I don't have it. Yeah, so there's another way you can do it; we'll poke around and try to remember. If you find it, Shannon, go ahead and share it — I did an export from the RStudio interface once, and that would work. And what is the export that you got to work? If you go to the RStudio interface, then go to the bottom-right pane where you see the Files tab — ah, yeah — in the Files pane you click on the folder, then you have an export option, and it creates a zip file. Makes sense, okay. Let's see, maybe Rodrigo can teach me how to do this. Okay, good. So here we are — do you see my screen? We see a Notepad; let's switch it around. Yeah — go one level up in your files, go one level up, yes, check the project, all right, More, then go to Export. All right: click on the project, then More, and then Export, in order to download this project, and then it's going to ask you for a local folder on your computer. So again: go to the file pane — yeah, there it is — you're going to go up one level; if you click here, that brings you to the home directory, you go up one level to your cloud space, click on the project, then go to More and Export. Would you please share the GitHub URL? Oh sure — there, it's in the chat, thank you. There's also a question about how to save the chat: if you're in the chat, go to the three dots at the bottom, and you can save the chat as a text file. Thank you, everyone. I do not plan to remove this repository, so the repository and website should stay up as long as it exists. We will update a couple of exercises based on some of the changes, and the website will have updated code as well, for those of you who were asking — I know Shannon made a couple of tweaks to the code today, so we'll update that. Thank you. Thanks, everybody. Yeah, thank you. Thank you, everyone.