Here's a tidy data set. It was tidied and cleaned from this messy spreadsheet. This video is a case study in cleaning an untidy Excel data set. The outcome we want is a tidy data frame. The input we have is a spreadsheet designed for human eyes, not for reproducible computation.

Regarding analytics, one thing we know is that data cleaning consumes effort, reportedly 50 to 80 percent of a data project. And you know what? The spreadsheet model of data computation is one of the many culprits. So this one video on data cleaning is a building block tip. It's part of a series of video clips where I explain how to use the purrr map functions to iterate custom functions over data frames.

As I mentioned, even though this spreadsheet is untidy, the Excel file does have useful structure and serves reasonably well as a report. At the same time, it's not especially computationally friendly. Why? Well, iterating over multiple reports from this US Census weekly Pulse Survey series will be more challenging than necessary because these are Excel spreadsheets. The fault is not that of the US Census. The fault is more systemic: a reliance on spreadsheet technology. Spreadsheets enable and even promote unfocused workflows. They basically encourage data chaos. Yeah, I said that.

The report has no single purpose. Rather, it serves as both a computational workflow and a rendered document. Spreadsheets such as this are bad at displaying the artifacts of the project workflow, and they are inconvenient for large-scale computational thinking. Now, back in the 1980s, spreadsheets were the killer app. In that early stage of personal computing, these tools preceded and informed our evolving conversation about data science and reproducibility. But simply put, spreadsheets haven't kept pace, and they have enough design flaws at this point that you should consider using better tools to support your research. Look, don't feel bad if you use spreadsheets from time to time. It's fine. Instead, make it a goal to incrementally modernize your computational workflow. Over time, a tidyverse approach will feel safer and more reliable.

All right, now for some code. The first part of my code, around line 33, is about getting the data from the US Census website. There's also a copy of the data file in the GitHub repository, so as always, see the repo for my shareable code. In lines 41 through 67, I introduce some of the code to normalize the data. Remember, my goal is to create computationally tidy data. To do that, I'm removing the report niceties, because in a reproducible context a rendered report is a trap, especially if the report is the only artifact you have. A pretty report can always be derived from a reproducible workflow and a code notebook. So yes, my final report can still be pretty and easy to read. But first I need to clean and normalize the data so that I can leverage computational power and iteration.

All right, let's take a closer look at the R code. I ingest the computational data set where the source data actually begins, at spreadsheet row four, not at the top at row one, so I read the spreadsheet in from rows four through six onward. Look at that spreadsheet. Those double-row variable names are a mess. They're profoundly untidy and hard to use computationally. Notice also that I separately ingest only the first three spreadsheet rows. That's metadata, an element of a report title perhaps, but it's not tidy data.
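Here's a minimal sketch of what that ingest might look like with readxl. The file name, sheet number, and the exact skip and n_max values are placeholders I'm assuming for illustration, not the exact code in my repo, so check the repo for the shareable version.

```r
# Minimal ingest sketch; path, sheet, skip, and n_max are placeholder values
library(readxl)

path <- "pulse_survey_education.xlsx"  # local copy of the Census Excel file

# The first few spreadsheet rows are report metadata (title text), not data;
# keep them in their own object so they can become plot labels later
meta_raw <- read_excel(path, sheet = 1, n_max = 3, col_names = FALSE)

# The data itself starts further down, so skip the metadata rows on ingest;
# the messy double-row headers still need normalizing after this read
data_raw <- read_excel(path, sheet = 1, skip = 3)
```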
I'll save that metadata in a vector for later. It will come in handy when I generate labels for my bar graph. These ordered vectors will also be useful when I coerce a sorting arrangement in my bar graph.

Now for the good stuff: the data wrangling. I want a tidy data frame. First, I'm getting rid of the row that has totals in it. Why? Well, I can easily compute totals; I want raw data in my tidy data frame. Also, see how the spreadsheet has two levels of categories in column one. Again, nicely formatted for human eyes, but a big pain for computation. I want one variable for the subcategory, then I'll fill down, and I still have my category variable, called select characteristics. I want both. Next I'm going to get rid of the empty rows; again, good for the eyes, not needed for computation. All right, let's have a look. We're tidying up. It's starting to look like a tidy data frame.

In my analysis at the end, I want to look at all education variables for just North Carolina from this data set. In the next video, I want to build on what we're learning in this series and iterate over several Excel files so that I have data for all the Pulse Surveys. But for now, let's just stick with a single spreadsheet. It will help me refine my workflow.

All right, here's a problem: all the numbers are character data types. I need to transform those. We'll use the across function to mutate all the columns to the right of the select characteristics variable, coercing them to numeric rather than character data, from the total variable all the way to the last column on the right.

In this video series, I've mentioned how tall data is often computationally more useful for iterating. So now I'm pivoting this data into a tall shape. I want to get some data summaries and then coerce a categorical order onto the select characteristics variable. As a tall data frame, this data will be easier to iterate over in ggplot2, and using the select characteristics variable, I'll be able to facet_wrap my bar graph.

Recall that above, in code lines 56 through 78, I gathered some metadata from the Excel report. Now, using that metadata and the ggplot2 labs function, I can present that metadata in my rendered plot. Lastly, for good measure and for reproducibility, let's save the wrangled data files as artifacts of our reproducible workflow.

In my next video, I'm going to demonstrate custom functions, iteration, regex, and data cleaning. I'm basically going to tie it all together and demonstrate how to ingest multiple worksheets from the multiple Census Pulse Excel workbooks and then clean them all up into one giant data frame.
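To recap the wrangling we just walked through, here's a rough sketch of the pipeline, written against the data_raw and meta_raw objects from the ingest sketch. The column names select_characteristics and total, and the "Total" row label, are illustrative stand-ins for whatever the normalized headers end up being, so treat this as a pattern rather than the exact code in my repo.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)

pulse_tall <- data_raw %>%
  # drop the pre-computed totals row; totals are easy to recompute from raw data
  filter(select_characteristics != "Total") %>%
  # column one mixes two levels: category header rows have no numbers,
  # so tag them, fill the label down, then drop the header and spacer rows
  mutate(category = if_else(is.na(total), select_characteristics, NA_character_),
         .before = 1) %>%
  fill(category, .direction = "down") %>%
  filter(!is.na(total)) %>%
  # the counts were ingested as character; coerce everything from total through
  # the last column to numeric (parse_number() is an option if the cells
  # contain thousands separators)
  mutate(across(total:last_col(), as.numeric)) %>%
  # pivot to a tall shape, which is easier to summarize and to facet in ggplot2
  pivot_longer(cols = total:last_col(),
               names_to = "response", values_to = "count")

# A quick faceted bar graph, reusing the captured report metadata as the title
ggplot(pulse_tall, aes(x = response, y = count)) +
  geom_col() +
  coord_flip() +
  facet_wrap(vars(select_characteristics)) +
  labs(title = meta_raw[[1]][1])

# Save the wrangled data as an artifact of the reproducible workflow
write_csv(pulse_tall, "pulse_tall.csv")
```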