Hi, I'm John Little and you're watching the Introduction to R series. This series is part of the Rfun Learning Resources website sponsored by the Center for Data and Visualization Sciences, a part of the Duke University Libraries. In this section, we'll talk about how to turn your messy data into tidy data. What is tidy data, and what tools are available to reshape that data? First of all, the concept of tidy data. Data comes in a variety of formats, and as a data analyst you'll spend an inordinate amount of time just reshaping your data so that it's more convenient to do efficient and reproducible analysis. One of the best ways to do that is to turn your data into a rectangular grid, like a spreadsheet, where every variable is a column, every row is an observation, and every cell holds a single value at the intersection of that variable and observation. Now to do that, we often need to pivot our data. It's worth noting that the tidyverse has evolved somewhat rapidly in the last couple of years, and there have been lots of different names, listed here, for how to pivot your data. Most recently, tidyr, which is the package that enables this reshaping, was using the function names spread and gather. Recently, new functions were introduced that are a little bit easier to understand and do nearly the same thing. pivot_longer replaces, or does something similar to, gather; pivot_wider replaces, or does something similar to, spread. We'll talk about pivot_longer and pivot_wider. I'm going to open this longer/wider markdown file. The first thing I'm going to do is load the tidyverse package, which by default will load tidyr. I've turned my messages off. And because I've loaded the tidyr package, I now have access to several onboard datasets, two of which work really well with tidyr: relig_income, which is a dataset of income by religious affiliation, and fish_encounters. So I'll execute those code chunks.
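For readers following along without the video, the setup just described might look roughly like this. It is a minimal sketch that assumes the tidyverse is already installed; the comments describing each dataset are my own summary, not the speaker's.

```r
# Load the tidyverse; this attaches tidyr along with dplyr, ggplot2, forcats, and others.
library(tidyverse)

# Two example datasets that ship with tidyr.
relig_income     # wide: one row per religion, one column per income bracket
fish_encounters  # long: one row per fish and station
```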
Now let's talk first about religious income. That data frame, if I execute it, looks like this. What I see is religion in the first column, and then income brackets as variable names across the top, with how many people identified into each income bracket, based on the number of people, obviously, who answered the survey. The last column is "Don't know/refused". If I were to pivot longer using this tidyr function, the first argument is which columns I'm pivoting. And I'm actually pivoting from column two all the way to the end. Given the tidyverse dplyr selection methods, I can simply say all columns except religion. Or I could have written from column two to the end. There are a number of ways you could write this, but this is the simplest way. And then what I'm going to do is take the names, or the column headers, and turn them into a variable called income. So the column header names become values in a column called income. And then all the rest of this, these values, become values in a column called count. So what does that look like? That looks like this. So we can now demonstrate certain tidy data concepts, right? Namely, every row is an observation. And every cell is a single value, right? Prior to that, there were multiple observations in this first row: 27 people who identified as agnostic and as making less than $10,000 a year, whereas 84 people identified as making more than $150,000 a year. Here, each one of those observations is its own row. And each one of the cells has its own value. There may be some redundancy, but that's okay. If we scroll through this whole thing, we see that the atheists and then the Buddhists and then the Catholics all have values for each observation. We can do something similar with pivot_wider. So there's that onboard dataset fish_encounters, where each fish has an ID and each station it passes is recorded.
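Going back to the religious income pivot described above, a sketch of that call could look like the following. The object name relig_long is my own choice for illustration, and the exact selection syntax used in the video is not shown in the transcript, so treat `!religion` as one reasonable way to say "all columns except religion".

```r
library(tidyverse)

# Pivot every column except religion into two new columns:
# the old column headers go into "income", the cell values into "count".
relig_long <- relig_income %>%
  pivot_longer(
    cols = !religion,        # equivalently: cols = 2:last_col()
    names_to = "income",
    values_to = "count"
  )

relig_long
```

Each religion now appears once per income bracket, so the result has one row per religion and bracket combination rather than one row per religion.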
And the seen variable records that the fish was encountered at that station. If we wanted to pivot this to a wide format, we could do something similar to what we did before. We could use pivot_wider with names_from, so the new column headers come from the station variable. And then the values come from the corresponding seen variable. So when we run that, we now have a wide table format, which may suit certain analyses that you need to do. That's really everything that this video needs to tell you. But sometimes in my workshops, people want to know a little bit more. Essentially, why exactly do I want my data to be long? Why am I doing this? So I just want to go a little bit deeper and say that, for example, in a tidyverse context, ggplot mostly prefers long, or tall, data. So we pivot that data to make it easier to use ggplot. It's not absolutely necessary, as you'll see here in a minute. But if I take my religious data and run that same pivot command that I had earlier, and then pipe that to ggplot, identifying the X and Y axes, I can get a bar chart. First, let's take a quick look at the data just to refresh our memories. I'll highlight those two things and type Ctrl+Enter. Here's my data frame. So every row is an observation. Religion becomes my X axis. Count becomes my Y axis. And I'm making a stacked bar chart, so that different income values are associated with consistent colors across the bar chart. What does that look like? There's the ggplot. Now, this isn't the most beautiful chart in the world. In fact, it needs quite a bit of work. But the fact that it needs quite a bit of work can, at least indirectly, demonstrate why tidy data has some value. Once my data is in a tidy format, it becomes more natural, in fact easier, to make variations of plots to tell my data story. So what would I want to do with this dataset?
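The two steps described above, widening the fish data and piping the long religious income data into a first chart, might be sketched as follows. The transcript does not show which geom was used, so geom_col() here is an assumption, and the object names fish_wide and p_first are hypothetical.

```r
library(tidyverse)

# Wide format: one row per fish, one column per station,
# with the cell values taken from the seen variable.
fish_wide <- fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)

# Long religious income data piped straight into a stacked bar chart.
# geom_col() is an assumption; the video may have used a different geom.
p_first <- relig_income %>%
  pivot_longer(cols = !religion, names_to = "income", values_to = "count") %>%
  ggplot(aes(x = religion, y = count, fill = income)) +
  geom_col()

p_first
```

Fish that were never seen at a given station end up with NA in that station's column, which is the usual trade-off of a wide layout.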
Well, for one thing, I'd like to sort from tallest bar to shortest bar. And for another thing, I'd like to have the legend be in order. You'll notice here that it's not really in any logical order. We go from 10,000 to 20,000, then 100,000 to 150,000, then 20,000 to 30,000. And we can't really read the X axis labels. So these are not strictly tidyr concepts. This is more demonstrating how the tidyverse comes together to make for a more holistic and easier set of functions to use as you tell your data story. All right, so what would we want to do with this chart? Well, we'd like to order the bars. We'd like to order the stacked segments so that they're in the right order, and we'd like to order the legend. So now that I've got down the concept of converting my data to a tall format with pivot_longer, I'll introduce a couple of other concepts, which I can talk about later, such as turning a vector into a factor, or categorizing my data. By turning that vector into a factor, I can impose order, which I'll do here. So I'm creating a vector called income_levels in a particular order. I'm using the forcats function fct_relevel to impose the income level order on the income variable. Now it wouldn't be perfectly obvious if I just run, oops, I need to run both of these. If I just run that part of the command, it's not perfectly obvious that it's done anything, other than I can see that this vector is now a factor. But those levels have imposed this order through the function fct_relevel. And I can then pipe that to ggplot and, using the fill argument, which is also going to be a factor based on income values, I can create a stacked bar chart, which in this case uses the viridis color ramp to categorize. Now my legend is in order. My bars are sorted in order. And the colors of the stacked elements are in the same order as the legend. Again, I want to go back. You don't absolutely need to reshape your data in order to do this.
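A sketch of the ordering steps just described could look like this. The transcript does not list the contents of income_levels, so here the vector is taken from the original column order of relig_income, which is an assumption on my part; the use of fct_reorder with sum to sort the bars, geom_col(), and scale_fill_viridis_d() are likewise my guesses at calls that would produce the chart described.

```r
library(tidyverse)

# Assumption: the desired bracket order matches the column order of relig_income.
income_levels <- names(relig_income)[-1]

relig_ordered <- relig_income %>%
  pivot_longer(cols = !religion, names_to = "income", values_to = "count") %>%
  mutate(
    # Impose the bracket order on the income variable (legend and stack order).
    income = fct_relevel(income, income_levels),
    # Sort religions by total count so bars run from tallest to shortest.
    religion = fct_reorder(religion, count, .fun = sum, .desc = TRUE)
  )

p_ordered <- ggplot(relig_ordered, aes(x = religion, y = count, fill = income)) +
  geom_col() +
  scale_fill_viridis_d()  # discrete viridis palette for the ordered brackets

p_ordered
```

Printing relig_ordered shows the same rows as before; the only visible change is that income and religion are now factors, which matches the speaker's remark that the effect is not obvious until the data is plotted.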
So I could look at a bar chart of just religion in the $40,000 to $50,000 range, sorted because I'm using fct_reorder here. I could do that without ever pivoting my data. But if I wanted to look at a different income category, I've got two places to change my code. And for every other income category I want to look at, I have to change my code again. All of that introduces opportunities to make typos, to make errors, and it makes it far less likely that the code will be reproducible. So by the nature of pivoting it, I can create subsets of the data by income category simply by using a filter statement. Right here, I'm filtering to the value $40,000 to $50,000. And I'm getting what looks like the same chart. Right, but if I wanted any other value here, I only have to change one thing. And if I wanted to look at it for all of the values of the income variable, as it turns out, once again using the tidyverse as a whole, I can use other ggplot functions to look at all of the values, or I could use something in the purrr library to iterate over all of those values. Here's an example where I'm using a ggplot function called facet_wrap to iterate over income. So even though this is a really tall chart as well, and I displayed it tall so it would be easier to read these categories, you can see that, what did I do? Well, I pivoted and I mutated some things, but ultimately that was all so that I could simply use this one function here, facet_wrap, by income, and each one of these subcharts is one of those income brackets. I never had to manipulate a single income bracket individually. I was able to manipulate all of the income brackets at once.
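The contrast described above might be sketched as follows. The geoms, coord_flip(), and all object names are assumptions for illustration; only the functions fct_reorder, filter, and facet_wrap are named in the transcript.

```r
library(tidyverse)

# Without pivoting: the bracket name appears in two places,
# so looking at another bracket means two edits.
p_wide <- relig_income %>%
  ggplot(aes(x = fct_reorder(religion, `$40-50k`), y = `$40-50k`)) +
  geom_col() +
  coord_flip()

relig_long <- relig_income %>%
  pivot_longer(cols = !religion, names_to = "income", values_to = "count")

# After pivoting: the bracket is a value, so only the filter changes.
p_filtered <- relig_long %>%
  filter(income == "$40-50k") %>%
  ggplot(aes(x = fct_reorder(religion, count), y = count)) +
  geom_col() +
  coord_flip()

# All brackets at once: one small chart per income value.
p_facets <- relig_long %>%
  ggplot(aes(x = religion, y = count)) +
  geom_col() +
  coord_flip() +
  facet_wrap(vars(income))
```

The first two plots should show the same bars; the difference is in how much code has to change to look at another bracket.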
One more example. Suppose I want to clean this chart up even a little bit further. I don't want to look at all of these small sections here where there's not a lot of response, and maybe I'm really just interested in this, and maybe I'm interested in highlighting this category, 40 to 50,000, versus this category, over 150,000, but I really want all of the data to be present. I can do all of that with a few more ggplot functions. But again, because this section is about tidyr, I want to note that this is possible because I first pivoted my data into a tall format, which is more amenable to iteration using ggplot functions in combination with the forcats library to make categories. Ultimately, this data story becomes easier to tell because of those features.
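The transcript does not show the code behind this last chart, so the following is only one possible way to get a similar result, not the speaker's method. It assumes forcats's fct_lump_n to fold the less common religions into "Other" and fct_other to keep two highlighted brackets while grouping the rest; the cutoff of six religions, the label "All other brackets", and the object names are all hypothetical.

```r
library(tidyverse)

# The two brackets to highlight (taken from the narration).
highlight <- c("$40-50k", ">150k")

relig_highlight <- relig_income %>%
  pivot_longer(cols = !religion, names_to = "income", values_to = "count") %>%
  mutate(
    # Keep the six religions with the most responses; lump the rest into "Other".
    religion = fct_lump_n(religion, n = 6, w = count),
    # Keep the two highlighted brackets; group every other bracket together.
    income_group = fct_other(income, keep = highlight,
                             other_level = "All other brackets")
  )

p_highlight <- ggplot(relig_highlight,
                      aes(x = religion, y = count, fill = income_group)) +
  geom_col() +
  coord_flip()

p_highlight
```

No rows are dropped here, so all of the data remains present in the bars; the factors only change how it is grouped and colored.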