Hey folks, I'm in the process of trying to make this slope plot look better, and one thing I noticed that might help the visual interpretation of the data is to put the countries in order of whether people were willing to receive the vaccine, rather than in alphabetical order. Also, although I've got the months in the right order of August and October, what would happen if instead of August I had December? Would it still go October, December? Probably not. Well, in this episode of Code Club, I'm going to show you how we can specify the ordering of text variables, categorical variables, like country names and months. So keep watching and I'll show you how we can do this. So yeah, the country names here in the color legend are in alphabetical order. They're not in the order of the percent willing to receive the vaccine. And with the months August and October, we kind of got lucky that they're in the order we want, right? If we had September and October, it would go October and then September, because what ggplot does by default is take these categorical variables, alphabetize them, and then plot the values in alphabetical order. I don't think this is going to totally solve our problem of having 15 colors for 15 countries. But if we could put the countries in order of either the August data or the October data, it might be easier to visually connect the country name with the color. So again, I don't think this is going to solve our problem, but it's a great opportunity to show you how we can modify the order of the countries, as well as the order of the months on the x-axis. To do this, we're going to have to learn about a special variable type in R called a factor. For the longest time, factors were really frustrating to me and caused all sorts of problems. A factor is really nothing more than a categorical variable whose values have a defined order, in other words an ordinal variable. Okay, what's a categorical variable?
A categorical variable is one whose values fall into distinct categories, right? So a country, a month, a gender, a sex, these are categorical variables. An ordinal variable means that the categories have an order. So months are ordinal, because August comes before September, which comes before October. Countries aren't necessarily ordinal, because there's really no reason to put one ahead of another, unless you're thinking about things like population size, distance from the equator, or, oh yeah, whether or not people are willing to receive the vaccine. So we are going to convert our countries and our months into factors so that we can specify their order in our visual. Again, over in RStudio, I have the current R script that we've been working on in the last episode and in this one. If you want your own copy, down below in the description there's a link to a blog post for today's episode where you can get this R script, and further down you will also see the data that I'm using, which I got from the Ipsos website. So again, the output of data has the country, the month, and the percent, each in its own column. One of the nice things about the tibble format is that we can see the type of data in each of the columns. So looking at data here, we see that country and month are both chr, which is short for character, and percent is dbl, which is short for double. Those character columns are our categorical variables. In month we have two values, August and October. I would like that to be a factor so that on the x-axis I could put, just for practice, humor me, October and then August. To do that, we are going to use the factor function. At the end of this data pipeline, let me add that. We'll pipe that to a mutate on month. And so I'll say month equals factor of month, and then I'll add levels as the argument.
And then in levels, I can give it a vector specifying the levels in the order I want them, right? So the default would be August, October. But I could also do October, August. And so now I see the opposite of what we had before: October and then August. Of course, this isn't what we want. We want it to go chronologically from August to October. So I'll go back up here and swap them, putting August first and October second, so that we get basically what we already had. And so of course, then we get August and October back the way we want them. One other thing you can do with the factor function is set the labels. So along with levels, we could also give labels. And then I could say August 20, right? And I could also do October 20. And this of course needs to be wrapped in the c function. So now that says, if I see August, I'm going to replace it with August 20, and if I see October, I'm going to replace it with October 20. Of course, there are other ways to do that, which we'll see later, with things like scale_x_discrete. You could also set the name of the month up here, right where we said August equals total agree August 20. There are different ways to do the same thing. But know that you can also set the labels as an argument to the factor function. One last thing to show you: if we look at the output from data, we now see that month is no longer chr but fct, which is short for factor. And we no longer see August or October; we see instead the labels that I specified. Normally when I work with factors, I wait until the very end, when I'm making the plot, and modify the labels directly on the plot rather than in the data frame, but know that this is also an option for you. So the next thing I'd like to do is modify the order of the countries as they're displayed in my figure. Again, by default, they're being plotted in alphabetical order.
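To make the levels and labels arguments described above concrete, here is a minimal sketch. It assumes a small made-up data frame named data with the same three columns as in the episode; the countries and percentages are placeholders, not the Ipsos values, and the label text is only illustrative.

```r
library(dplyr)

# Made-up stand-in for the survey data; the percentages are placeholders.
data <- tibble(
  country = c("France", "France", "India", "India"),
  month   = c("August", "October", "August", "October"),
  percent = c(60, 55, 85, 85)
)

# Default behavior: factor() sorts the levels alphabetically.
levels(factor(data$month))   # "August" "October"

# Explicit levels: the order given here is the order used on the axis.
reversed <- data %>%
  mutate(month = factor(month, levels = c("October", "August")))
levels(reversed$month)       # "October" "August"

# levels plus labels: each level is renamed to the label in the same position.
labeled <- data %>%
  mutate(month = factor(month,
                        levels = c("August", "October"),
                        labels = c("August 20", "October 20")))
levels(labeled$month)        # "August 20" "October 20"
```

One thing to watch for with this approach: any value in the column that is not listed in levels becomes NA, so the spelling in the levels vector has to match the data exactly.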
And it's kind of hard to see, but the lines are also drawn in alphabetical order. So this pink US line is here, and if you zoom in enough, you will see that the pink line goes over the top of every other line it encounters, because U for the USA is at the end of the alphabet, so it's the last of the countries in alphabetical order. But I would rather France be last, because France deserves to be last, right? Because they're the least willing to receive the vaccine, they go at the bottom. And perhaps China or India should be at the top of my list here, because they're the most willing to receive the vaccine. Then the question is, do we order by August or by October? So looking at our pipeline to create the data data frame, you would be totally justified to think, well, I'm going to come back into this mutate function where I made month a factor, and I'll say country equals factor of country, and then levels equals a c vector where I list out all 15 countries. But that would be a royal pain in the rear, right? And if we then looked at other months of data as they became available, we'd have to update this list, and it would just get really tedious. So we need another approach. Thankfully, there's a package called forcats, which was developed by Hadley Wickham as part of the tidyverse to make working with factors a lot easier. So instead, we'll use fct_reorder. For fct_reorder, we give it the name of the factor. It doesn't have to be a factor; it can be a character column. And so we'll say country. And then we give it the column that we want to reorder by, so I might say percent, right? But you might be thinking, you know, Pat, at this point data only has three columns: country, month, and percent. If I order by the percent column, well, each country is going to show up twice. So what's going to happen there? Well, let's take a look, actually.
And so this is what the output looks like. I know that the big downward slope at the top is China and the flat one is India, right? And so I am seeing China at the bottom, and France at the top, followed by the USA. So first of all, the order is flipped from the order that I have here. And it seems to be using the August data rather than the October data, right? So we're not really sure what's happening here. I've also seen situations like this where you might get an error. So when you use fct_reorder, or even levels, you want to make sure that each level appears only once in the column you're using to set the order. To do that, we can take this line from our mutate and move it further up, before the pivot_longer. If I look at the data right before my pivot_longer, I see that I've got my country and my two months as columns. And so I can then reorder the countries by one of the months. So let's go ahead and do that here. Again, I will add that to this mutate line, where I've got fct_reorder on country by percent. And instead of percent, let's do October. And so now we see India, which is that horizontal line at the top, and France is this orangish-reddish line down here at the bottom. And the USA is this brownish-greenish color here, right? So now things are reordered by the month, but the order is the opposite of what we want. So how do we get India to the top and France to the bottom? Very easy. We come back up to our fct_reorder, and I put a minus sign in front of October. Now I have India at the top, followed by China, which again had the decrease between August and October, so that makes sense, followed by South Korea and Brazil. And at the bottom, we have France. And the USA is now this pinkish color right here, right above or connected to Spain and just below Italy, right?
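The reordering step just described, done on the wide data before pivot_longer, might look something like the sketch below. The data frame name wide, the four countries, and the percentages are made up for illustration; only the pattern of calls follows the episode.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Made-up wide data: one row per country, one column per month.
wide <- tibble(
  country = c("China", "France", "India", "USA"),
  August  = c(97, 59, 87, 67),
  October = c(85, 54, 87, 64)
)

data <- wide %>%
  # Each country appears once here, so the ordering is unambiguous.
  # The minus sign puts the highest October value first.
  mutate(country = fct_reorder(country, -October)) %>%
  pivot_longer(cols = c(August, October),
               names_to = "month", values_to = "percent") %>%
  mutate(month = factor(month, levels = c("August", "October")))

levels(data$country)   # "India" "China" "USA" "France"
```

As far as I know, when a level has more than one value, fct_reorder collapses them with its .fun argument, which defaults to the median. That would explain why reordering the long data by percent gave an order that matched neither month cleanly. fct_reorder also has a .desc argument that could be used in place of the minus sign.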
So again, I think this kind of helps with interpreting the colors and seeing which color matches which country, but not really. If, instead of doing it by October, we wanted to do it by August, all we have to do is come back up here and change October to August. And so now we have China at the top, and France is still at the bottom. But then we have South Africa in between the US and France, because again, it's sorting by the August data rather than the October data. You could even do something a little bit different and say, well, let's do it on the average of August and October, right? So you could make an average column, which is August plus October divided by two, put a comma there, and then throw average into fct_reorder. And then we've got everything sorted by the average of August and October. I don't really want to do that here; I'm just showing you this for demonstration purposes. So I'm going to go back to how we had it, sorting by the October data, because that makes the most sense, to me at least. Again, in this episode, what I've shown you is how we can specify a column to be a factor and set the order and the labeling of the values in that column. We saw that with the month, so that the months across the x-axis were in the order we specified instead of being alphabetical. It doesn't have to be month; you could do gender or sex or race or any other type of categorical variable. Instead of being alphabetical, you can specify the order. We also saw how we could use fct_reorder from the forcats package to specify the order of a categorical variable, our countries in this case. Note also that I did not have to call library on forcats. It came preloaded, if you will, with the tidyverse.
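The averaging variation mentioned above could be sketched as follows. Again the wide data frame and its values are made up, chosen so that the average gives a different order than October alone; the column name average is just the one used in the narration.

```r
library(dplyr)
library(forcats)

# Made-up wide data: one row per country, one column per month.
wide <- tibble(
  country = c("China", "France", "India", "USA"),
  August  = c(97, 59, 87, 67),
  October = c(85, 54, 87, 64)
)

by_average <- wide %>%
  mutate(
    average = (August + October) / 2,          # new column, usable right away
    country = fct_reorder(country, -average)   # highest average first
  )

levels(by_average$country)   # "China" "India" "USA" "France"
```

With these placeholder numbers, ordering by October would put India ahead of China, while ordering by the average puts China first, so the choice of column does change the legend.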
And so again, I think it was really nice to be able to have the ordering of our countries in the legend match the ordering of our lines, at least on the October side of the lines, to make it easier to interpret what's going on, to say, okay, that red is India, that orange is China, and then to see, oh yeah, China really fell off, right? And to see that, oh, this down here is France. This doesn't solve all of our problems with this visual, because there are a few more problems to think about. But if you had fewer countries here, or fewer values of some other variable, then I think it would be a great help to your audience in connecting what's going on. In my world of microbial ecology, people love to make pie charts or stacked bar charts. I hate both. One of the problems that I have with them is that there are often too many colors, right? If you think 15 is a lot, check out some of these figures. Anyway, if you had five categories, and you could order the legend by the abundance rather than alphabetically, then it would be easier to connect the color for, say, the Firmicutes or the Lachnospiraceae or Lactobacillus to that wedge within your stacked bar chart or your pie chart, because you would have that order there, and it would be easier to make that connection between the color and the taxonomic name. Anyway, I hope you find this helpful. Go into your own code and see if you've got cases where categorical data are being represented in ways that you don't really want. Another place we often see this is where you might have a wild type strain and then different mutants. Well, maybe you want your wild type, your control or your reference, whatever that might be, on the left side. And then all your mutants, or whatever other conditions you're comparing against that control, could be to the right of it.
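For the wild type and mutants case, one possible way to do it is with fct_relevel from forcats, which moves the named levels to the front and leaves the rest in their existing order. This function is not shown in the episode, and the strain names below are hypothetical.

```r
library(forcats)

# Hypothetical strain names; by default the levels would be alphabetical.
strain <- c("mutant_b", "wild_type", "mutant_a", "wild_type")
levels(factor(strain))      # "mutant_a" "mutant_b" "wild_type"

# Move the reference strain to the first position.
strain_ordered <- fct_relevel(strain, "wild_type")
levels(strain_ordered)      # "wild_type" "mutant_a" "mutant_b"
```

The same result could be had with factor and an explicit levels vector; fct_relevel just saves listing every mutant when only the reference needs to move.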
Again, there are just lots of applications for factors, and they work really well with ggplot to get the ordering of those variables exactly how you want. So keep practicing with this and we'll see you next time for another episode of Code Club.