It's LinkedIn Learning author Monica Wahee with today's data science makeover. Watch while Monica Wahee demonstrates how to use the UpSetR package to make an upset plot in R.

Hi everyone. Today we are going to make one of these, an upset plot. So let me explain to you what an upset plot is and why you want one. Imagine you have a bunch of experimental units, like survey respondents as we have here. I was analyzing data from the Behavioral Risk Factor Surveillance System (BRFSS), which is a health survey. So these people were asked, "Has a healthcare clinician ever told you you have...?" And then they'd say, "Arthritis?" And the respondent would say yes or no or whatever the answer is. And then they ask, "Diabetes?" And you go through all these conditions. Would you believe some people have more than six of them? That's what you learn when you spend your life in BRFSS data.

Okay, so the analysis my colleague and I were doing had to do with people who had one or more of these conditions. So we did the preliminary analysis and I was wondering, okay, what is the pattern here? What kind of people are we talking about? Are we talking about really sick people with multiple conditions? Or are we talking about people who might be really sick, but they only have one condition? So as you can see, this upset plot was totally helpful for answering this question. As we can tell by these tall bars over the single entries of arthritis, then depression, then diabetes, the answer is most of them just have one of these three conditions. So that is a useful plot because it answered our question.

And now I'm going to show you how to make one. So I put the data set I used for the plot, and this code I'm going to show you, on GitHub. Just look for the link in the description. Now the native BRFSS file is totally huge, like over 400,000 rows, and it's served up in SAS format and, like, CSV and TXT, I think, but not R's native format, which is RDS.
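For anyone new to RDS, here is a minimal sketch of a round trip through that format in R. The tiny data frame, the object names, and the temporary file are made up for illustration; they are not the BRFSS file or the file posted on GitHub.

```r
# Hypothetical stand-in for a large survey extract (not the real BRFSS data).
survey_df <- data.frame(
  respondent_id = 1:4,
  arthritis     = c(1L, 0L, 1L, 0L),
  diabetes      = c(0L, 0L, 1L, 1L)
)

# Save once in R's native RDS format. A temp file keeps the sketch self-contained.
rds_path <- tempfile(fileext = ".rds")
saveRDS(survey_df, rds_path)

# Later sessions read the RDS file back with the column types intact.
survey_back <- readRDS(rds_path)

# Check that nothing changed in the round trip.
round_trip_ok <- identical(survey_df, survey_back)
print(round_trip_ok)
```

Unlike a CSV, an RDS file stores a single R object as is, so there is no re-parsing of column types on the way back in.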
I find it really helpful to put big data files in RDS format if you're going to analyze the data set in R. So I did that. Well, the native coding of the variables was kind of a mess. If you've watched my SAS videos on how to make macros, you'll see I work with those native variables. If you are a native variable, and I'm building a SAS macro about you, that's a bad sign. So anyway, for those reasons, I prepped this data set in R just perfectly for the UpSetR package. And now let's read it in and look at it.

Okay, you'll see here that I read in the file and I call it plot_df. Then I run a colnames command so we can look at the column names. And I run a head command with a parameter of 5. That shows us the first five records of the data frame. Let's highlight and run this code and look at what comes out.

Okay, first I want to point out the column headers. In the package we are going to use, which is UpSetR, it's just easier if you name these columns with the exact name you want to show up in your plot. So you see how I put kidney disease as kidney dis with upper and lower case and a period in it? Yes, SAS users, R lets you do things like that. Very unorthodox. But you're allowed to do it, so I say do it. It makes your plot easier to format. And again, with UpSetR, there is more than one way to prepare your data. But I find it easiest to just create a set of binary flags coded one and zero for have it and don't have it. In upset plot land, these are called your sets.

Okay, now we have the data in. What's next? Oh yeah, all this. See these comments? Let's go to my blog post. Okay, all of my data curation fans, you are gonna love this one. What I did was annotate my plot to make it easier for you to understand all the different options I set. The problem I always have with this package is what I call the New England driving problem.
In New England where I live, the roads are so funky that if you use a GPS and you come to a seven-road intersection or a rotary or some other goofy road thing that they have out here, how is the GPS supposed to tell you where to go? So the UpSetR package is awesome and lets you set all these options. But the problem is, what is everything called?

So here I annotated it with what everything is called in the package. These are the two titles you have to worry about configuring: intersection size title and set size title. And we have two sets of tick labels we can configure: intersection size tick labels and set size tick labels. We don't have to worry about the set names because we named the columns in our data frame the way we want those to come out. So that won't be a problem.

But then we have all these colors to set as well. And of course we have to beautify our plot with communicative and meaningful colors. The main things to color in this upset plot are the set bars, the ones over on the left; the main bars, the violet red ones at the top; the matrix, which is the purple barbell-looking things indicating the patterns in the middle; and the shade, which is in those light-colored bands behind the purple matrix shapes.

So the cheat to setting options in R is to set them either to a variable or to a vector and then call the variable or vector in the plot. You will see I made a cheat sheet for a vector you can make for just the text scale options, which we will use later in the plot. Please note I didn't choose to display data labels on this one. They were just too big and ugly. But if you show data labels, you can include formatting for that at the end of the vector. I called it numbers above bars.

Okay, now let's go back to our code. See this huge comment? This is basically the cheat sheet I put on the blog post that I was just showing you. So we'll skip that. We already went over that. We'll go on here.
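One way to picture that cheat sheet is as a named vector. This is only a sketch: the six positions follow the order the UpSetR documentation gives for the text.scale argument, while the name text_scale_cheat and all of the numeric values are placeholders, not the ones from the video or the blog post.

```r
# Positions follow the documented order for upset()'s text.scale argument.
# The numbers are placeholder scale factors chosen for illustration.
text_scale_cheat <- c(
  intersection_size_title       = 1.5,
  intersection_size_tick_labels = 1.2,
  set_size_title                = 1.5,
  set_size_tick_labels          = 1.2,
  set_names                     = 1.3,
  numbers_above_bars            = 1.0   # only matters if data labels are shown
)

# The names are just a memory aid. Dropping them leaves a plain numeric
# vector in the same order.
text_scale_options <- unname(text_scale_cheat)
print(text_scale_options)
```

Keeping the names in a separate "cheat" vector means you can look up which position controls what without going back to the documentation each time.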
So here is me using my little cheat sheet to create the text_scale_options vector. But as you can see, I couldn't decide. I made three different vectors with different options so I could try different ones on. You'll see later that I settled on the third vector and used that one in the plot.

Now here I make variables to set the colors. So I'm calling them main_bar_col, sets_bar_col, matrix_col, and shade_col. And if you use my diagram when you do this, you know what colors you are setting. So you'll see all these variables show up in the plot code.

Oh, and then there's this mb.ratio option that I wanted to set, so I created a vector. Let me read to you from the documentation: "To change the proportions of the plot heights assigned to the matrix and intersection size bar plot, use the mb.ratio parameter entered as percentages." Okay, I could not really imagine that in my head, so I set these parameters here in this vector. You will see that I named it mb_ratio1 because I was just throwing spaghetti at the wall. I really didn't know what proportions I wanted, but it turned out the first spaghetti I threw stuck, so to speak. I didn't need to make any more, but if I had, I would have made mb_ratio2 and mb_ratio3 like I did up there with the text scale option vector.

That's how I like to try on different formatting in R plots. It's kind of like when you go to buy a dress or a skirt and you just get a bunch of sizes of the same one off the rack before you go in the dressing room. It really saves the time of having to put your clothes back on and go back out if you pick the wrong size, and this is the same sort of idea.

Okay, let's proceed. All right, here is where I create the vector set_vars with the list of the variables in the data frame that contain our sets.
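The option objects described above might look something like this. It is a sketch only: the variable names come from the video, but every color, every ratio, and the set names in set_vars are assumptions made for illustration.

```r
# Placeholder colors. The transcript mentions violet red main bars and a
# purple matrix, but the exact color values are assumptions.
main_bar_col <- "violetred4"  # intersection size bars across the top
sets_bar_col <- "turquoise4"  # set size bars on the left
matrix_col   <- "purple4"     # dots and connecting lines in the matrix
shade_col    <- "wheat4"      # bands behind the matrix

# Height split between the two panels, written as fractions that add to 1.
# As far as I can tell, the first value is the bar plot's share and the
# second is the matrix's share, so check the UpSetR help page to confirm.
mb_ratio1 <- c(0.55, 0.45)

# Names of the binary flag columns to graph (hypothetical column names).
set_vars <- c("Arthritis", "Depression", "Diabetes")

# Quick sanity checks before any of these go into a plot call.
colors_ok <- all(c(main_bar_col, sets_bar_col, matrix_col, shade_col) %in% colors())
ratio_ok  <- isTRUE(all.equal(sum(mb_ratio1), 1))
print(c(colors_ok = colors_ok, ratio_ok = ratio_ok))
```

Trying a different look then means adding, say, mb_ratio2 next to mb_ratio1 and swapping which one the plot call uses, without touching the rest of the code.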
In other words, these are the variable names of the binary flags we want graphed. Now after I did this, I realized I could have automated it. Like, I could have turned the colnames of plot_df into a vector. But the way I did it is the way you want to do it if you have extra columns in your data frame that you do not want graphed.

Okay, now it's time to call up our libraries, just ggplot2 and UpSetR. Now let's take a look at our plot code. So you will notice I formatted this in a very neat way, but it is not ggplot2 code. There aren't a bunch of pluses in this code. It reads more like base R code with a lot of commas, but it should be fairly easy to read. You should be able to pick out all of the variables and vectors we made that we are calling up now. Remember them? It's kind of like a treasure hunt. You will see I hard-coded the x and y labels. I could have made those variables too, but I knew what I wanted to say.

Okay, enough talk. Let's Ctrl+A and Ctrl+R. What can I say? I'm speechless. It's gorgeous! Another successful data science makeover.

Thank you for watching this Data Science Makeover with LinkedIn Learning author Monica Wahee. Remember to check out Monica's data science courses on LinkedIn Learning. Click on the link in the description. Thank you for watching this video. If you found what you learned useful, then please hit the like button. Also, I invite you to look around my channel, and if you like what you see, subscribe. I hope your data aren't misbehaving and it's a good day. Peace.
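For readers following along without the video, the steps walked through above can be sketched end to end as follows. This is not the code from the video or the GitHub repository: the data are simulated, and the column names (including Kidney.Dis), colors, scale values, and axis labels are all assumptions. The argument names are the ones documented for UpSetR's upset() function, and the plot call is skipped if UpSetR is not installed.

```r
# Simulated stand-in for plot_df: binary flags coded 1/0, plus one extra
# column that should not be graphed. None of this is real BRFSS data.
set.seed(42)
n <- 200
plot_df <- data.frame(
  respondent_id = seq_len(n),
  Arthritis     = rbinom(n, 1, 0.35),
  Depression    = rbinom(n, 1, 0.25),
  Diabetes      = rbinom(n, 1, 0.15),
  Kidney.Dis    = rbinom(n, 1, 0.10)
)

colnames(plot_df)   # look at the column names
head(plot_df, 5)    # first five records

# Listing the sets by hand leaves out respondent_id, which
# colnames(plot_df) alone would have included.
set_vars <- c("Arthritis", "Depression", "Diabetes", "Kidney.Dis")

# Placeholder formatting objects (all values are assumptions).
text_scale_options <- c(1.5, 1.2, 1.5, 1.2, 1.3, 1)
main_bar_col <- "violetred4"
sets_bar_col <- "turquoise4"
matrix_col   <- "purple4"
shade_col    <- "wheat4"
mb_ratio1    <- c(0.55, 0.45)

# Base-R-style call: arguments separated by commas, no ggplot2 pluses.
if (requireNamespace("UpSetR", quietly = TRUE)) {
  p <- UpSetR::upset(
    plot_df,
    sets            = set_vars,
    order.by        = "freq",
    text.scale      = text_scale_options,
    main.bar.color  = main_bar_col,
    sets.bar.color  = sets_bar_col,
    matrix.color    = matrix_col,
    shade.color     = shade_col,
    mb.ratio        = mb_ratio1,
    mainbar.y.label = "Number of respondents",     # hard-coded y label
    sets.x.label    = "Respondents per condition"  # hard-coded x label
  )
  print(p)  # draw the plot on the active graphics device
}
```

Passing set_vars through the sets argument is what keeps respondent_id out of the plot, which is the reason for listing the set columns by hand instead of reusing every column name.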