Well, good day, everybody, and welcome to today's presentation, Lovely Labels with Paste in R. Thank you for coming today on this nice Tuesday. It's good to see you. I'm Monika Wahi, and I'm a data scientist. As you can tell, I like R, SAS, and more. So if you hang around me, you're going to learn about R, SAS, and more. But today, we're going to focus on pasting. I use RGui. Some of you may use RStudio, especially if you're used to doing more web development. I'm an epidemiologist and biostatistician by training, so usually when I'm using R, I'm doing a research analysis. It's sort of static, like when I write a peer-reviewed article.
But either way, I need to make visualizations, and you need to make visualizations. One of the things that can be difficult is when you're making a data visualization in R, and I'm using that term for basically a bar chart or a pie chart or a time series graph, maybe with ggplot2, and you don't like your data labels. Maybe you don't want data labels all the time. I often do want data labels, and I often do not like them.
In epidemiology, and probably in a lot of fields, we often save codes. For instance, we have FIPS codes for each of our states in the United States, and Florida, I think, is 12. So if I'm doing output, I might want it to say Florida and not 12. But if I'm doing output about Florida, I also might want a label to say "state equals FL, FIPS equals 12." I might want it to say this complicated label. That's why I like the paste command in R.
Now, you're probably going, well, that just sounds like you're concatenating things. Yes, it is basic concatenation. It's a basic way to assemble a vector in R. We have these vectors, right? And you can assemble a vector. You can put characters in it. You can put numbers in it. You can call up values and put them in there. You can put variables in it. But basically, it's a way of concatenating a vector.
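As a rough illustration of the Florida label just described, here is a minimal sketch. The object names `state_abbr` and `fips_code` and the exact wording of the label are assumptions made for this example; only the idea of combining the state abbreviation with FIPS code 12 comes from the talk. The `sep` argument used here is explained later in the session.

```r
# Hypothetical objects for illustration; names are not from the talk.
state_abbr <- "FL"
fips_code <- 12

# paste() converts the number to text and joins all the pieces.
# sep = "" means nothing extra is inserted between the pieces.
state_label <- paste("State = ", state_abbr, ", FIPS = ", fips_code, sep = "")
state_label
```

The result is a single character value that can be reused wherever a label is needed.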
But what I care about is this whole business of labeling your plot, right? So the little picture is that I'm going to teach you how to use the paste command in R for just making vectors and making them say what you want. But what I'm really going to do is teach you the fancy footwork, the bigger picture behind how you can use it in data analytics and specifically data visualization.
But before I go on, I want to invite you to my free online workshop, which is taking place Saturday, November 18th and Sunday, November 19th, and it's on Zoom. So you can attend from the comfort of your own home or office or wherever it's comfortable for you. Some people are in their cars, I think. Anyway, you're welcome. It's free. It's six-plus hours. Basically, half of it is on Saturday and half of it is on Sunday. There are two group sessions. And what's the topic? The topic is application basics. It follows an online course I have, which is a core course in my online data science group mentoring program, which you can join if you're interested in doing that and making portfolio projects. But if you just want to learn about applications and come and network, sign up for my free workshop. This is the first time I'm holding it on the weekend, so I expect there to be a lot of people there, because a lot of people work during the week and the weekend hours might be better.
Also, you should already have a link to download these slides, which you'll want to do because they have links in them. But look in the chat for the link to sign up for the workshop. So again, what the workshop is about is application basics: how applications are built, who builds them, how they're designed. Now, some data scientists already know all this, because maybe they came to data science from a business background or from more of a computer programming background.
If you're in healthcare or biostatistics like me, we don't really get trained in how applications are designed and built. And one thing we're not taught about is application pipelines. Right now there are a lot of analytics platforms that you can put into your application pipeline, like R, like SAS, like SAS Viya especially. So if you want to talk about things like integrating application pipelines and get knowledge of application basics, please come to our free workshop, and I guarantee you'll have a good time.
All right, thank you for that moment, and we will return to the regularly scheduled program, which is about paste in R. Okay, so see these resources here? I've got videos demonstrating, and I've got code you can download there. But I'm going to just reiterate why the paste command is so useful in R. The paste command is the main way you can concatenate text together, and you can put numbers together. You can put anything together, but it's the main way you do that. In SAS, I think it's one of the CAT functions or something like that; in R, it's paste.
So you need it to create labels. A formatted label could be printed in the header of a report. Like I was saying, "report number blah, date of report blah." You can assemble that whole header using paste and the values and then place it there. And also, like I was describing, you can place it on a plot to label things. You can use all kinds of input to create this label, which can be saved as a vector or even as a variable in a data frame. And I'm going to show you that right now.
So now, enough talk, and we'll go over to R. Okay, here we are. If you go to that blog post, you'll find there's this simple code here. Let me first go over what's going on in this code. I started with a really simple case, and the simple case is where I'm entering all of the arguments of the paste and I'm hard coding them, right?
So I thought of a phone number. Where I grew up in Minneapolis, we had this area code, 612. So I thought of a phone number like 612, comma, so there are three numbers here, 781, comma, and then 8888. I just made this up. Let's say you have three numbers here and you want to put them all together. You want it to say 6127818888, all together, and have that be a vector. What you'd probably want to do is put this paste command and then a closed parenthesis, but you can't stop there, because by default paste puts a space between the pieces. So you have to set the sep. See sep here? That stands for separator. Now, remember how I just said I wanted it to say 6127818888 and just be mushed together? Well, you have to set a separator. How you do that is you say equals, and these are just quotes right next to each other. There's nothing in between them, which is your secret code to R that says don't separate it with anything, okay?
Now, if I just run this code, and I'm going to do Ctrl+R to send it to the console, you see what the value is. See, I told you that was going to happen, right? But we want to save this in an object. So we can save it in a vector here. We're going to save it in this vector called phone number. So Ctrl+R, okay. Now we have this vector called phone number. I can run phone number, and you know what it looks like. It's going to look like that. And the class, so what is this? I told you it was a vector, but we can look at it and see this is actually a character vector, right?
All right, so the purpose of this demonstration was just to show you how to actually use paste. But now we're going to get a little fancy, okay? So what are we going to do with this next one? Well, we're going to make two changes. One is we're going to make one of the arguments actually be letters. I just made up TVTV. I don't know why I did that, but you'll notice that when it's letters you have to put quotes around it, okay? The other thing we're going to do is change the sep.
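For reference, the first example walked through above can be sketched roughly as follows. The object name `phone_number` is assumed from how it is spoken in the talk; the exact spelling in the downloadable code may differ.

```r
# Three hard-coded pieces; sep = "" tells paste() to put nothing between them.
paste(612, 781, 8888, sep = "")

# Save the result in an object so it can be reused.
phone_number <- paste(612, 781, 8888, sep = "")
phone_number

# Even though the inputs were numbers, paste() returns a character vector.
class(phone_number)

# For comparison: leaving sep out uses the default, which is a single space.
paste(612, 781, 8888)
```

The last call shows why `sep` has to be set explicitly when the pieces should be mushed together.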
We're actually going to make that be a dash. So now what we expect is that this is going to say 612-781-TVTV, with dashes in between, okay? We can run it to the screen, and good, it came out the way we thought. And then we can run it here, where it's called phone number with dashes, and we can look at it, and there it is. Okay, so imagine you assembled some sort of vector like phone number with dashes and you wanted to put that on everything. If you wanted to put that on the header, you could do that. It's nice because now you have this whole thing assembled, but what's really cool is when you put values of variables in it.
So what I did was create a fake data set here called line items. I don't even really like this data set, but it was easy, so I created it. I'm going to read it in, and you can get it on that blog. Let me just show you what is in line items. It's not even a real data set, but it's just to mimic one. Remember the Northwind data set? For those of you who are old like me, that was a demonstration data set that was handed out with Microsoft Access. Well, there was a table in it that looked a little like this. It had line items from reports or something. So here we have line item ID, report ID, report order, cost, and so on, and you've got these costs over here. Okay, so this is a data set. It's a data frame.
Well, one thing you might want to know about this data frame is what the maximum total cost is. If you wanted that, this is how you run it. You say max, and then line items, that's the name of the data frame, then a dollar sign, and then the total cost column. So let's just run this to see what the maximum is. Oh, it's $293.88. Now, this is where the power of paste comes in. Let's say that you were running reports, okay? And you wanted to put this maximum cost in the report header or in a report label.
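A minimal sketch of the two steps just described: the dashed phone number and the maximum of a column. The real `line_items` data set is read in from the blog post, so the small data frame below is only a hypothetical stand-in, and the column names `report_id` and `total_cost` are assumptions; the one value taken from the talk is the maximum of 293.88.

```r
# One argument is now letters, so it needs quotes; sep = "-" puts a dash between pieces.
phone_number_with_dashes <- paste(612, 781, "TVTV", sep = "-")
phone_number_with_dashes

# Hypothetical stand-in for the line items data frame (made-up rows and column names).
line_items <- data.frame(
  report_id  = c(1, 2, 3, 3),
  total_cost = c(120.50, 293.88, 45.00, 87.25)
)

# data_frame$column pulls out one column; max() returns its largest value.
max(line_items$total_cost)
```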
So see how I made this paste here? It says paste, and then the first argument is "The maximum total cost is" followed by a space, and there's this comma here. Then I just literally put in the max of line items total cost, you know, this thing here. Then I put an exclamation point, and then the sep is nothing, right? Now that we have this data set in here, we can do this. Okay, so I'm going to run it and create report label, and then let's go look at report label. I've got this in the way now. So you see here, you can see what's cool about that.
But then there's other cool stuff you can do. You can prepare labels and put them in another column. This data frame is called line items, right? So let's say I create line items new label. I'm just making this up, okay? I'm imagining I want to make a label that might go on a scatter plot or something. So I do paste, and first I want to say which report it is. So I'd say "Report", comma, and then, see this report ID column? I could use that. So then I'd say line items report ID, right? That value should show up mushed right up against this text. Then we'll say comma, and then "cost," and then we could put the cost column from line items, right? We just put in whatever the cost was. Okay, I'd better see how confusing this gets. Let's see, here this is cost, and then the line items cost column, okay? And we can't forget sep, right? It's nothing. Let's see if this works. Okay, it looks like it works.
So let's go look at just this field that I just made. This is what it says: here's report 26, cost is this, report 27, and so on. And let me actually show you the data set. Actually, I'll just show you the top of it. I'll say line items, and that'll show the top five rows. I think this is how you can do that. Oh no, this is number five, row five. I've got to do it like this, one through five.
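The report label and the new label column described above might look something like this sketch. The data frame is again a hypothetical stand-in with made-up rows, and the column names `report_id`, `total_cost`, and `new_label`, along with the exact label wording, are assumptions for illustration rather than the code from the blog post.

```r
# Hypothetical stand-in for the line items data frame.
line_items <- data.frame(
  report_id  = c(1, 2, 3, 3, 4, 5),
  total_cost = c(120.50, 293.88, 45.00, 87.25, 10.00, 64.10)
)

# A single label built from fixed text plus a value computed from the data.
report_label <- paste("The maximum total cost is ",
                      max(line_items$total_cost), "!", sep = "")
report_label

# paste() is vectorized, so this builds one label per row and stores it as a new column.
line_items$new_label <- paste("Report ", line_items$report_id,
                              ", cost ", line_items$total_cost, sep = "")

# line_items[5, ] would return only row five; 1:5 selects rows one through five.
line_items[1:5, ]
```

Note that `paste()` converts numbers with their default text form, so a cost of 120.50 appears as 120.5 in the label. If fixed decimals matter, one option is to format the number first, for example with `sprintf("%.2f", line_items$total_cost)`.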
So you see these top five rows here? Here's the report ID: it's one, two, three, three. And here's the total cost. And you see here how I've put them together into this label. Now, the idea is that I've actually saved this as a new label, so I could go and make a plot and refer to that as the label. But what can get kind of hairy is that, let's say I want this to be sorted in the order of report ID. I can't really sort by this label, because it's text, and things get goofy. So you end up having to sort by an actual value that you're probably graphing. And for things like fill, if you want to fill in ggplot, you've got to refer to the original values in here. But the label is what you can choose to display. This is what you choose to show up on the plot. I'm sorry I'm not getting deeper into how that works.
But I can go back and give you an example here, because I found one on the web. It's actually kind of hard to find them. This is an example from R CHARTS. What's going on here is this person was doing this tree plot, and they wanted it to say this is group nine with 41 people in it, group three with 50 people in it, or whatever. But they didn't actually have this label assembled in their data like I just did. So instead of doing what I just did, they called paste on the fly. Here's their ggplot, and here's the label. See this paste? Group must be this group here, group nine, right? They must have a column where it already says group three, group nine. And then value is 41, right? And then the sep is equal to, I don't know if you can see that, but in quotes there's a backslash and then an n. Those of you who use R a lot will know that means a line break, like a carriage return. So basically group nine and 41 are separated by an enter, which is why the 41 is on the next line. And I've never seen that before.
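The newline separator from that web example can be tried out in base R without any plotting package. This is only a sketch under stated assumptions: the data frame `plot_data`, its column names `group` and `value`, and the group wording are hypothetical, loosely based on the spoken description. In the example described above, the same `paste()` call sits inside the ggplot label mapping instead of being stored first.

```r
# Hypothetical data resembling the spoken description of the web example.
plot_data <- data.frame(
  group = c("Group 9", "Group 3"),
  value = c(41, 50)
)

# sep = "\n" puts a line break between the group name and the value.
two_line_labels <- paste(plot_data$group, plot_data$value, sep = "\n")

# Printing shows the escape sequence itself.
two_line_labels

# cat() renders the line break, so the value appears on the next line.
cat(two_line_labels[1])
```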
But you probably won't find me building a paste label like this inside a ggplot, calling paste in there. What I'll do is assemble a plot data set beforehand and create the label field. Why? Because I want to make sure it looks the way I want. And also I can be picky. I can change it. If I want one of them to say, let's say, baseline or comparison or whatever, I can just modify that row. Whether it's ggplot2 or base R, and I generally use ggplot2 for plotting, I always make a plot data frame. I always assemble a plot data frame just to serve the plot.
If you're used to using SAS, that's a new concept for you. It was a new concept for me when I got to ggplot2. In SAS, we're used to putting everything in one procedure. PROC LIFETEST is a great example. You want to get a Kaplan-Meier, you put that in, and it just takes forever for LIFETEST to run if you're using PC SAS. Maybe LIFETEST is not a good example. Take PROC UNIVARIATE. You give SAS a continuous variable, and PROC UNIVARIATE does your moments and your summary statistics, and then it gives you a box plot. Whereas in R you can do a lot less. You can just ask for the box plot. R is just much more lean.
And when you're graphing, you're often really only graphing the summary statistics, so you can assemble them beforehand. For example, if you want to put error bars on a plot, you can just calculate what the error bars are in your plot data frame. I even have a blog post on how to put error bars on a plot. It's like one line in your ggplot2 code. So I always have the error bars already calculated as fields in the plot data frame, just like with labels. You don't have to do it that way. You can do it on the fly, but that's my style, because that way I can make sure that I did it right. I see more people are here. Let me just remind all of you about my free workshop.
It's for you if you're interested in applications, like what I was just talking about with paste in R. So why is that so important? Well, I'm always automating things. Daniel's here in the chat, so he knows. He's always automating SAS. He's always making SAS reports, I guess, because most people doing what you do are making SAS reports. And when you automate R for reports, you're going to have to use a lot of paste. It's probably a good idea to do that if you're going to automate plots, too.
Well, what do I mean by automate? You've got to do ETL, you've got to figure out your display, and you've got to integrate an application pipeline. That's why the theme of my November workshop is integrating application pipelines. Let's say you do use paste because I taught it to you today. Where is it in your pipeline? What are you pasting? What are you trying to produce? Have you got a dashboard going on? Or maybe you're delivering an online newsletter with an image in it? I mean, who knows, right? So that's what our online workshop is about: learning about applications, learning about linking them up, making application pipelines, and the kinds of things you have to think about when you do that, especially for analytics. Because if you're a data scientist and you're sitting on an application pipeline, you probably want to analyze the data coming through the pipeline. So how do we set that up and make that work? Because that's probably your job. It usually is mine.
So if you're available Saturday, November 18th and Sunday, November 19th for an online Zoom workshop on that topic, please sign up, and I'll put the link in the chat. You'll get access to this online course, Application Basics, which is one of the core courses for my online data science mentoring program. If you're interested in that program, you make online portfolio projects.
So I teach you basically how to do analytics and present your results. It's hard on one hand, but easy on the other. The analytics itself, like survival analysis and stuff, is super fun. But the whole issue of what are we doing, what problem are we trying to solve, and what are your recommendations after you did all this? That's actually hard, right? That's harder than survival analysis at the end of the day. It's easier to explain your survival analysis than it is to explain a basic analysis about some complex thing that led you to make some decision. That's sort of what the whole mentoring program is about. But this course is a core course in it that teaches you about applications, because if you're going to solve these big problems, you've got to understand how applications are built.
I learned all of this from colleagues. I learned it from working in IT departments, where I felt very much a fish out of water. I didn't understand any of the terminology people were using. I didn't know what was going on. But luckily I didn't have any trouble telling people that I didn't know what was going on. So I had really good mentoring and I was taught, and now I've put all that into this course for you in this workshop. So hopefully you can show up and participate with us.
Thank you very much for coming today. This is just a short little thing to teach you about R. Well, if you have questions about R, SAS, or more, feel free to ask. And if you want a free consultation with me, you can always sign up for that. Just connect with me on LinkedIn and let me know that you want one, and I'll give you a link to sign up. But please keep in mind that I'm going to do these Tuesday lectures, and I record them. So maybe you're watching this recording, and hopefully you can register for the other ones.
I'm trying to cover different topics that people are asking me questions about when they meet me. And so thank you, I'm seeing some thank yous here. Thank you for coming. And I hope you enjoyed this presentation and I also hope that you sign up for our workshop and I get to see you again. Have a good week. Thank you for watching this video, which is part of the Public Health to Data Science rebrand program. If you are interested in joining the program, please sign up for a 30 minute Zoom interview using the link in the description.