Great. Okay. So as you heard, the recording is in progress. You don't have to have your videos on, but I think we can get started. Oh, wait. Would you like to introduce yourself, [name unclear]? I'm sorry, I missed you. And any other TAs after that as well.

I just have a little bit of background noise. So, it's [name unclear]. I'm a graduate student at the University of Victoria here in Canada. I used to work with the Ministry of Health, so that was my first interaction with R. That was last year. And yeah, I've been using R since 2019 actually. Nice to be here.

And Julianne, would you like to introduce yourself?

Yes, thank you. Hi, I'm a fourth-year PhD candidate at Northwestern studying health and biomedical informatics. I use R every day to study health outcomes after stroke, and I've been using it since 2018.

Great. If you have questions, just as a reminder, put them into the chat. All the TAs are there and everybody can interact with them. Ted just posted again in the chat the link to RStudio Cloud. And so we're going to get started. I will start and then we'll hand it over to Daniel.

Okay. Can I get a thumbs up if you can see the screen? Great. All the materials are available to you in RStudio Cloud, and they'll also be available on GitHub afterwards.

So just a little bit of background. If this doesn't fit you, maybe you can run away and not spend three hours here. But why should you come to this workshop? We'll talk a little bit about why you might want to be using R. We'll write some of your first R if you've never written R, or if it's been a long time. If you haven't heard the word tidyverse before, you're going to learn a lot of stuff today. And then we're going to show some things and not dive too deep into them, just to get you excited and give you a flavor of what's possible in R and the R ecosystem.
So I think we can skip this for the R workshop because we're all here. Okay. I'd like to take this moment to pause and say: if you are having issues getting things set up, please message in the chat sooner rather than later. And then Daniel, do you want to take it from here, or have me keep going with the introduction and orientation?

I'm not going until after RStudio Cloud.

Perfect. Okay, so let's blow this up a little bit. I love this graphic. Allison Horst is very prolific. I don't even know what category to put her in for amazingness, and she's a very good illustrator. I think this graphic encapsulates a lot of what it is like learning R. When you start, it can be really frustrating. You're like, which window am I supposed to be typing into? I don't get it. Then you start to make some progress, and then you start doing more difficult things and you're going to hit the valley of despair of, oh, I don't get it. Like Ted said: oh, maybe this just isn't for me. I didn't learn how to program when I was 16, so I've never learned to program. False, false, false.

One thing I think is amazing about the R community is the high quality of just random people's blog posts showing what they've done, and the people on Twitter sharing work they've done and offering to help each other. I posted once, oh, I'm thinking about making a package, and I had like 20 different people in my DMs saying, if you need any help, reach out to me. And then you're just going to get really comfortable with always learning. Ted, Daniel, do you have other little tips or thoughts?

So the #rstats hashtag is amazing on Twitter. I've met a ton of people through it and everyone is super helpful.
So if you have any questions and you don't think you can find the answer, oftentimes I will ask on Twitter with the #rstats hashtag and someone will get back to me, which is pretty amazing.

Great. So just giving you a little bit of background on RStudio Cloud. For this workshop we're programming using RStudio Cloud. Why? Well, RStudio Cloud is a fantastic way to teach R and also to get people who are new to R up and running without a lot of the friction points that can happen with installation issues on your local computer. There's nothing that you need to download with RStudio Cloud. And just as a reminder, you don't need to use RStudio to use R, but I do, and lots and lots of people do. It's a very friendly way to interact with R. The good thing about RStudio Cloud and the RStudio that you would download onto your computer is that they look very, very similar. So even if you're learning things on RStudio Cloud, those are going to directly translate once you have it on your local machine. And both R and RStudio are free to download to your computer.

Okay, so at this point a lot of you have already logged on to RStudio Cloud. If not, do so now. We're going to paste the link again into the chat. Please either raise your hand in the Zoom session or put a message in the chat. When the speakers are speaking it's a little bit difficult for us to look in the chat box, but TAs are going to be on the prowl for questions. If TAs or other instructors happen to see a really important question, please just interrupt the speaker so we can address that and clear it up for everybody.

Just one thing: if you're feeling uncomfortable messaging the whole group, feel free to message the TAs individually, because they are really good and they can help you.

Excellent. Okay.
So if you click on that link to get into the project, you're going to see something that looks similar to this. If you don't, let us know in the chat. But once you see that Intro to R for Medical Data, just click on it and open it up. It does take a few seconds to load.

And while we're waiting for that, I just want to show you the orientation of this space that you're going to see. These are the four panes that you're going to see in RStudio, whether it's on your local computer or on RStudio Cloud. Going clockwise, starting with pane one, called Source, we'll talk quickly about what each of these does.

Source is generally where we're going to be doing a lot of our work. That's where you're going to be writing your code. We're in this special file, which we'll talk about a little bit later, called R Markdown, and you can actually run your code in there as well and see the output up there. If you're creating new files, you're going to see them in there.

In the top right corner, you're going to see the Environment. That is where things show up when you create something from your programming. For example, if you read in an Excel file and name it something, you're going to be able to look over there and see the R version of that object.

I guess I should have been more conventional in my numbering of things, but then we're going to go from two to four, into Files. Here's where you're going to see the files that are associated with your project. And you're going to see there are tabs, much like in a browser, where you see things like Plots, Help and Viewer, and we'll go into that in some more detail. We're going to open something up soon in the Files area, so be looking there.

And then the final place here is the Console or Terminal. This is where you could directly enter code and run it, when you're just doing quick things or running scripts and stuff like this. I don't do a lot here because it's easy to lose your work out here. So don't worry much about it as you're starting out in R. As you get more advanced, you'll see what the advantages are of things down there.

Okay, so now for a very quick orientation to R Markdown. I want everybody to go into this Files area right here and look for the Intro to R for Medical Data workshop. We're going to open up the .Rmd. Something like this. Make sure it's the .Rmd and not the HTML or anything else like that. The little icon looks like a little piece of paper with a red circle on it. Could I get some thumbs up in the reactions just to see if people are getting there? Great, great. Okay.

The document that you've opened up is not the same as this image here, but I want to talk about some of the very generic components of an R Markdown file. You can imagine an R Markdown file as a combination of code and text, and there are finer details to that. There are three big pieces you're going to see. At the top of a document is something called YAML. We're not going to really talk anything more about this, but just know that there are ways to make your documents super customizable, with multiple authors, how you want citations formatted, just crazy stuff. For right now, you don't really need to know anything more about YAML for the rest of today.

Then you're going to see these gray sections of the document that are offset by these backticks and some curly brackets here. This is a code chunk. It means that you have put code in there, and you can execute that code by hitting the little run button, which is the triangle over here.

And then the final component is this text. It's really just text, as if you were typing in a Word document or an email. You can use Markdown syntax. R has its own little dialect, R Markdown, but they're very compatible, very similar.

I think I've said all of this. Daniel, Ted, if you have other things to say, jump in now, or else I think we're going to have Daniel start talking about exploring some data. Okay, I'm going to stop sharing, unless, Daniel, do you want me to share my screen of the Rmd?

I can do it.

Stop sharing.

Hopefully this works. Okay, so as Mara said, hopefully you have the Intro to R for Medical Data workshop Rmd open. If you scroll down a little bit, you'll notice that there is a published version of this. A lot of the things she was saying were already covered in the first part, up until about line 147. So everything that she showed that was rendered all nice and pretty was essentially the document rendered up until this point. Right now, since there are pieces of this code that we are going to work on together and fill out, you won't have the ability to render this without errors, because we literally blanked out and invalidated our code. But hopefully by the end of this workshop, you should be able to render it and get a nice, pretty website view of everything that we're working on.

So at least for me, and if you didn't add in any new lines or anything, at about line 148 you will see the first R block that we're going to work with. You'll notice these backticks. On a standard US keyboard, the backtick is the key to the left of the number one. I believe if you're on an AZERTY keyboard, it's where the number six is. I think it's somewhere around there.
Between these two sets of backticks, you'll notice that in RStudio Cloud or in RStudio, the background is going to be a little bit different. It's going to be grayed in. This is your visual cue that RStudio understands this block as R code that you can run, versus regular prose text that will be displayed in a nice, pretty format.

R is a glorified calculator, so you can run the code in these blocks in one of two ways. On the right-hand side, you can click this little play button. You'll notice that it will take the code that's in here, which right now is one plus two, and it will return three at the bottom. And I'll make this a little bit bigger.

If you are planning to follow along, we are going to be writing a bunch of code. So if you want to write notes to yourself, you can use this hash symbol, which is shift and the number three on your keyboard, and write a comment. This will be considered code that R won't execute. So the text here, "this is a simple calculation", isn't actual code. It is a comment for you.

Since we are using RStudio Cloud, what you want to do at the end of this workshop, and we'll mention this at the very end as well, is select the R Markdown document in the right-hand panel, click on More, then go to Export, and actually save this document to your local computer, because this RStudio Cloud system is not going to be there forever. So if you do use this document to take notes for yourself, remember to export it.

Another option: if you look up by Daniel's avatar in the right-hand corner, a little to the left of that is something called Save a Permanent Copy. If you click there, you will have a copy of the space for yourself.

Yeah, that is another way. All right.
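As a minimal sketch of what the inside of such a code chunk might look like (the name result is made up for illustration and is not from the workshop file):

```r
# This is a simple calculation. Lines starting with # are comments,
# so R skips them when the chunk runs.
1 + 2

# Store the value under a (hypothetical) name so it can be reused later.
result <- 1 + 2
print(result)
```

Running the chunk with the play button, or selecting the lines and pressing Control-Enter, should give the same output either way.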
So throughout here, because this will end up being rendered into a nice, prettier document with screenshots of things, there are going to be portions that aren't really R code in the sense of analysis code that we're going to have R run. They're going to say something like include_graphics. If you want to run a piece of R code like one plus two, you can either press this little play button, or you can have your cursor on it, or select the code that you want, and hit Control-Enter, and it will take all of the code that you have selected and run it, with the output at the bottom. So there are multiple ways you can run these little code blocks. And you can see, if I run this little code block with include_graphics, it's going to say, here's the actual picture, and it renders that image underneath. So that's the basics of navigating this space.

The format of teaching that we're working with right now is that, instead of typing everything from scratch, you should have this document with most of the stuff filled out, so you can follow along without typing everything manually. In some sense it will save us a little bit of time. Our main job right now is just to get you over that little first hump of the R learning curve.

All right. Most people who work with data and who aren't, quote unquote, coming from the data science world are probably working in Excel files. We are very often working in Excel because that's the thing we use to look at our data. We can run one-off calculations in Excel. If you've never used a programming language to work with your data, that's usually what you're doing. So in R, there is a library called readxl, and this is a library that gives us a function to load up Excel files.
So if we select this block of code and hit Control-Enter, you'll see on the bottom left-hand side here that it will load the readxl library. In our current setup, if you go to this Files panel, you'll see that there is a data folder, and there is an Excel file called smoke_complete.xlsx. You can see that little portion, going into the data folder and pointing to the Excel file, right here where I have it selected on line 172. Because we loaded up the readxl library, we now have this read_excel function. If I select just this little portion here and run it, you'll see that it will load up this Excel file from our computer. So try this out on your own Excel files.

If you have an Excel file with different sheets, there is a way to load up a specific sheet. But in R, if you have multiple sheets in an Excel file, you essentially have to load them one at a time. It's not like in Excel, where you can open up a file and have all of those sheets available to you at once. Part of that is just good data practice: one file, one data set. At least for this class, there aren't big giant batches that get loaded all together automatically. You have to do it very manually.

So you'll see here that if we just run the read_excel call, at the bottom it prints out a nice, pretty version of this Excel file. This is useful. We can load up an Excel file, but as we're working with data, we don't want to reload this Excel file every time we want to do a little calculation, right? So if we look at the rest of this line, we're going to create a variable called smoke_complete. Then you'll see this little arrow symbol. It's good practice to put a space on both sides of it, because it is literally the less-than sign followed by a dash, and you don't want to confuse yourself into reading this dash as a negative number. In this case, it's pretty obvious that there's no such thing as a negative read_excel call, but putting spaces around that arrow is part of good coding practice. So we're going to say smoke_complete gets this entire read_excel line.

If we click this little play button, one thing you'll notice is that it no longer prints out the Excel file that got loaded. But if we look at the top right-hand panel, there is a variable that got created for us called smoke_complete, and we get a few simple statistics about what's in it. It's telling us that we have 1,152 observations, also known as rows, and 20 variables, so that is 20 columns.

So we have loaded up this data set. The next thing that we want to do is make sure that it got loaded up correctly. There is some text underneath about how we can do this, and there are a couple of ways that we can explore whether or not our data set was loaded up correctly.

You might have heard the term tidyverse. The tidyverse refers to a set of packages primarily maintained by RStudio, the company. It's a set of packages that all relate to working with data, and they all integrate very nicely with one another. Within the tidyverse, there is a package called dplyr, pronounced "dee-plier". Just like how we loaded up the readxl library to load our Excel files, we can load up the dplyr library to get a set of functions for processing data. You're going to see a bunch of red text here. Not every piece of red text that R gives you is an error. That is something that a lot of newcomers get very confused about. They say, oh, I got this. Is it an error?
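To make the load-and-assign step from a moment ago concrete, here is a minimal sketch. It assumes the workshop's data/smoke_complete.xlsx path and guards the read_excel calls so the sketch still runs where the readxl package or the file is missing. The names first_sheet, x and is_below are made up for illustration.

```r
# Path as used in the workshop project; change it for your own files.
path <- "data/smoke_complete.xlsx"

# Guarded so the sketch still runs where readxl or the file is absent.
if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  # Read the file once and keep the result under a name.
  smoke_complete <- readxl::read_excel(path)
  # One sheet at a time: sheet takes a position or a sheet name.
  first_sheet <- readxl::read_excel(path, sheet = 1)
}

# Why the spaces around the arrow matter:
x <- 3
is_below <- x < -5   # a comparison: is x less than negative five?
x<-5                 # no spaces: this is an assignment, x is now 5
```

The last two lines differ only in spacing, which is why writing the arrow with a space on each side keeps the intent obvious.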
That's one of the things I wish R would change, so that messages don't show up in red like errors, because it's not actually a problem. But you'll see down here, it is attaching the dplyr library, and you're going to see some things about functions that the dplyr library masked. For this class, and for the most part when you load up dplyr or the tidyverse, this is pretty common.

All right, so we loaded up our Excel file. One of the easiest ways to look at how this data set was loaded is this top right Environment panel. You can switch to other panels, but for the most part, we're going to be just clicking on this Environment panel. Click on the word smoke_complete or, if this is the first time you're using RStudio, click on the little spreadsheet icon. Do not click this little green arrow, that will do something else. But if you click on any other part of that line, it's going to open up the spreadsheet view of the data set that you loaded. You can see everything that's in this data set. You can scroll all the way down to the last row, see all of your columns, et cetera. So even though you're using a programming language, RStudio does a very good job of getting you a view that you're used to.

This is a really good thing to do when you first load up a data set, regardless of what data set you use, and I do this all the time as well. Especially if you are working with a data set that you've never seen before, most of the problems are going to happen within the first couple of lines of your data set. So one thing you want to make sure of is: did your columns get loaded correctly? It's very common that people take notes in the first couple of rows in an Excel file or in any data set. Then your column names are going to be those notes, and the first row of data is going to be your actual column names, et cetera. So everything will get all messed up.

I just wanted to quickly address a question that came up in the chat. Somebody asked, oh, we don't have to install the package, we can just load the library? And that is true, because when I set up the RStudio workspace, I installed all the packages that we would need. So you don't have to do that now. When you're doing this in your own space, on your own device, you will have to install packages. But just to keep the friction low, I installed all those things ahead of time, so hopefully we won't have to deal with any of that.

Yeah. And the actual message in chat shows you knew the command to install packages, but if you ever forget it, in RStudio in the bottom right-hand corner there's a Packages tab, and you can click on Install and type away. So that's also an option if you don't remember the actual command.

All right. So at least in our data set, we can now see that, first and most important, our column names look correct. They got loaded in correctly. And the first row actually looks like data. These are things that might go wrong if you're working with people who aren't using a programming language to analyze data. There are going to be weird quirks, especially in the Excel world. You're going to end up with a bunch of empty columns and then a number, because someone decided to put a one-off calculation somewhere. Or at the very bottom of the data set, there might be the sum or an average of the entire column. Those are things that you want to be very mindful of when you load in, especially, someone else's Excel file, right?
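One way to picture the stray-notes and summary-row problem is the sketch below. It uses base R's read.csv on made-up inline text as a stand-in, so it runs without any Excel file. readxl's read_excel has skip and n_max arguments that play a similar role for real spreadsheets. The file contents and the name clean are invented for illustration.

```r
# On your own machine a package is installed once, then loaded each session:
# install.packages("readxl")

# Made-up file contents: two note lines above the real header,
# and a "Total" row someone added at the bottom.
raw_text <- "Exported by the clinic
Do not share
id,cigarettes_per_day
1,2
2,0
3,5
Total,7"

# Skip the two note lines and stop before the summary row.
clean <- read.csv(text = raw_text, skip = 2, nrows = 3)

names(clean)   # the real column names
clean          # only the three rows of actual data
```

How many lines to skip and how many rows to keep depends on the file, so it is worth opening the spreadsheet view first, as described above.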
So make sure that those one-off calculations that probably happened aren't there. And this is really good advice for you as somebody who is now trying to use R as part of their workflow: don't make those one-off calculations in the data sheet. If anything, put them in another sheet so you don't load them up as part of your actual data set.

I pasted a link to this wonderful article by Kara Woo and Karl Broman about spreadsheets and good practices for formatting them. It's surprisingly entertaining, so it's a good guide.

Yeah, it's entertaining because we've all done this, and they sort of call you out on it.

And another thing, as you said: oftentimes at the beginning of your Excel spreadsheet you'll see things like comments. But it's also very common in medical data, especially if you're getting output from an enterprise data warehouse as an Excel spreadsheet, that the very end of your document will have a timestamp and "this document is confidential" or something like this. When you try to load it, you'll start to get errors at the end. So what Ted shared will save you a lot of pain.

All right. So another way, instead of actually clicking on this, if you're not using RStudio for whatever reason, or if you just want to quickly look at the first couple of values for all of your columns, is that dplyr gives you this function called glimpse. If we run this block, you'll see that it gives us a transposed version of the actual spreadsheet, so the rows and columns are flipped. This is a really good way to get a quick look, because you can scroll down instead of scrolling across. It gives you the number of rows, the number of columns, and all of the column names. Because we're in the tidyverse world, it's also giving us the type of data stored in each column. Excel does this a lot: you put in a number, for example, and all of a sudden it turns into a date. This will actually tell you what R loaded the data as. It's also really common for zip codes to be loaded in as actual numbers, and some zip codes start with a zero, so now you're missing digits in your zip code. So this is a good way to start figuring out what might be wrong with your data set. I don't think we have that problem here, but these are the types of problems that might show up.

So chr stands for character, so actual strings. And you see here that years smoked is showing up as a character. Later on when we're looking at this data set, we might ask, why did this get loaded up as a character? It could be because NA, the word, for a missing value, was showing up in our data set, or maybe somebody typed in "missing", M-I-S-S-I-N-G, and now that whole column is read in as characters. So you can get a lot of information about which columns might have problems just by looking at how the data set got loaded in. Not only that, you also get a view of a couple of values of your data set as you're going through this. And it's much easier because I can just scroll down instead of going to another tab to see my data set.

So let's take a minute. There are three questions down here. If you want, you can type the answers into chat. Just say "one" and then the answer. Using this output, you can start answering these questions. How many rows of data do we have? Does somebody have an answer? There are multiple ways you can go about looking at this. So, yes, there are 1,152 rows in our data.
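The zip code and "missing" examples mentioned above can be reproduced in a few lines. This is a small sketch with made-up values. The vectors zip_text and years_smoked are invented here and are not read from the workshop data.

```r
# Made-up values for illustration.
zip_text <- c("02139", "60611", "98101")

# Treated as numbers, the leading zero disappears.
zip_number <- as.numeric(zip_text)
print(zip_number)

# One stray word turns a whole column of numbers into character.
years_smoked <- c("20", "35", "missing", "12")
class(years_smoked)

# Converting by force gives NA for the entry that is not a number
# (R warns about this, and the warning is suppressed here).
years_numeric <- suppressWarnings(as.numeric(years_smoked))
print(years_numeric)
```

Keeping zip codes as character is the usual way to preserve the leading zero.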
There are a couple of ways you can do this. Again, you can look at the Environment panel here. You can look at the output of glimpse. One of the other comments was that there's this nrow function. So just like read_csv is a function, there's another function called nrow, for number of rows. And yes, there's an ncol for number of columns. If we put in smoke_complete, it'll give us the actual value back. What's really nice about using a programming language is that all of these things that you can visually inspect, you can also write code for. This is really great, especially if you're working with data that has a very specific structure. If you're working with county data, there's a fixed number of counties in any given location, so your data set should have a known number of rows or columns, one for each county.

So the other question is how many columns are there, and what are some of the column names? There are some solutions out there. You can get the number of columns from glimpse. You can look here in the top right corner in the Environment panel. glimpse also gives you some of the values. And if you want to, you can click on the actual object and you'll get the view of your data set, so you can get a sense of what's going on.

The other question is, can you tell what type of variable is stored in a column? There are a couple of ways. Using glimpse, you'll see it right next to the column name. If you are using a function that comes out of the tidyverse, just typing in the name of the data set itself will give you a small printout that fits the console. It will give you the number of rows, the number of columns, the column names, and underneath, the column types. This doesn't always happen. It only happens because we're working with a tidyverse version of a data frame.
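As a rough sketch of checking the shape of a data set in code rather than by eye, the block below uses a small made-up data frame called toy in place of smoke_complete. The expected row count is an assumption chosen for the example.

```r
# A small made-up data frame standing in for smoke_complete.
toy <- data.frame(
  gender = c("male", "female", "female"),
  cigarettes_per_day = c(2.0, 0.5, 3.1),
  stringsAsFactors = FALSE
)

nrow(toy)            # number of rows
ncol(toy)            # number of columns
names(toy)           # column names
sapply(toy, class)   # type stored in each column

# Turn the visual check into code: stop if the shape is not what we expect.
expected_rows <- 3
stopifnot(nrow(toy) == expected_rows)
```

With county data, expected_rows would be the known number of counties, so a file with a missing or duplicated county fails loudly instead of slipping through.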
This data set that got loaded is called a data frame object. That's the nice thing with the tidyverse: it allows you to do these quick glances at things a little bit faster, instead of typing the code for it.

All right. So we just loaded up a data set. How do we get a very quick view of what's going on in here? We've already checked that our data set was loaded in correctly. There is another function called skim that comes from a library called skimr. A lot of R libraries have the letter R in the name. R is a very punny language. So, just like loading up all of the other libraries, we load the skimr library and then we can skim our data set, as in quickly read through it. The skimr package has a function called skim. This is really good: even if you don't do anything else with your Excel data analysis, try loading the file into R and running the skim function, just to make sure that things are loaded up correctly and your data set is formatted correctly. It also gives you a lot of statistics for all of your columns, which is a little bit more complicated to do in Excel. If this is all you do, I would consider that a win as well.

One of the questions was, does it cause problems to load too many libraries? The short answer is no. There isn't a problem when you load up too many libraries. Eventually, when you start developing packages on your own, the names of the functions that you write in your own package will matter. If you remember from the very beginning, when we loaded up dplyr (I guess that text isn't there anymore), it said something like this: the following objects are masked from package stats. There's a stats library that comes with R by default, and it has its own filter and lag functions. Because we loaded up dplyr, its filter and lag now mask the stats versions. The stats functions are still there, they are just no longer the ones you get by default.
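One way to sidestep the masking question is to name the package explicitly with the package::function form. The sketch below is an illustration under that assumption. The names moving_avg, toy and older are made up, and the dplyr part only runs if dplyr is installed.

```r
# package::function names the package explicitly, so this keeps calling
# the stats version no matter which other packages are loaded.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
print(moving_avg)

# If dplyr is installed, its filter() can be called the same way,
# without library(dplyr) and without masking anything.
if (requireNamespace("dplyr", quietly = TRUE)) {
  toy <- data.frame(age = c(40, 65, 72))
  older <- dplyr::filter(toy, age > 60)
  print(older)
}
```

The two filter functions do unrelated jobs (a moving average over a series versus keeping rows of a data frame), which is why it matters which one a bare filter() call reaches.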
So in some sense, if you work in the tidy verse world because it's all maintained by our studio, and they are very conscious of the stuff that they're building, you don't have to worry about loading too many libraries from the tidy verse world. If you start loading packages from a bunch of random, like quote unquote random people, you might start seeing messages about functions overwriting one another. In that sense, you have to be more mindful, but it's not going to cause a problem. Okay, so skimming our data set. Just like using glimpse, we get some bit basic information. Here is the data set name. It's called smoke underscore complete. It's got the number of rows, number of columns. It also tells us, you know, how many of our columns are characters, like just regular strings, how many of them are numbers. This is really good, because if you do expect all of your data to be a number, like if you're reading in sensor data, and it's all just a bunch of numbers. If you end up with something that's a character that might be kind of like what is going on here. A lot of medical data sets will have a data set that it's all encoded and then a separate code book. So you might have the data set itself is supposed to be all numbers. And so this is one way you can check, like is everything a number without you manually scrolling through everything. There's a section here called group variables. We'll talk about that when Mara talks about dplyr stuff. But down here, the really nice thing, and it's a little bit more difficult to get this type of output in Excel, is for every single column in our data set. In different portions. So characters are things that are like our strings. It'll give us a couple of descriptive statistics, like how many of these character things are considered missing. We know that just by looking at the glimpse or that quick view of our data set. Right here. Some of the character has N a the string, like the characters and followed by a. 
So usually R sees NA as a missing value, but because we see it in quotes here, it is not being treated as a missing value. So in some sense, we can see this small discrepancy: this is being read in as a value called "NA", not the missing value that R understands. So we'll see that the number missing being zero for all of these is kind of suspect, because we know that there's a problem there. The completion rate: this is really a proportion. So it's saying that we have nothing missing and everything has a value. And then in here we have some other sets of descriptives. I believe this is min and max in terms of number of characters. So if you see something that's extremely long, probably someone put in a paragraph for something that's like an "other" response. Yes. And it's extremely suspect to have no missing data in medical data. Just, you know, if you look at gender and pregnancy values, you expect the males not to be pregnant, or to have an NA value there, because that doesn't apply. Right. So a lot of times, especially if you're getting this stuff out in the wild, it's very unlikely that everything is completed data. So the next block. And let me make this screen a little bit wider so this prints out a little nicer. The next block down here is a different set of summary statistics. And it's different because we have different types of data, right? Because R understood this as a number. When we have number values in our data set, we typically look for different things. So we have here age at diagnosis, days to birth, cigarettes per day. Those are usually numbers. We get the same basic things: the number of values that are missing and the completion rate. But because it's a number, we usually want to look at the mean and standard deviation of these things.
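The quoted-NA problem described above can be reproduced in a few lines of base R. This is only a sketch: the `toy` data frame and its values are hypothetical, and the counts are the same kind of number that skim reports as missing values per column.

```r
# Hypothetical toy data: one "NA" typed in as text, one real missing value
toy <- data.frame(
  gender = c("male", "female", "NA", "female"),
  cigarettes_per_day = c(2, NA, 1, 3),
  stringsAsFactors = FALSE
)

# Count true missing values per column; the string "NA" does not count
missing_per_column <- colSums(is.na(toy))

# The suspicious value only shows up if you look for the literal text
looks_missing <- sum(toy$gender == "NA")

# For numeric columns, mean and standard deviation need na.rm = TRUE
# so that the real missing value is skipped instead of poisoning the result
avg_cigs <- mean(toy$cigarettes_per_day, na.rm = TRUE)
sd_cigs <- sd(toy$cigarettes_per_day, na.rm = TRUE)
```

Here the gender column reports zero missing values even though one entry is clearly not a real answer, which is exactly the discrepancy to watch for.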
And so using skim is a great way to see stuff that you probably already want to figure out when you load up a data set. Right. And so for age at diagnosis, you'll see that the unit here clearly isn't years. It's age at diagnosis in terms of days. And so if we want to figure out how old they are in number of years, we have to do some kind of calculation, really dividing by 365.25, which Mara will show you in a little bit. And we get percentiles. So what is the lowest value, the 25th percentile, the 50th percentile, the 75th, the 100th. And at the very right-hand side, we actually get a little text histogram, just to show you the general distribution of this data. Right. And so you can see for age at diagnosis, we know it's skewed to the left. It's in the sixties, right? So the people in this data set are mostly between 60 and 80 years old. And for cigarettes per day, just looking at the mean and standard deviation, it's around two cigarettes per day. And so most of this data is going to be squashed to the left-hand side. So that's right skewed. And that is what skim allows you to do: very quickly get a view of your data set. If we go back up to our read underscore Excel call: in RStudio, if you put the cursor right before that opening parenthesis and you hit the F1 key, RStudio will open up the help page for the function that you have the cursor on. And so here, we're using the readxl library and we're using the read underscore Excel function. You can see down here that this read underscore Excel function actually has a lot of other arguments you can use to tweak it. The first thing is path, which is that data set that we loaded. But you can see here, you can say sheet equals.
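As a small worked example of the days-to-years conversion and the percentiles just described, here is a base R sketch. The ages are invented values chosen to divide evenly and are not taken from the workshop data.

```r
# Hypothetical ages at diagnosis, recorded in days
age_at_diagnosis_days <- c(21915, 24837, 27759, 18262.5, 29220)

# Convert to years; 365.25 accounts for leap years on average
age_years <- age_at_diagnosis_days / 365.25

# Minimum, 25th, 50th, 75th percentile and maximum,
# the same five numbers skim shows for a numeric column
age_quartiles <- quantile(age_years)
```

With these made-up values the ages come out to 60, 68, 76, 50 and 80 years, so the median is 68.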
And if you want to load up a different sheet, you can. There is another thing here. So na defaults to an empty string. But if we were to say na is equal to the string "NA", we would have fixed that little problem: all of these things that were characters are now actually being properly read in as NA. And that's because the Excel file itself had NA, the characters, typed in. And usually people assume if it's blank, it's missing. And you'll see this a lot in health data sets as well, where 99 gets encoded as a missing value. And so that's like: I, as the person putting together this data set, understand that this is missing. And it's different from: I, as the person, forgot to collect that data. So you'll see 99s or 88s a lot for other codes like that. You'll also see range. I believe range is how you can set where your data set actually starts. So if you have a whole bunch of metadata at the beginning of your Excel sheet or at the end, like if you get data from the CDC, the first sheet that gets loaded is just, here's the terms of use, right? So instead of loading the first sheet, you're usually loading from a different sheet. And so if you, for example, put a data set into read underscore Excel and all of a sudden there's something wrong with it, you can put your cursor right before the opening parenthesis, hit F1, and then use the parameters here to help you. And if you scroll down a little bit more, you'll see the actual definition of what each of those parameters does. So you can read that as well. So here na is a character vector of strings to interpret as missing values, right? So if I was loading up a health data set and I know that the code book says 99s are considered missing, I would also put in 99. And so when that gets loaded into R, it also gets treated as a missing value, right?
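Here is a hedged sketch of the three arguments just discussed: sheet, range, and na. To keep it runnable without the workshop spreadsheet, it assumes the readxl package is installed and reads the example workbook that ships with it. The sheet name and cell range below belong to that example file, and you would substitute your own path and values.

```r
library(readxl)

# Path to an example workbook bundled with readxl (stand-in for your own file)
path <- readxl_example("datasets.xlsx")

# List the sheet names so you know what you can ask for
sheet_names <- excel_sheets(path)

# Pick a sheet by name, read only a block of cells, and declare which
# strings should be treated as missing (blank, the text NA, and the code 99)
cars_subset <- read_excel(
  path,
  sheet = "mtcars",
  range = "A1:D5",
  na = c("", "NA", "99")
)
```

The first row of the range is used as the column names, so `A1:D5` gives four columns and four rows of data.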
And so these are different ways you can modify how the data set gets loaded. And if you can fix it at load time, it'll save you a bunch of headache, because you don't have to process it manually; the function is doing it for you as it loads. And so, yeah, I will post this little thing into, oops, into chat. Whoops, whoops, whoops. For that bit. And so, do we have any questions? So hopefully this got everyone oriented to just loading up your data set into R. And if all you do is take a data set that you have right now and just try loading it in, try it, and look at it through glimpse and skim. If you've never programmed before, I would consider that a win, because now you have a little bit more programmatic way to spot check your data set. And you get some kind of descriptive stats out of it. And then you can slowly start picking up more R skills. But getting started is usually the hardest part of all of this. So there's a question from Jason Toppin: I know you can use read CSV for Excel files. Is that a good idea? So, the answer to that is usually no. If you use a function like read underscore CSV, it's going to assume that, one, it's a plain text file, and two, it's delimited by commas. CSV stands for comma separated values. Right. So Excel files, one, technically aren't really plain text. You can open one up as a plain text thing, but it's not. And so if you try to read an Excel file using read underscore CSV, it's probably going to error out. I'm going to say that the other direction is usually okay: Excel will open up a CSV file, and you can use the text to columns feature, or most of the time Excel understands that it's a CSV file and it'll open up. But usually the other way doesn't work. So if you do end up opening up what you think is an Excel file with read underscore CSV, I'm almost certain that they just changed the extension on you.
And so you didn't really get an Excel file to begin with. Just one quick thing to mention: if you do want to use read CSV, you can convert the Excel file. You can open it within Excel, save it as a CSV file, and then use read underscore CSV to read that new version of the file. Yeah, and one other thing before we switch to the next section. If you ever forget how to load up a data set, in the environment panel there's this portion here that says import data set. It's different from the import data set up here somewhere, which I know exists but I never use. But in this portion there's a part that says import data set. So if you have an SPSS, SAS, or Excel data set and you forget the function, or you don't know the function to use, you can use this system as well. So we can say from Excel, and it'll give us a nice little pop-up, and then you can actually browse to your Excel data set, right. So we have our data Excel file, and it will give us a preview of what it will load. And this is useful because if you have multiple sheets, you don't have to rely on typing it yourself. You'll notice that right here it's going to write the code for you on the side. And so if you have a range, like, okay, from the Excel sheet let's say I want A1 to, I don't know, D5, you can do little subsets of your data set right here, and you'll see that it's going to write the code for you. And so what you can do is copy this block of code. You can also hit import as well, and it will do that for you. And then you can simply paste this into R and you have that code block for you. So that works for a bunch of other things as well. So don't feel like you have to memorize every single function that we're showing you. There are a lot of ways to get help, especially with just getting your data set into R, and that little import function is really useful. And hitting F1 is also really useful. All right. We are getting close to the top of the hour.
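For the CSV route mentioned above (save the sheet as CSV from Excel, then read it with read underscore CSV), a minimal sketch looks like this. It assumes the readr package is installed, and it writes a tiny made-up CSV to a temporary file so the example is self-contained. The column names and values are hypothetical.

```r
library(readr)

# Write a small hypothetical CSV, standing in for a file saved from Excel
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("gender,cigarettes_per_day",
    "male,2.5",
    "female,NA",
    "female,0"),
  csv_path
)

# read_csv() expects plain text with comma-separated values;
# by default it treats both empty cells and the text NA as missing
smoke_csv <- read_csv(csv_path, show_col_types = FALSE)
```

One difference worth knowing: readr's default for missing values already includes the text NA, while readxl's default is only the empty cell, which is why the Excel import earlier needed the na argument.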
So I think this would be a good time for a five-minute break. Ted, Daniel, do you have anything quick to say before people go? And then we can restart in five minutes. No, we'll take about five minutes. If you have weird setup issues, just plop them into the Zoom chat. We can help you set things up right now, and if there's a more general question, we'll answer it when we come back. Okay, so a question came up in the chat during the break: how do you get data, for example from GitHub or other sources, into RStudio? And there are tons of different ways to do it. You can do it the old-fashioned way, where you're downloading into your files and then reading it in. But then there's also a whole suite of different functions that can pull data in if you feed them a URL. There's a package I like to use called googlesheets4 that pulls stuff in from a Google Sheet that I have. So there are lots of different ways. You don't have to have everything locally stored on your device. Yeah. Okay. Yes. So before we get into the plotting, I just want to again, you know, share some words of wisdom. Can everyone see my screen, by the way? Sorry, I'm just trying to get to this. Yeah, I can see it. Okay, great. So I'll send the link out to these. This is just some slides about errors and debugging. Again, you know, we want to encourage you to keep on trying. So the number one thing is, learning R is not easy. So kudos to all of you for wanting to do this. So again, the key to learning more, and to finding the source of errors, is to not beat yourself up. The number two rule is: use Google and don't feel bad about it. Because we all use Google, because sometimes there is just a very obscure error. And, you know, I will often cut and paste that error into Google and see if I can find appropriate help. But this is just another cute
drawing from Allison Horst that just, you know, talks about the debugging process. So sometimes you really think you've got this. But there's all of these other things; again, it's kind of like that roller coaster of going up and down. Daniel covered this a little bit, but I just want to talk a little bit about understanding the difference between warnings and errors. So oftentimes, like what Daniel was showing you, you get a warning, and a warning is just an indication that the data or arguments aren't quite what the function expected. So oftentimes the code will still run, but you should definitely verify the output of the code. The difference between a warning and an error is that an error means the code can't execute at all, given what you've put into the function. So this kind of gets into why these can be very difficult to understand. So Googling is standard practice for errors. So again, I know most of you are clinicians, so we can talk about levels of evidence, right? There's an order of levels of evidence in terms of Googling as well. A lot of the time this is the order that I look in. So I go first to the RStudio Community. If I have a tidyverse question, they are great. I can usually search the forums there and find an answer. Stack Overflow is also good. That's kind of my second line of Googling. So this is a website that is basically a knowledge base. When people encounter errors, they ask questions about them, and hopefully someone has answered. So that's also a great resource. If you are doing bioinformatics, another great source is Biostars. And then the last thing I check is the package's GitHub page. So, you know, the tidyverse has great documentation.
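The difference between a warning and an error described above can be seen in a few lines of base R. This is a small sketch with made-up inputs. The handlers are only there so the example can capture the messages instead of printing them, and in everyday use you would just read them in the console.

```r
# A warning: the code still runs, but the result needs checking.
# "seven" cannot be converted, so R returns NA for it and warns.
warning_text <- NULL
converted <- withCallingHandlers(
  as.numeric(c("12", "seven")),
  warning = function(w) {
    warning_text <<- conditionMessage(w)  # keep the message for inspection
    invokeRestart("muffleWarning")        # then let the code carry on
  }
)

# An error: the code cannot run at all, so there is no result.
# Adding a number to a character string is not defined.
error_text <- tryCatch(
  "12" + 1,
  error = function(e) conditionMessage(e)
)
```

After the warning, `converted` still holds a usable vector with one NA in it. After the error, all you get back is the message, and that message is usually the text worth pasting into a search.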
And, you know, if something changed from one version to another, that's where I can find out about it. Just, you know, a plug for social coding. I think one of the hardest things when you're starting out is being vulnerable and working with other people. But I will say that it will improve your code a lot. So we all have blind spots. And, you know, if you're working with someone else, they don't have those code goggles that you might have, so they can usually find something like a misplaced parenthesis or bracket or a misspelling. And then the last thing is, usually the error you're looking for is at the bottom. There will usually be a bunch of errors generated, but the one you're interested in is usually the one at the bottom, the last one. And I think that's all I have. So I will stop sharing. And take it away, Mara. Okay. Thanks so much, Ted. So, I am going to share my screen. Can I get a verbal cue that you guys can see my screen? Yep. So we're going to start with doing some plotting. And I just want to draw your attention to a handy little feature that I didn't know about for a long time when I was using RStudio. If you see down here where my cursor is, there's this little box. And if you click on it, RStudio is interpreting different lines of your code to provide you a table of contents so that you can pop around more easily. Because sometimes with these very large documents, you really see why it's just too difficult to scroll through everything. So I'd like everybody to come down to part two, plotting our data. So we're going to take that data that we loaded with Daniel, and we're going to make some plots. So the first thing we're going to do is just produce a histogram from that smoke underscore complete using something called geom underscore histogram.
And what I want to emphasize is we're just going to make some plots right now. And then we're going to talk about the underlying structure of where these plots are coming from. So, looking down in this code chunk right here. We're going to define all these terms, but I want you to go and find the cigarettes per day variable. That's the column name we have in our data set. And I want you to put that into this underscore area, completely replacing it. The underscores are not valid code. They're just a placeholder where we want you to put in cigarettes per day. And then once you've done that, I want you to hit that run the current chunk button. And I want you to see what you get. And I cannot see the thumbs up right now, but if people could give a thumbs up if they're having success with this, and somebody give me some verbal feedback. Oh, I guess I might be able to see it in the chat. Okay, yes, great. I'm seeing some thumbs up. And I'm seeing a no. Okay, so I'll give it just a few more seconds. And then I'm going to show you my answer to it. So I'm going to go in here and I'm going to assign cigarettes per day to that x. And then I'm going to run this chunk right here. One of the questions that came up was, why didn't the plot go to plots? Could you just make sure you describe that? I mean, it's because it's in the Rmd file and it's previewing it for you in there instead of in the plots tab. Correct. Correct. And it's hard for me to describe this, because I didn't know what a code script was for the first six months I was learning R. The only way I ever interacted with R was in Rmd files. And so I was very used to this: I type my code in a code chunk, and then I see the outcome right below it. When you're interacting more with a script or in the console, you're going to see a lot more things being sent to the plots package.
Or, I'm sorry, tab, instead of being plotted out over here. Okay. Okay. Going on to the next step, and of course, keep putting your questions in the chat. Great. So we're looking at this plot right now, but there's a few issues with it. I'm actually going to try and make this smaller. I made it big to make it easy to see, but it's almost too big. So just a few issues. The axis titles are, I mean, they're fine, but they're not super pretty. And there's no plot title. So let's try and make our graph a little bit more descriptive. And so we're going to work on just putting some titles into different things. And then we're also going to introduce something called a theme function and see the outcome. So I want you all to go down to the code chunk called beautify plot two. And I want you to put in any title you want for the title. And then I want you to take theme underscore classic and plop it into this bottom underscore there. So I'll give folks a minute here. I'm pretty impatient. And when you're teaching, sometimes you think you've waited five minutes, but it's been about five seconds. And definitely give a thumbs up if you're getting it, or let me know in the reactions if you're not. And once you've put in all those things, go ahead and run that chunk again. I'll give you about 20 more seconds. And then I'll put it in with what I had. Okay. So I'm just going to say, this is a great title, and a fantastic x-axis title. And I'm just going to run it then. And so you're seeing here where I put my title. I'm getting a title here. I'm getting my y-axis title over here. And a really nice fantastic x-axis title. And just looking back, we haven't changed anything about how we're displaying the histogram data itself.
It's just other components of the plot. So you can see here with this theme underscore classic, there's a lot more white space. And there's all sorts of built-in themes. And you can even create your own themes for your organization. You know, like The Economist uses R, and I believe they have their own package, and there are different universities that have one too, with all their fonts and colors and everything like that. So, going on. You're going to see a message, not an error, a message, from the code that says stat underscore bin using bins equals 30. And you'll see that here in this red text, which again is not an error. It's just a message. I think Daniel and Ted both talked about that. And, Daniel, correct me if I'm wrong, but I can't remember where this comes from, if this is just the default or if it's something about the data that ggplot is using. It is the default. Which I am not a big fan of, to be honest, but that's what it does. Yep. So we want to play around with that a little bit to get a different number of bins. Sometimes it's easy to get confused, because sometimes people refer to bin width, which is how wide a range of values is captured in each bar of the histogram, and then there's bins, which is the number of bars that you have here. So let us decide that we want our bins to be something else. So in here, in this underscore, go ahead and put a value in there and then run your code chunk and see what it looks like. And while everybody's doing that, I'll just make a note. You'll see this little red circle with an X through it. This is the RStudio interface telling you that it feels like something is incorrect right there. And so it's kind of guiding you to where you might want to make changes. So are people having success with that?
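Pulling the last few steps together (a histogram, then titles, then a theme, then a chosen number of bins), a sketch could look like the following. It assumes ggplot2 is installed. The `smoke_toy` data frame is randomly generated stand-in data, not the workshop's smoke_complete object, and the titles and the value of 10 bins are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical right-skewed stand-in for the cigarettes_per_day column
set.seed(1)
smoke_toy <- data.frame(cigarettes_per_day = rexp(200, rate = 0.5))

p_hist <- ggplot(data = smoke_toy) +
  # bins sets how many bars to draw; leaving it out triggers the
  # "stat_bin() using bins = 30" message and uses 30
  geom_histogram(mapping = aes(x = cigarettes_per_day), bins = 10) +
  # labs() changes the text around the plot, not the histogram itself
  labs(
    title = "This is a great title",
    x = "Fantastic x-axis title",
    y = "Number of patients"
  ) +
  # a built-in theme with a plain white background
  theme_classic()

# print(p_hist) draws the plot
```

If you would rather think in terms of how wide each bar is, `binwidth` can be given instead of `bins`.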
I'm going to put a value in there and run the code. And you see how easy it is to interact with your data and very quickly change your plot, without all the difficulties of creating plots in something like Excel. There it might feel like, oh, I have a lot of control over things, but it ends up becoming a very manual process. And one of the great things about creating plots with ggplot is that you can make complicated things very quickly by reusing code. Or you can make big changes with tiny tweaks of the code. Yep. Just as a quick note: we're publishing a paper right now, and I generated all of the plots using ggplot code. And I had a lot of different types of plots. Our collaborators wanted all of these things fixed. So I was able to give my students the ggplot code, and they were able to easily modify it and format it the way our collaborators wanted for our figures, which is not trivial when you are doing it for 24 figures. So there's one more thing we're going to do here before we pull the curtain back a little bit on ggplot to understand what's going on. For here, I want you to put the variable gender down in this facet underscore grid area, and then run the code. Oh, great. That's a great question: what does the first line of the code chunk mean? So the curly brackets, and then the r, tells the computer that we're running a code chunk in R. We're not going to talk about it here, but there are other options in RStudio where you might be running different languages, so you can use Python or SQL or Bash in here. And so that's just the notation that says we're using R. Then, where it says facet right now, that is me giving a name to the code chunk.
And so, there's some minor, not super important things right now, but when you name it, that's what creates this code chunk name down here so you can easily jump around. Otherwise it just gets a generic chunk 17 and wouldn't have a name right here. I recently learned that I shouldn't be using underscores in them. You can use dashes. That matters if you're going to be cross referencing different things, or if you're writing a paper all in R and you want to refer to a figure or a plot. There are these wonderful other packages where you can just say, refer to facet, and it will auto label whatever figure that is and things like that, so really fancy options. And so we're going to put gender down in here and run that. And you're going to see here that we have now faceted our plot, pulling out the data for females and males with two plots next to each other, all the females, all the males. And this is possible for any variable where it would be a reasonable thing to do to understand the differences between groups. And you've probably seen this a ton in different journal articles. Really, once you've seen ggplot, you'll start seeing it everywhere in scientific articles. Okay. It is almost 4:30 right now, and we're going to go on to ggplot2, talking about the fundamentals of this package. I think Ted has to head out in a little bit. So I was wondering if any of the instructors or TAs want to mention anything before we start on the next section. Now, unfortunately, we had a scheduling snafu, so I do have to drop off. I will be back in an hour. So anyways, it was good to see all of you, and I will see you later. Thanks. I would like to add: another way, besides at the bottom of your screen, to jump to code is you can use the little lines up near the blue eye.
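The faceting step described above (one panel for females, one for males) can be sketched like this, assuming ggplot2 is installed. The data are randomly generated stand-ins. The workshop's exact faceting call is not shown in the recording, so `facet_grid(cols = vars(gender))` is one plausible way to write it rather than a copy of the original chunk.

```r
library(ggplot2)

# Hypothetical stand-in data with a grouping column
set.seed(2)
smoke_toy <- data.frame(
  gender = rep(c("female", "male"), each = 50),
  cigarettes_per_day = rexp(100, rate = 0.5)
)

# In an R Markdown file this would sit in a chunk whose first line
# is {r facet}: r is the language, facet is the chunk name
p_facet <- ggplot(data = smoke_toy) +
  geom_histogram(mapping = aes(x = cigarettes_per_day), bins = 10) +
  # one panel per value of gender, laid out side by side as columns
  facet_grid(cols = vars(gender))

# print(p_facet) draws the two panels
```

Swapping `cols` for `rows` stacks the panels vertically, which can make it easier to compare distributions along a shared x axis.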
Would you be willing to show where that is? Yeah, you can open the little lines right next to the table of contents. Yeah, looks like an eye. Yeah, so if you click on that, that's another way to jump to anywhere in your code. Yeah, and thank you so much. Was that Molly? It's hard for me to see. I don't know if it was Molly who said that. But another nice thing when you're in the RStudio IDE, and IDE just stands for integrated development environment, if I remember right. Yes, that's right. Yeah, and that's just a fancy way of saying the graphical interface for interacting with everything. Often, if you hover over things, it will tell you what it does and also the keyboard shortcut for it. And I must admit I was not a big keyboard shortcut person until I started coding a lot, and my life is so much better. And if you do a Google search for just ggplot cheat sheets or RStudio cheat sheets, RStudio puts out a big collection of cheat sheets for very popular or big packages that I use all the time. I actually think a whole bunch of them just got a huge facelift this summer, and they're super useful after you've gotten a little bit familiar with the package, to jog your memory about stuff. So, going on. Now that we've made a couple of plots, I want to talk a little bit about what ggplot stands for. So ggplot is a library which is actually called ggplot2, that is the package name, and the gg stands for grammar of graphics. And why this is so important is that it's really breaking down graphics into very specific components, constituents, to make it easy to create graphics that can be understood across packages. And this is just an amazing improvement on things before, where everybody kind of had their own package that did some graphics stuff but wasn't compatible with other graphics.
While ggplot2 is a great package, there are tons of other packages that build on ggplot2 and are compatible with ggplot2 to make all sorts of different things using this consistent way of describing things. So, looking down at this graphic here, sorry, it's a little hard for me to read up here and look down at the graphic. We're going to talk about some of these different specific parts of graphics and learn a little bit more about how to make them. And then just as a note, there are a lot of different ways that you could write R code or ggplot code. There are some very formal ways and some more casual ways. And we're going to focus a little bit more on the formal style, just because we expect that for a lot of you this is the first time, or you haven't seen it very often. So it's a little bit more wordy than it has to be, just to try and be very clear about what is going on, but as you become more familiar, you're going to start shortening it up. And yeah, Daniel just made a good point in the chat. ggplot2 is the name of the package, the library that we have loaded, and ggplot, which you've seen here, is the function that starts making the graphical object. Okay. So, we're going to break this big chunk of code down to understand the component parts. And then we're going to build on from that. So this is a full example with tons of these comments that you might remember Daniel talking about: this hash and then the text after it that is marked out in green. These are comments. They are not run as code, but they're there to help guide you if you're leaving notes to yourself, future you, or other people who are using your code. And so right here, in the very first part of our function, ggplot, we're telling the function: I have data, and that data is called smoke underscore complete. You know that's still loaded up over here; we did that earlier, when we created that object from our Excel spreadsheet.
And then I'm telling ggplot: I want you to do this mapping in my data. Look at the column age at diagnosis and assign that to the x axis. Then I want you to take cigarettes per day and assign that to the y axis. I want you to take the variable disease, it's another column we have, and base the color of the plot on that. And then I'm telling it: now that I've told you how to map this data, I want you to make a geom underscore point. I think a word that's more familiar to a lot of people for this is a scatter plot. And that is taking the mapping and adding what's called the geometry. The alpha is not very important, but we'll talk about that a little bit later. Something that we're familiar with, that we did up in the prior exercises, is giving the titles of the plot, of the x axis, of the y. Here I'm saying that my legend for color is disease type. And then remember how we faceted it. So we're going to facet it based on gender. And then we're going to give it this theme. Okay, so go ahead and run that code chunk. You wouldn't have had to do anything for the code there, but we're now going to walk through creating this plot step by step, to see how each of those steps builds up into the final plot. Any questions at this point? Oh yes, great point, Daniel. It's easy to miss, but you'll see this little plus sign. It's just a little bit more cluttered here because of all the comments that I put in afterwards. So this is how we link the different parts together to get to the final thing. It'll be a little bit clearer down here, as we step through. So, yeah, chaining everything together with that plus sign. And Daniel, jump in please too. I feel like that's a very ggplot thing; I don't really see that in a lot of other places. Um, so that plus thing is part of how R interprets the code: when it sees the plus, it knows to look on the next line. So it's the same thing as doing three plus two.
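Since the full example is only described out loud, here is a hedged reconstruction of what such a chunk could look like, assuming ggplot2 is installed. The column names follow the ones mentioned in the session. The data values, the disease labels, the titles, and the alpha of 0.5 are all invented for illustration, and the workshop's own chunk may differ in its details.

```r
library(ggplot2)

# Hypothetical stand-in for the smoke_complete data frame
set.seed(3)
n <- 60
smoke_toy <- data.frame(
  age_at_diagnosis = runif(n, min = 15000, max = 30000),  # in days
  cigarettes_per_day = rexp(n, rate = 0.5),
  disease = rep(c("disease_a", "disease_b", "disease_c"), length.out = n),
  gender = rep(c("female", "male"), each = n / 2)
)

p_full <- ggplot(
  data = smoke_toy,                  # which data frame to use
  mapping = aes(
    x = age_at_diagnosis,            # column mapped to the x axis
    y = cigarettes_per_day,          # column mapped to the y axis
    color = disease                  # column that decides point color
  )
) +
  geom_point(alpha = 0.5) +          # geometry: a scatter plot
  labs(
    title = "Cigarettes per day by age at diagnosis",
    x = "Age at diagnosis (days)",
    y = "Cigarettes per day",
    color = "Disease type"           # legend title for the color mapping
  ) +
  facet_grid(cols = vars(gender)) +  # one panel per gender
  theme_classic()                    # plain white background

# print(p_full) draws the plot
```

Each line ends with a plus sign, which tells R that the expression continues on the next line. Put the plus at the start of the next line instead and R treats the first line as finished.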
Yeah, yeah, totally, totally. Yeah, that's true, because if you're typing in the console and you do, like, one plus, the console spits a plus back at you, like, plus what? It's expecting more code. Okay, great. So now we're going to start the process of rebuilding that prior plot. So, first step: I want you to put smoke underscore complete in that space to tell ggplot that the data is that smoke underscore complete object. As soon as you do that, run the code chunk and give a thumbs up if you don't get an error. I'm putting it in there. Kind of a boring output; well, there's no graph there. That's because we've kind of just initiated a plot, like getting ready to make a plot. We haven't told ggplot what goes to x, what goes to y, how we want that plotted, or anything like that. And I saw it in the chat: what is aesthetics? I don't want to belabor things too much, but just think of it as something you look at on the plot. It's basically telling it: map this variable to this characteristic. So I want something to go to the x, I want something to go to the y, and I want something to go to color. So, what I want you all to do is assign age at diagnosis to the x variable right here, cigarettes per day to the y variable right here, and then disease to color. And once you've done that, I want you to run the code chunk. And if folks wouldn't mind giving a thumbs up as they're getting there, I can use that nonverbal feedback to pace. I love all these new different reactions. I'm going to start filling it in. Yeah, that's a really nice way to think of it: assigning your data is like getting your graph paper out to start everything. So I'm going to run this code chunk. Okay, so I have something now. But I don't see my data. And that's because I haven't told ggplot how I want my data to be plotted, in a sense.
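The first two build-up steps (data only, then data plus aesthetic mappings) can be sketched as follows, assuming ggplot2 is installed. The `smoke_toy` data frame is a hypothetical stand-in with made-up values.

```r
library(ggplot2)

# Hypothetical stand-in for the smoke_complete data frame
smoke_toy <- data.frame(
  age_at_diagnosis = c(20000, 24000, 27000, 29000),  # in days
  cigarettes_per_day = c(1, 3, 2, 5),
  disease = c("disease_a", "disease_b", "disease_a", "disease_b")
)

# Step 1: only the data. Printing this gives an empty gray canvas.
p_step1 <- ggplot(data = smoke_toy)

# Step 2: data plus mappings. Printing this adds axes and scales,
# but still no points, because no geometry has been added yet.
p_step2 <- ggplot(
  data = smoke_toy,
  mapping = aes(
    x = age_at_diagnosis,
    y = cigarettes_per_day,
    color = disease
  )
)
```

Both objects are valid plots that simply have no layers yet, which is why nothing from the data is drawn until a geom is added.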
So imagine, maybe I want a line graph or I want a bar chart. I might have given the ggplot function the names of the variables I want, but I haven't told it how to lay it all out yet. So in our next step here, we're going to add to the squiggly spot right here, the underscore spot. I'm going to put in geom_point and then run that code chunk, and when you see that output, give me some thumbs up or checks. I haven't seen any thumbs up yet, so I'm just going to give it a few more seconds for folks. Okay, everybody's given thumbs up now. Great. Somebody asked a really nice question: why don't we fill in anything in the parentheses of geom_point? That is because ggplot has rules for how it interprets things. Unless you specify something in this area, things from the prior lines get inherited by the next function here. So even though I have assigned color here to disease, maybe for some reason I actually don't really care about disease. I don't want my data to be differentiated by color at that point. So I could say, oh gosh, this is going to be embarrassing, I don't know if I know any hex codes, but maybe I can just... Live coding, guys. You can see here that previously I told it to color things by disease, and I think there are three different diseases. But then in the next line of code, I'm saying forget about that, I just want everything red. So it ignored the prior layer because I specified something else, but if I get rid of that, then it goes back to how it was previously. Okay, so this graph is looking really great. Yes, that's a good answer, Molly. Yeah, ggplot doesn't want us to have to say things again and again. It just fills in the blanks how it thinks it should. So, looking at this plot right here, I want you to go back and look at that full example plot.
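A small sketch of the inheritance behavior just demonstrated, using made-up data (the `smoke_toy` values and snake_case column names are assumptions): a bare `geom_point()` inherits the color mapping from `ggplot()`, while a fixed `color = "red"` inside the geom overrides it for that layer.

```r
library(ggplot2)

# Made-up stand-in for the workshop data; names and values are assumptions.
smoke_toy <- data.frame(
  age_at_diagnosis   = c(20000, 24000, 26000, 21000, 27500, 23000),
  cigarettes_per_day = c(1.5, 3.0, 2.2, 4.1, 0.8, 2.7),
  disease            = c("A", "B", "C", "A", "B", "C")
)

base_plot <- ggplot(
  smoke_toy,
  aes(x = age_at_diagnosis, y = cigarettes_per_day, color = disease)
)

# Empty parentheses: the layer inherits x, y, and color from ggplot().
by_disease <- base_plot + geom_point()

# A constant color set in the geom replaces the inherited color mapping.
all_red <- base_plot + geom_point(color = "red")

print(by_disease)
print(all_red)
```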
And see if you notice something different, other than the faceting, which obviously jumps out at you. And if people notice something that's different, throw it in the chat. I'm curious if it's noticeable. Yeah, great. So here you can see that we're running into some overplotting issues. Because of where our points are and how many data points we have, everything is kind of pasted on top of each other, so it's hard to see what's going on here. And this can actually be very dangerous in some data visualizations. Somebody's like, oh my gosh, look at all these blue points in this area, but it can just be that that layer was laid down on top of the red, so it's harder to see the red than it is to see the blue. So we're going to play around with something that can help. It's not a total solution at all. We're going to specify the alpha argument. What the alpha argument is, is the opacity of the point, and it takes a value from zero to one. So I want you to try a few different values in this space right here, from zero to one. I believe if you put in something out of that range, it's just going to give you an error. And then once you've done that, give me a thumbs up. Great, we're getting fast here. I'm going to try an alpha value of 0.2. And you can see that things are less opaque. But you can still see that there's this problem where it looks like the red data was laid down first and the blue on top of it, so we're obscuring some of it, and that's where you can get in danger. And so now I'd like people to try to put some informative labels on our graph. A big thing to note here is, if I run this right here, I'm actually getting those underscores, because R is just interpreting this as a string that I want to call my plot: so many underscores. The quotation marks are what tell it that that's a string. And if I try to leave the quotation marks off,
RStudio is then saying, whoa, whoa, whoa, that's not something I expected to be there, and it's warning you of, like, a pre-error. While we're here: when you're in this situation of building lots and lots of plots, there's this really nice feature. Let's say I know I want all my plots to be called something, and I want to reuse that title over and over again, or I want to reuse that x axis title. Now that I've assigned these to an object, I can actually plop them down here and it's going to use them as well. So thumbs up if people were able to get titles as they expected. Perfect. Okay, great. So as we've been going through, you've been seeing that we've been rewriting everything. But let's say you already know how you want things to be mapped, you know your alpha, but you're playing around with a few other things. You can assign that to a new object. I called it plot_5. Then you have that object in your environment over here, and you can just continue to build onto it and modify it. So you can totally imagine that maybe in your paper you want this plot somewhere, but then later on you want that faceted graph. Well, you don't have to rewrite all of this code. You can actually just put in that extra line of facet_grid, and you will be able to reuse all of that work. And that's a little bit of what Ted was probably talking about: once you've written these things, modifying them just becomes a lot easier, en masse. Okay, so in this underscore here, I want folks to put in the variable gender and then run the code chunk. And then throw those thumbs up if people are having success. I don't know if those were some tears I saw, or hopefully just a mis-press there. Great. Okay, so I'm going to throw this in here and run it. And so now we have faceted things.
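The last few steps (alpha, reusable label objects, saving the plot, then adding facets) might look something like this sketch. The data, the label text, the object names other than `plot_5`, and the exact `facet_grid()` form are assumptions for illustration.

```r
library(ggplot2)

# Made-up stand-in for the workshop data; names and values are assumptions.
smoke_toy <- data.frame(
  age_at_diagnosis   = c(20000, 24000, 26000, 21000, 27500, 23000),
  cigarettes_per_day = c(1.5, 3.0, 2.2, 4.1, 0.8, 2.7),
  disease            = c("A", "B", "C", "A", "B", "C"),
  gender             = c("male", "female", "female", "male", "male", "female")
)

# Labels stored once as strings (note the quotation marks) and reused.
plot_title <- "Cigarettes per day by age at diagnosis"
x_label    <- "Age at diagnosis (days)"

# Save the plot as an object so later steps can build on it.
plot_5 <- ggplot(
  smoke_toy,
  aes(x = age_at_diagnosis, y = cigarettes_per_day, color = disease)
) +
  geom_point(alpha = 0.2) +   # 0 is fully transparent, 1 is fully opaque
  labs(title = plot_title, x = x_label, y = "Cigarettes per day")

# Reuse all of that work and add only the faceting line.
plot_5_faceted <- plot_5 + facet_grid(cols = vars(gender))

print(plot_5_faceted)
```

Because `plot_5` is left untouched, the unfaceted and faceted versions can both be used later without retyping the shared code.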
You can see how what's up here is still carrying through down here. Okay, great. Now we're going to play around with themes, and then we're going to take a break. So I want you to take some of these different themes and plop them onto your plot_5 to see how things look different. You can vote for your favorite look in the chat. Oh, I see I have an extra parenthesis there. I don't know if that was in your exercise document as well, but make sure you delete that. So I'm going to pick theme_dark and theme_void, and we'll see what those look like. So there's theme_dark, and here's theme_void, which kind of just takes everything away. theme_excel? Yeah, I actually had that in from ggthemes, Danielle, but I had something break and so I stopped. So you can see there are a lot of themes that come out of the box with ggplot2. You can make a theme, as I mentioned, or you can use other people's. Your organization might have a theme that you can use. A lot of journals and news organizations will have a theme, or people like that look and then they'll make a similar theme. So there's a lovely package called ggthemes, where the gg indicates that it's compatible with ggplot2 and its grammar of graphics. And you can see these plots made in different styles. So if you've ever seen The Economist, you'll notice this distinct light blue. There's the Wall Street Journal style, which, I guess I just haven't looked at a physical newspaper in a long time. And then the Tufte style as well. Okay, it is definitely time for a break. So we'll start up again at five o'clock. I'm going to pause the recording right now. Welcome back from the break, everybody. We're moving on to the next part of the workshop, wrangling data. Before I start sharing my screen, I just wanted to talk a little bit about the fundamental reason why you'd want to use R for a lot of this work with medical data.
There are a lot of things you can do with other software. You can do a lot of stuff in Excel, but the reproducibility, being able to redo your analysis, is just outstanding in R. Whereas if you're doing things in Excel, you can easily forget that you did something. I'm working on a data set right now of 150,000 different clinical notes that have both raw text and structured data. And if I miscode something in that mapping, I can find it. I have that raw data, and I can just rerun my entire cleaning, analysis, and graphs from it without having to do any horrible manual digging to find that problem. If you're still struggling with why does it matter, why do I have to do this, please reach out. I'm so happy to talk and tell you about all the different horror stories I've seen and why this is the way to go if you're doing research in the medical field. Okay. I'm going to start sharing my screen again. So one of the issues is that we gave you this beautiful data set, smoke_complete, and it looks really nice. But rarely is that the case, right? You'll get data and it's all sorts of messy. Maybe there are 100 columns and you need three of those columns. Maybe it has many rows and you only care about, you know, patients between the ages of two years and four years, and so that would be a very small chunk of that. So we're going to talk a little bit about the different ways you might process the data that you're going to get. Just as a quick aside, there are a lot of different ways to manipulate and wrangle data. What I'm going to be showing here is a lot of tidyverse. It's a whole ecosystem of R packages that are organized around this idea of tidy data. And if one of the TAs or Daniel could find that wonderful paper from Hadley Wickham on tidy data and put it in the chat, I'd much appreciate that.
That was actually what got me into learning how to program: having these issues all the time where I was trying to deal with data that just wasn't organized in a way that lent itself to analysis. And when I started to understand the concepts of tidying my data, it brought together a lot of different things, like that I should think about how I collect data so that I have my analysis in mind, what I want to do. We don't always have that convenience, right? Sometimes it's just the data that you're given and you have to work with it. But if you're in the position where you're thinking, I want to collect this data, I want to do this project, you can think about what shape that data will be in and how you're going to do your analysis on it later, which I think is a great mental exercise in planning your project and really thinking about your kind of analysis and your kind of collection. Okay. So we're going to select just a subset of columns from the smoke_complete data set. So, if you guys could look in here. Oh gosh. Can you see my screen? I just don't see the normal green ring to make sure I'm sharing the right thing. Yes? Okay, good. I want you to take the variable names gender and days to death, put those into the underscore places, and then run that and start to throw those thumbs up when you're successful. So I'm going to do that. And order does matter, and we'll talk about that in a little bit. So I'm telling R: I want you to use the function select on the data that's called smoke_complete. I want you to take the columns called gender and days to death, and I want you to show me them. Sorry, oops. How did that... Oh, interesting. So these are the columns that I want there. And the next thing we're going to show you is the pipe.
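A minimal sketch of the `select()` call described above, on a made-up data frame (`smoke_toy`, its values, and the snake_case column names are assumptions, not the workshop data):

```r
library(dplyr)

# Made-up stand-in for the workshop data; names and values are assumptions.
smoke_toy <- data.frame(
  bcr_patient_barcode = c("PT-001", "PT-002", "PT-003"),
  gender              = c("male", "female", "female"),
  year_of_birth       = c(1950, 1962, 1941),
  days_to_death       = c(150, 1200, 640)
)

# First argument is the data; the rest are the columns to keep.
selected <- select(smoke_toy, gender, days_to_death)

# Order matters: the columns come back in the order they are listed.
reordered <- select(smoke_toy, days_to_death, gender)

print(selected)
print(reordered)
```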
So I'm going to show a different way to write this that gives you the same end product, but it allows you to string together chains of functions to really put advanced wrangling, you know, many steps, together. And there are some speed and memory issues that come up as well, but they're not important here. So I want you to run this chunk of code. There shouldn't be anything you have to add. So we run that, and you're going to see that it looks the same. And so what the pipe is, it's from this package that I never say out loud because I can never say it correctly. Dan, do you want to jump in and say it correctly? magrittr. We mentioned that R is, like, a funny language, but yeah, this package, which we loaded earlier, says that this symbol does this special thing where we give it data. And then this is functionally saying, and then, of that data, select the gender and days to death columns. And so it does that, but then you could string on more and more functions and get ever more complicated results. So select is picking columns. And now we're going to meet filter. So stepping back, I mentioned why we're wrangling that data. Imagine you've gotten a data set with, you know, 100 columns, and you only want two of them. So we've used select to get the columns, and filter is filtering your rows based on some condition. So here I'm taking filter and I'm saying: the data is the smoke_complete data, and I want you to look at the column bcr_patient_barcode. And if inside the cell, you know, let me show the object we have here. Of course, it must be really... So what we're saying is, if the value in this cell is equal to something, I want you to return those rows where that condition's true. Okay. And so you can imagine this for all sorts of things.
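Here is a small sketch of the two ideas just introduced: the pipe giving the same result as the nested call, and `filter()` keeping rows where a condition is true. The data and the barcode values are invented for illustration.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

# Made-up stand-in for the workshop data; names and values are assumptions.
smoke_toy <- data.frame(
  bcr_patient_barcode = c("PT-001", "PT-002", "PT-003"),
  gender              = c("male", "female", "female"),
  days_to_death       = c(150, 1200, 640)
)

# Without the pipe: the data is the first argument of select().
nested <- select(smoke_toy, gender, days_to_death)

# With the pipe: read %>% as "and then".
piped <- smoke_toy %>% select(gender, days_to_death)

# filter() keeps the rows where the condition is TRUE.
# Note the double equals sign for "is equal to".
one_patient <- filter(smoke_toy, bcr_patient_barcode == "PT-002")

print(identical(nested, piped))
print(one_patient)
```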
So this just seems to be a single person's label, and we're getting those rows back, but you can imagine any sort of condition, like, I want everybody whose days to death is less than 200 or greater than 1000. And you can do many different things. Let's say you only want people with this stage of tumor who have lived more than 70 days. So there's lots of complexity that you can add there. And then the alternate way of doing that with the pipe is that you take your data and then you're piping, so you're telling R, here's my data, and then I want you to take filter and filter that column on being equal to that. And we're getting the same information back. I'm going to pause there because that might seem simple, but honestly select and filter are huge workhorses of data wrangling. And when you're first meeting them, it can sometimes be hard to remember what select does versus filter, and filter admittedly has some ambiguity: am I filtering in or am I filtering out? So I usually think of it as keeping the rows that meet this condition. Daniel, any thoughts? Yeah, that's about right. You can read it as: smoke_complete, then filter the rows such that bcr_patient_barcode is equal to TCGA 18, whatever. Yeah. Okay, so we introduced the pipe a little bit earlier, and you might think, well, why is it important? If I have the other select function, why would I need to use the pipe? It saves you a lot in these intermediate steps. Let's say you have that huge file of 100 rows and 30,000, I'm sorry, 100 columns and 30,000 rows. You might not want to keep making different intermediate copies of everything. You don't want to select and then filter separately. You just want to do it all in one step. So that's what we're going to do here. We're saying, okay, smoke_complete is our data. And then I want to filter.
And to only return rows where year of birth is less than or equal to 1960. And then I want to select age at diagnosis and gender, to return just those two columns. So that's the chaining of multiple functions on some data. We're going to shift a little bit into talking about a function called mutate, so I really encourage you right now, if you have questions about filter or select, to put those in the chat. So the next function I'm going to talk about is called mutate. I think this came up earlier when Daniel was talking, when we were looking at the glimpse or the skim of age at diagnosis. It's this huge number, like 24,000 or something like that. And that's just not how my brain works for thinking about how old somebody is. When I was in residency in the NICU, yeah, with the totally young, you know, this is a 14-day-old infant. But I think once you're getting out of the month phase, it's not a great way to keep the age of somebody for easy readability. So let's say we want to make a table, and we want to have age at diagnosis, but in years. So we're going to use the mutate function to create a new column in our data. And then we're also going to round that new column to give us years that look friendly and human-readable, you know, not 15 decimal places. So let's run that code chunk. We're going to walk through this. So I'm creating a new table and assigning it. And so I say: I want you to take the smoke_complete data, and then I want you to mutate. And the first argument that goes here into mutate is the name of that new column that you're creating. In this case, it's age at diagnosis years. I'm trying to give a friendly column name, so that it's easy to know what's there.
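The filter conditions and the filter-then-select chain described above can be sketched like this. Again, `smoke_toy`, its values, and the snake_case column names are assumptions for illustration.

```r
library(dplyr)

# Made-up stand-in for the workshop data; names and values are assumptions.
smoke_toy <- data.frame(
  gender           = c("male", "female", "female", "male", "male", "female"),
  year_of_birth    = c(1950, 1962, 1941, 1960, 1975, 1938),
  age_at_diagnosis = c(24000, 18500, 27000, 20100, 15000, 28800),
  days_to_death    = c(150, 1200, 640, 90, 2100, 410)
)

# "Less than 200 OR greater than 1000": the | symbol means "or".
short_or_long <- smoke_toy %>%
  filter(days_to_death < 200 | days_to_death > 1000)

# Chain two steps: keep rows first, then keep columns.
# No intermediate copy of the data needs to be saved in between.
born_1960_or_earlier <- smoke_toy %>%
  filter(year_of_birth <= 1960) %>%
  select(age_at_diagnosis, gender)

print(short_or_long)
print(born_1960_or_earlier)
```

Filtering before selecting matters here: `year_of_birth` is not one of the selected columns, so it has to be used while it is still in the data.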
And what that is equal to is age at diagnosis, which we currently have stored as days, and we're going to divide it by 365.25 days. Then I'm going to create another column, so I'm creating two new columns here: age at diagnosis years rounded. And I'm using a function in here, round, to create this variable from the other column I just created. In the round function, this number here just indicates how many places after the decimal point you want to display. And then I'm selecting just those three columns from my data, and then I'm showing that table one. So there's a lot of stuff that I just put down there. Give it a pause. Somebody asked in the chat, why not do it in one step? And you can totally do that. I could go in here and put the round function right there. But maybe you need this level of precision for some other work. But absolutely, you could do this in one step. And just to demonstrate that I can do this: I don't have this column anymore. So there we go. Same thing. Yes, exactly, to Daniel's comment there: when you're doing these analyses and you come back to them four months later and you're like, gosh, I can't even remember why I created that or why I did all those things, it can be very helpful to be explicit about what is happening. Okay. So we only have 40 minutes left. We're getting pretty close. So I'm going to pass it over to Daniel unless there are any questions right now. Oh, Daniel, I can't hear you. Hello. Okay. I was just getting things set up. All right.
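A sketch of the `mutate()` step walked through above, in both the two-step and one-step forms. The data is made up, and the object name `table_1`, the snake_case column names, and the choice of one decimal place are assumptions.

```r
library(dplyr)

# Made-up stand-in for the workshop data; names and values are assumptions.
smoke_toy <- data.frame(
  gender           = c("male", "female", "female"),
  age_at_diagnosis = c(24000, 18500, 27000)   # stored in days
)

# Two steps: convert days to years, then round the new column.
# A column created in mutate() can be used later in the same call.
table_1 <- smoke_toy %>%
  mutate(
    age_at_diagnosis_years = age_at_diagnosis / 365.25,
    age_at_diagnosis_years_rounded = round(age_at_diagnosis_years, 1)
  ) %>%
  select(age_at_diagnosis, age_at_diagnosis_years,
         age_at_diagnosis_years_rounded)

# One step: wrap the division in round() and skip the unrounded column.
table_1_one_step <- smoke_toy %>%
  mutate(age_at_diagnosis_years = round(age_at_diagnosis / 365.25, 1)) %>%
  select(age_at_diagnosis, age_at_diagnosis_years)

print(table_1)
print(table_1_one_step)
```

The two-step version is longer but keeps the full-precision years around and makes each transformation explicit, which is the readability point made above.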
Select, filter, and mutate are going to be used a lot in any type of data processing. So we sort of just went through a small, quick example of processing our data and doing some kind of calculation with it. Yes, there are many more things that you can do with the tidyverse. Another super common thing would be recoding variables if you need to. So I think this course today is more just trying to demystify the programming aspect a little bit and show you certain things that you might find useful. But there are a lot more things that we won't have time to show you today. So the next part that we're going to cover is getting frequency counts. I know in a lot of public health contexts, two by two tables are super popular. Almost everything runs off of two by two tables, especially if you take an intro class. So let's get ourselves reoriented on looking at different views of our data. Let's go back to the very beginning, when we first loaded up our smoke_complete data and skimmed the data set. This was a great way to get, for each different type of variable, the number missing, and for the numeric ones you actually get some descriptive statistics out of it. So another thing: when you are creating two by two tables, the body of the table is going to be frequency counts. And frequency counts really only make sense when you're working with some kind of categorical data. Right, so yes, you can count age if it's a whole number in your data set, but if it's some blood reading, like a CDC count, you're not really going to count the different unique values in your data set.
That's when you're treating it more as a continuous variable and not as a discrete variable. So how do we create frequency counts of things? There is a function in the tidyverse called count. What you give count is the column that you want to count. So we can take our data set and then use the pipe symbol: smoke_complete, then pass that data set into count, and then give count the column you want to count. Vital status is probably something we care about in our data set. So that's another thing: count things that make sense for the question you're trying to answer. For vital status, you see here how many are dead. And another thing, especially if you're in the two by two world or the epidemiology world, where a lot of the things that you're looking at are binary variables, like did this person get cancer, yes or no, or is this person a smoker, yes or no: before you do some larger piece of analysis, even if you don't know the statistics for it, it's a really good idea to run count on the variable of interest. Right, so you could end up with a data set of 100 people where 99 of them don't have the disease and one person has the disease. And it doesn't matter what model you create. If the model just says, no, this person is fine, it's going to be right 99% of the time, and that's going to look really good. There's a statistical term for this, and it's called class imbalance. So if you don't know anything about statistics, look up the column that you're actually using as your outcome and run a frequency count on it. That is going to help you and the statistician you work with figure out what's going on with your research question. All right, so that is my spiel about using counts.
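As a quick sketch of the frequency count just described, on made-up data (the `smoke_toy` values and the `vital_status` column name are assumptions). The added `share` column is one simple way to eyeball class imbalance in an outcome.

```r
library(dplyr)

# Made-up stand-in for the workshop data; names and values are assumptions.
smoke_toy <- data.frame(
  vital_status = c("dead", "alive", "dead", "dead", "alive", "dead")
)

# count() returns one row per unique value, with the frequency in column n.
vital_counts <- smoke_toy %>%
  count(vital_status) %>%
  mutate(share = n / sum(n))   # proportion of rows in each category

print(vital_counts)
```

If one category holds nearly all of the rows, a model that always predicts that category will look accurate without telling you anything useful.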
How do you count multiple things? Typically in a two by two table you're counting two different categorical variables. And in count, all you need to do is pass it the other column, or multiple columns, you want to count by. So in this little block of code, you'll see that we don't really get the two by two view. We get more of what's called, quote unquote, a tidy view. Each row represents some combination of the unique values in our data set. So originally vital status had two values and disease had three values, so two times three is six, and we have six rows. It's a different row for each unique combination, and then we get a count for it. This is useful, especially if you're trying to process this data further. If you're used to looking at the two by two table itself, the easiest way I've found to do it is that we have to leave the tidyverse world a little bit. There's a different type of R notation that we will show you. Everything up until now has been done using the tidyverse series of functions, right? We have to load up dplyr to use select. But technically, if you wanted to pull values out of a column, you don't have to use the select function. You can do it in R without loading anything. If all you care about is the values of a column, as a sequence of its values, you can use the dollar sign notation instead of using select. So if I just highlight smoke_complete dollar sign vital status, you'll see that this printout looks a little bit different from piping smoke_complete into the select function. If I contrast this with smoke_complete then select, these two things print out on the screen differently. One is actually a one-column data set of values, and with the dollar sign, because it's just a single column, it is a sequence, or technically a vector, of values.
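The tidy count over two columns, and the difference between `$` and `select()`, can be sketched as follows. The data is invented so that all six combinations occur; column names are assumed.

```r
library(dplyr)

# Made-up stand-in for the workshop data; names and values are assumptions.
smoke_toy <- data.frame(
  vital_status = c("dead", "dead", "alive", "dead",
                   "alive", "alive", "dead", "alive"),
  disease      = c("A", "B", "A", "C", "B", "C", "A", "A")
)

# One row per combination of vital_status and disease that occurs,
# with the frequency in column n (the "tidy" view).
status_by_disease <- smoke_toy %>% count(vital_status, disease)

# Base R dollar sign: pulls the column out as a plain vector.
status_vector <- smoke_toy$vital_status

# select(): returns a data frame that happens to have one column.
status_column <- smoke_toy %>% select(vital_status)

print(status_by_disease)
print(status_vector)
print(status_column)
```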
There's a little difference between the base R notation and the tidyverse notation for a lot of things, but this notation ends up being really useful, because we can take this column and the disease column and pass both into a function called table. So we now get the little two by two, or really this is a two by three, table that we're most used to seeing, right? And if this was something like treatment, effect, disease, outcome, you would be able to do your sensitivity, PPV, NPV, your two by two table calculations, off of it. One other thing that you can do that's pretty useful when you table things: you'll notice that it's the same block of code, and we're taking this table output and passing it into a function called addmargins. All it's going to do is literally add the summation margins for you. So even if you are going to manually calculate sensitivity, specificity, positive predictive value, and negative predictive value, you're going to need the sums of those rows and columns as part of your calculation. So even if you don't use R for anything else and you're using R to give you the set of values that you can plug in, use R for that. It's great. So that is using a different syntax notation and leveraging the rest of the R ecosystem to help you with things. I saw someone put a hand up, but then it went down, so I don't know. Oh, okay. Okay, so the other thing, and you'll hear more about generating tables in the R/Medicine talks. But I don't think I actually have table one. Oh, I don't think I actually have this data set generated. It's at the end of wrangling, I think. Okay. I know what's going on. I think this is that. No, no. There we go. I was following along and not running the code along before.
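The cross-tabulation with `table()` and `addmargins()` described earlier in this passage can be sketched in base R like this, with made-up data and assumed column names:

```r
# Base R only: no packages need to be loaded for this.

# Made-up stand-in for the workshop data; names and values are assumptions.
smoke_toy <- data.frame(
  vital_status = c("dead", "dead", "alive", "dead",
                   "alive", "alive", "dead", "alive"),
  disease      = c("A", "B", "A", "C", "B", "C", "A", "A")
)

# Two vectors in, one contingency table out:
# rows are vital_status values, columns are disease values.
cross_tab <- table(smoke_toy$vital_status, smoke_toy$disease)

# Add a "Sum" row and a "Sum" column holding the row and column totals.
with_margins <- addmargins(cross_tab)

print(cross_tab)
print(with_margins)
```

Individual cells and totals can then be pulled out by name, for example `with_margins["dead", "Sum"]`, when plugging numbers into a hand calculation.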
So we showed you a little bit about piping the raw calculations into something that prints out a little bit prettier. What you can do is, let's say we have our table one data set, and this is something that you wrangled and processed and did all the calculations on in R. There is a whole slew of other functions to help you make this data set essentially publication ready. And there are a few talks over the next couple of days that will point to different libraries to help you with that. One of them is literally called table1. You know that whole table in every medical journal that is essentially, these are the descriptives of our patient population. Once you get your data set and all those frequency counts made, you can pipe it into the table1 function, and it'll generate the table with the proper collapsing of rows and columns, so it formats prettily. And like Ted said, gtsummary is also a really, really useful one. There are a bunch of libraries and tools that can help you with generating tables and making them pretty. And this is really nice because we've all been in that situation where we're almost done writing the paper, and then all of a sudden we get a new data set and have to regenerate all of our figures and tables. The more you can keep that stuff in R, so that all you have to do is tweak things at the very end, the less likely it is that you'll end up with data that's out of sync with whatever you published. There are also going to be a lot of talks about using R and RStudio to generate PowerPoint slides, and we've all been in a situation where you're giving your talk and you stare at your results as you're talking, and then you say to yourself, or out loud: that's the wrong figure. So we've all been there.
And so with R, there is that learning curve, but because we're using a programming language, when the data set updates, it can update the entire sequence of things generated from it. Yeah. Ted, do you want to cover the janitor package? I guess that's really the last thing left. Sure. Do you mind just kind of scrolling down? Oh, well, I guess I don't have the RStudio... Okay, all right, cool. So, I will talk. This is a package that doesn't seem that useful when you're first starting out. But as you start to work with other people, you will realize that there are certain things that drive you mad about your collaborators. So if you just run this code block here: this is a similar data set to what we have, but do you notice some weird things about it, especially the column titles? And to some of you who have never worked in R and have always been in Excel, you might not even realize that it's something painful. But those spaces in the column names can just wreak havoc when you're doing analysis, because a space is kind of like a special character. And in Excel, all the time, you're going to find symbols, dollar signs, backslashes, because Excel will let you do whatever you want to do. It makes it hard for analysis, though. Yeah. And in the example we have here, say you're doing a multicenter study, right, and everyone is supposed to report the same columns, but there's one group that basically decides to do things their own way in terms of the naming. So this function that I'm going to show you, from this package called janitor, basically gives you a way to standardize names. If you go down here, there is a function within the janitor package called clean_names.
So what we're going to do is take the data set with the bad column names and basically make a data set with good column names. What happens is it gets rid of things like capitalization, and it transforms spaces into underscores. This is the default behavior, and you can change it based on how you like to capitalize things. This default style is called snake case: everything is in lower case, and white space is transformed into an underscore. So when you're working with multiple groups, this helps to standardize things, so your code hopefully will work across these data sets. It seems like it's not that useful when you're first starting out, but I would say this and the case_when function are the two things that I recommend you start looking at next, because they will make your lives easier. We don't have time to cover case_when, but it lets you make continuous variables categorical, for example. And it's just such a useful function that you'll probably use it in almost every dplyr pipeline that you build. That's pretty much it. Does anyone have any questions about janitor? Oh yeah, and remove_empty is super awesome as well. So janitor is not part of the tidyverse, but it is such a great package. It also has this function called tabyl, which is spelled t-a-b-y-l, and that works within a tidy workflow. It basically lets you do these kinds of cross tabs. Oh, and thanks for pasting that, Sylvia. So these are good next steps.
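To make the two recommendations concrete, here is a small sketch of `clean_names()` from janitor followed by `case_when()` from dplyr. The messy data frame, its column names, and the age cut-offs are all invented for illustration.

```r
library(dplyr)
library(janitor)

# Made-up data with the kind of column names a collaborator might send.
# check.names = FALSE stops data.frame() from altering the names itself.
messy <- data.frame(
  `Patient ID`         = c("PT-001", "PT-002", "PT-003"),
  `Age at Diagnosis`   = c(45, 63, 71),
  `Cigarettes Per Day` = c(1.5, 3.0, 2.2),
  check.names = FALSE
)

# clean_names() converts the column names to snake case by default:
# lower case, with spaces replaced by underscores.
tidy <- clean_names(messy)

# case_when() turns a continuous variable into categories.
# Conditions are checked in order; TRUE acts as the catch-all.
tidy <- tidy %>%
  mutate(
    age_group = case_when(
      age_at_diagnosis < 50 ~ "under 50",
      age_at_diagnosis < 70 ~ "50 to 69",
      TRUE                  ~ "70 or older"
    )
  )

print(names(tidy))
print(tidy)
```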
Like, as you go on you'll start becoming like a data ninja, you'll be able to start manipulating things. And, you know, it feels like a superpower. Anything else that people want to add about janitor or anything?
I mean, yeah, it is a superpower. Like, somebody sends you an email, like, hey, could you run that analysis on these numbers? And you go into R and look at that data and pop it out, and you're like, yeah, here they are, 15 minutes later. It feels really great. Yep. The best. This is kind of the high level of data ninjutsu, but being able to do a lot of these things in real time, it feels awesome.
And we do have a little bit more time, so unless Daniel or Ted, you have other stuff, I was going to just show some stuff about rendering documents. And so we have been living in this. It's called Rmd, which is an R Markdown file. And you might have noticed this happy little Knit button here. If you click on it, you're going to see some different options. What I want everybody to do is go over and click on this file here, the one that looks like a paper with the circle and Rmd, called knit previews, and click on that. And you're going to open it up. Oh, I forgot to install a package, guys. So hopefully it will prompt you to install that package, but it might not. So this might mostly just be you guys watching me. So I'm going to show you the base file, which is this. It's the same thing we showed: the little YAML header, the code chunks, text. And R Markdown, this is what it's called, allows you to take that base file, the Rmd, and put it into different formats. So you can knit it to an HTML file. And can you guys see my new file that just popped up? I don't think so. I think I see the knit preview. Oh, you do see the knit preview. Oh, no, the pop-up didn't show up. I think I am sharing the wrong type of screen.
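
Since the pop-up did not come through on the shared screen, here is a rough sketch of what the Knit button does, written as code. It assumes the rmarkdown package and pandoc are available (RStudio ships with pandoc). The file contents are a made-up stand-in with the same three parts described above (YAML header, text, code chunk), not the workshop's knit preview file.

```r
library(rmarkdown)

# Three backticks, built in code so the chunk fence can live inside a string
fence <- strrep("`", 3)

# A tiny, hypothetical .Rmd: YAML header, some text, one code chunk
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Knit preview sketch\"",
  "output: html_document",
  "---",
  "",
  "Some text describing the analysis.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd_path)

# The Knit button calls rmarkdown::render(); the same source file can be
# sent to different formats. render() returns the path of the file it wrote.
html_file <- render(rmd_path, output_format = "html_document", quiet = TRUE)
word_file <- render(rmd_path, output_format = "word_document", quiet = TRUE)

# PDF output additionally needs a LaTeX installation, so it is left
# commented out in this sketch:
# pdf_file <- render(rmd_path, output_format = "pdf_document", quiet = TRUE)
```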
So you should now see this is what pops up. It's taken this file and knitted it to an HTML document, and unlike a Word document or a PDF, with HTML you can have interactive documents. You can see here I created an interactive plot that you can zoom in on and see what the data points are. You can knit a PDF. You might want to put some of your stuff into a PDF for many reasons, but right, a PDF can't have interactive parts because it's as if it were printed. And then I can knit that to a Word document. And this is great, right, because I might do my document or do my analysis, but somebody else doesn't use R and they want to look at the file in Word. And there you go. I took that and translated it into this, with my pretty little graph.
So that is a bit much. But it's the way to take your analysis, your code, your text, and transform it into these shareable documents. Like, I put everything up on my GitHub pages. Before I had children I blogged all the time. I try to blog there as well. But it just makes it so easy to share, and they don't have to be R users.
And I think Daniel hit on something that was key. You don't have to become an R pro and use it for your entire workflow. You just need to use it for what is great for you. And then, when we showed that roller coaster of R with the ups and downs, it's because once you get comfortable with one thing, you see all the other cool toys that everybody is using and you're like, oh, I want to learn that. And so you're just constantly learning, expanding the circle, but strengthening those core skills of wrangling your data, doing your analysis, creating reproducible stuff.
Yeah, I guess one thing we do want to talk about is learning communities for R. So I just want to mention R for Data Science is a really wonderful one. They're just a very welcoming community.
So if you think it's just, let me see if I can find the address. If there's anything else anyone else wants to add, I will pull up R for Data Science.
Somebody asked in the chat: are there good resources to learn how to write loops for repetitive data wrangling steps? Yes, and there are also tons of other good techniques to get away from loops. Yes, yes. So you can look in R for Data Science if you want to do loops. If you want to do a certain process in the tidyverse, I often find the best way to get the exact results I want is to type in my question or a little blurb about it and then put "tidyverse", and then I'll get a tidyverse-flavored answer.
purrr is a great package. I'm kind of having an existential crisis thinking about how data gets stored. And so I have to learn more about how lists work and how I use them in my own data, because right, once you get really into nested data, not everything is a beautiful flat file. There are complicated things, different kinds of variables. And so purrr is the next thing I want to start building my understanding of.
So it's interesting, because Hadley and I actually had a discussion online about whether people should learn rowwise or purrr. rowwise is kind of purrr-lite, and it's in dplyr. And he said a lot of the tasks that you want to do in purrr you can actually do in rowwise. I actually have a tutorial about that. So I just want to point people to this.
Well, we should all kind of plug our resources. The TAs have done some amazing things as well. So I'm just going to point to my Ready for R course. If you want to continue on, this is a free course that's available to you, and it's all done in RStudio Cloud. And I know Sylvia has stuff and Daniel has stuff as well. So I think if there are no other big questions about the resource material.
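
To make the loop, purrr, and rowwise comparison concrete, here is a small sketch. The tibble is hypothetical: it has a list column, the kind of non-flat data described above, with a different number of readings per patient. This is not the tutorial mentioned, just one way the three approaches can line up on the same task.

```r
library(dplyr)
library(purrr)

# Hypothetical nested data: each patient has a vector of readings
visits <- tibble(
  patient_id = c(1, 2, 3),
  readings   = list(c(120, 130), c(110, 115, 120), c(140))
)

# 1. A base R for loop: fill in one result per row
loop_means <- numeric(nrow(visits))
for (i in seq_len(nrow(visits))) {
  loop_means[i] <- mean(visits$readings[[i]])
}

# 2. purrr: map_dbl() applies mean() to each list element
#    and returns a numeric vector
with_purrr <- visits %>%
  mutate(mean_reading = map_dbl(readings, mean))

# 3. dplyr rowwise(): inside a rowwise mutate, `readings` refers to
#    that row's vector, so mean() can be called directly
with_rowwise <- visits %>%
  rowwise() %>%
  mutate(mean_reading = mean(readings)) %>%
  ungroup()

with_rowwise$mean_reading
# 125 115 140
```

All three give the same numbers; the difference is mostly in how much bookkeeping you write yourself.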
I was thinking about turning off the recording so people can just chat about R, come on to video and stuff. I think we're all in agreement with that, so I'm going to pause.