Thanks, Brian. I'll be on camera for now, so you guys can see my little room here, but I plan on turning that off during the course, just as a heads up. This course is going to be run locally. It's meant to provide an introduction to R for people who are familiar with SAS. We'll go over some useful packages, for things such as formatting frequencies, and we'll go through a basic logistic regression example at the end of the course.
For right now, I'm going to put this link in the chat, and if you guys wouldn't mind, I'll just share my screen. Hopefully this is the right one. Okay, great. If you go to that link in the chat, you should be able to see the GitHub account here. If you go here, click the Code button, and download the zip file, that will get you set up with all the material for the course. Then go back and extract that wherever you would like. I'll put mine here on my desktop. So here should be all the files that you need. There are two main CSV files. The one we'll be working with for most of the class is called COVID, so thank you, Dr. Higgins, for providing that. It's a good repository. And then we have our basic PowerPoint here. If we open that up, we'll get started.
Just as a heads up, we are going to be working locally. So if you all go down to slide six, that will get the installation started for RStudio, and you can download R from the repositories there, hosted on CRAN.
Estimate statements? Can you be a little more specific with that? Sometimes R will provide an estimate where SAS will not, so there's a little bit of an overlap there, I would say. Can you provide a little more clarity before we get started?
Like a CONTRAST statement for getting the LS means or something like that. Usually, instead of doing an LS mean in SAS, I'll use the ESTIMATE statements to get it a little more specific.
So generally, when dealing with summary statistics, you can get most things out of a summary function. But if you're looking for more advanced statistics, depending on what you're doing, I provide a reference to the site package. You should be able to use the site package for some of those estimates. I think that would be a good resource.
Thanks.
Okay. So just as a reminder again, let's get going with the downloads. It should take about 10 minutes. Essentially, you need to download R and then RStudio for the IDE. You can download those anywhere and then navigate to the directory where we put these files.
Okay. So you should all be able to see my screen. This is SAS to R for medicine. Again, this is meant to provide an introduction to R for SAS users, so we'll go over some basic functionality. My name is Joe. I'm a data science consultant here at Procogia. We focus on a lot of different data science solutions, mainly RStudio Server installations and some data engineering in the cloud, and we also have some people with more of a statistics background.
For the agenda today, we have an introduction, where we'll cover what R is and how it relates to SAS. We'll go through some basic reading in of our data, then focus a little on data wrangling and some packages there. There are some ggplot examples that I provided in an R file if you want to look into more visualizations. We only cover a basic scatter plot in this course, but that might be of interest to some people. Then we'll go through some data exploration. And finally, we'll do a little walkthrough of a logistic model and establishing a base fit.
So just as a reminder about the code of conduct, this is a product as early as possible. There's going to be no harassment between the course participants, or of the instructor. There's a no-tolerance policy here at R Medicine. If you're looking for the complete code of conduct, that's provided in the link below, so feel free to explore that if you wish. There's also a restriction: no screenshots, recordings, or photographs of any type during this presentation. And I think that carries over to all presenters and instructors as well.
All right. So, the introduction. What is R? R is an open source programming language that was developed essentially just for statistical analysis. And RStudio is the most frequently used, freely available integrated development environment, known as an IDE for those who aren't familiar with the term. Essentially, that's just a friendlier user interface. We can run R through a terminal or a command line, but executing our code somewhere that's a little friendlier to the eye is ideal.
And here's a quote: if programming in R is like driving a car, then you can think of R as the engine under the hood, and RStudio as the steering wheel, the accelerator, the brake pedals, and the dashboard. RStudio will help you autocomplete code, which is very useful. And if you ever find yourself working with dashboards, you can build those in here, and the Shiny package will automatically provide template code for you. So there's a lot of additional functionality compared with just coding in a terminal, although that is possible.
Okay. And again, here's the installation link. I'd estimate it takes about five to 10 minutes.
So we'll take a brief pause before we move along to see how the installation is going for everybody. Is anybody having any issues with the installation from CRAN or with the RStudio IDE? Is anybody unsure about which binaries they should be installing? Feel free to reach out here. I'll give it another few minutes. And just as a follow-up question, was everybody able to download the GitHub repository files for this course? No issues there, I'm assuming. They say silence is golden, right? I'll give it about one more minute and then we'll get moving.
So SAS and R: how do they compare, and how are they different? In SAS, a lot of things are manipulated through the DATA step. The DATA step is used for INFILE statements and for formatting your data if you're not using something like PROC FORMAT. That leads into the next bullet point, procedural analysis, where the emphasis is really on the PROC statement. That analysis can be done with a PROC UNIVARIATE or a PROC MEANS, and there's a lot of functionality as far as the procedure statements go. You can use it for reporting.
But there's a limitation with that. SAS is commercial software, and the recent trend has been that open source is becoming more accepted. That leads to more innovation. People working together in a community allows for quicker time-to-market solutions. I would say there's not as much stringency on user acceptance testing, but there is more of a communal acceptance. So you'll see more machine-learning-based algorithms in R before you will in SAS. SAS, I believe, has launched Viya, which is a machine-learning-based solution. However, that's still relatively new, right?
So they still may be working out their kinks, as opposed to having an entire community working on a package, where it may have had bugs, but they have been resolved since then. SAS will be lagging behind in that sense.
The R counterpart to that is that R stores data in what are known as objects. You might have heard of object-oriented programming. Essentially, we store all of our attributes within a given variable. That variable can store strings, it can store numeric values, it can store a data frame. Pretty much anything that you need to reference can be stored in an object. That allows for very user-friendly functionality, especially when you're defining new functions in R.
And again, just highlighting the point that R is open source: you can work together a little more easily, because all the tools you need to collaborate with somebody on some code are open source and readily available for free. There are more people involved in this community structure as well, so I think there's a bit of a shift from SAS to R in that sense.
There's also the statistical side, like what was brought up earlier with a least squares estimate on some fit. R will provide you with an estimate in cases where SAS does not. Say you overfit a given model and just throw all your variables in there. It may say nothing is really significant, but R will still give you an estimate of what that weight would be for a linear trend or something like that. SAS may say that it's incalculable and just give you some blank dashes. So R allows the user to interpret what's given, as opposed to the software interpreting it.
Okay. So why R?
Again, faster computation time. In SAS you're going row by row. You're given observations, and the variables are determined by the fields, so you're essentially going horizontally when you process data, as opposed to column by column. If you store everything in one variable, you can process that one variable faster than you can process a bunch of variables horizontally down the line for each observation. So computationally it's more efficient, because R stores the values in memory and works on those values vertically.
The additional collaboration tools are RStudio Connect as well as Git version control and, as mentioned earlier, Shiny dashboards. When you're done with your analysis, do you need to process it through a third party, say BI or Tableau? With R you have the functionality to keep developing your code and allow more public viewership of the analysis you've done. You can just provide an HTML link and say, here, go to this dashboard if you're interested in this analysis. Nobody has to open up any of your code. That allows for more managerial insight and collaboration among your analysts or data scientists.
Then the third bullet point, advanced data science techniques. Some well-known machine learning algorithms are available in R but still not readily available in SAS. Neural nets were a primary example of that, before Viya implemented some type of solution for convolutional neural nets. But there may still be more user-specific functionality in R when, for example, defining a cross-validation within a model, right?
So SAS goes by statistics for the most part. It lives by that, and as a user you have to live with whatever the output is. There's no finicky solution, I guess you can say, when comparing an inferential statistical model with a machine learning model. But if you ever want to implement more advanced data science models, they may be available in Python first, and the next language might be R. It's not going to be SAS as time progresses. So SAS is going to be continuously playing catch-up in this ever-evolving data science world.
Then the fourth bullet point: it's cost-effective to have open source software. A single base license for SAS, and this may be an old estimate, was about $8,700 for the first year. That goes down with time. However, it's still an additional cost, and that's only for a single-user license, so it does add up. You could have a very powerful server setup for an open source system for the cost of that one user.
And then there's the increased talent pool. A little about my background: I've gone through a few degrees, and I continuously see more open source teaching as opposed to SAS-based teaching. One recent example I recall was trying to get residual estimates from a basic logistic fit. Those estimates were hard to replicate in SAS. I needed to go through about three iterations of pulling those values out. With PROC LOGISTIC or something like that, you have the option to output the standard deviation of your residuals. With R, however, it's just a matter of addressing a column that's already given in the output.
And you can tailor the output to always reproduce the same summary statistics, I guess you can say.
So functionality-wise, how does SAS work and how does R work? SAS goes through DATA steps. In R, you're given expressions that define your data, and you can manipulate those with functions. SAS goes through procedures. The same thing can be expressed with functions. SAS uses macros when you want to set up some procedure that you're going to iterate through several times in the future. Again, that's expressed as a single function, and that function can just be called in the script, like a header file. And then SAS functions: the basic ones, like IF-THEN and DO loops, are just basic R functions.
The Output Delivery System is a big one. It is replicated through what's known as R Markdown. If you ever wanted a summary table, you would output that to a given file. I think the new cloud-based SAS solutions on the free tier are limited to HTML only. It used to be Word, rich text format, and PDF as well as HTML. So the freeware seems to be getting more limited for students, whereas with R Markdown we're continuously evolving how we output text-based solutions for reporting.
Okay, let's check. Okay, great. So, R packages. What is a package? In brief, a package is just a shareable collection of code that can be used to perform a specific function or any type of task. Say you have a bunch of functions that you've already defined. If you want to package all those functions together, you can make your own package of packages. It's a never-ending cycle, I guess you can say.
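To make the macro-to-function mapping described above a little more concrete, here is a minimal sketch in base R. The function name summarise_numeric and the ages vector are hypothetical, made up for illustration; they are not from the course files. The idea is simply that logic you would wrap in a SAS macro to reuse several times becomes one R function that you define once and call wherever you need it.

```r
# Hypothetical helper: the kind of repeated summary a SAS macro might wrap.
# It drops missing values, then returns a named vector of basic statistics.
summarise_numeric <- function(x, digits = 2) {
  x <- x[!is.na(x)]                      # remove missing values first
  stats <- c(n    = length(x),
             mean = mean(x),
             sd   = sd(x),
             min  = min(x),
             max  = max(x))
  round(stats, digits)
}

# Made-up example data with one missing value.
ages <- c(34, 51, NA, 47, 29)

# Call the function as often as needed, on any numeric vector.
res <- summarise_numeric(ages)
res
```

If you keep helpers like this in their own file, calling source() on that file at the top of a script is one way to get the "header file" behavior mentioned above.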
And who can make a package? Anybody can make a package. Packages are generally made publicly available via CRAN. CRAN stands for the Comprehensive R Archive Network, and that's where you'll be downloading R from, so your binary files come from there. You can access most publicly available packages through CRAN. There are also private packages. There's functionality in R that allows you to download things from external repositories, and you can make private packages if you want, which can only be shared between peers. So packages can remain private if you wish. But for the most part, again, we're a community-based group of programmers, I would say.
Okay. And then, from a high level, SAS procedures are expressed within R packages for the most part, and the functionality within the packages helps you replicate in R what you would do in SAS.
Okay. I'm going to take a quick pause here and double-check the local installation, since this is where we get into the IDE. We'll go through a brief explanation, but first let's make sure that everybody's all set with their installations. No hiccups there? Everybody's looking at the files? Oh, yeah. And again, when you're on GitHub, you just click Code, download it as a zip, and store that locally somewhere you'll remember. Essentially, you'll load all the files through this little viewer pane here.
Okay. So, basic navigation of the RStudio IDE. If you have R installed, you should be able to install RStudio. And don't mind the yummy pasta recipe. That's just some pasted code that I pulled. Essentially, the script will be loaded here. You can see we have our kitchen analogy here. And this is where we create scripts.
You have the option to rearrange all these panes however you would like. However, the default layout looks something like this. This is where scripts are loaded. In the pasta recipe sense, scripts are recipes, and they record how we do things. When you're looking to write anything, and when you're looking to save your work for access later, that will be in your R script here.
You also have the console in the pane below. The console is where the cooking of the code happens. You can execute code in the console without writing anything in the scripting window up here. That's for the basic commands that you would write. I find myself writing a lot of summary statements in the lower left-hand portion just to view some of my data sets and see how the output was generated before I finalize it in the script up here. As an analogy, you can cook here without using a recipe, but you'll struggle to remember exactly how to create the dish in the future. So it's better to use a recipe. If you're ever going to need that code again, the best practice is to put it in the script that you're about to save. That's just a general rule.
Going to the upper right quadrant, quadrant one on a Cartesian plane, this is the environment. The environment is where R will store the local variables that you defined in your script. It will store any data sets and any vectors that you have. Pretty much anything you're going to use in your code that's already imported and read into the environment will be placed here.
So you'll be able to see things like the numeric type, what each variable is made of, and a basic summary of what's going on in your code. It's pretty useful. When you're done with your code, a good practice is to clean your environment. If you don't plan to use those variables again, or you have overlapping variable names in another script, you just click this little broom here and it will clear everything for you. The analogy there is that you can put ingredients, meaning data, and your finished dish, essentially your model output, here to use while you continue to cook. Say you have some model output and you can't remember what you called it because you're down around line 104. You can see the name over here and reference it later, for example if you want to compare divergences of a given model relative to your predicted values.
Then the viewer pane in the bottom right-hand corner is essentially where you'll see your files. You can see we have a few tabs here: files, plots, packages, help, and viewer. We'll go through an example in a second, but in general this is where you see your file structure. For any type of file that you're looking to interact with, this shows what's within the file path of your set directory.
And as a brief overview, packages are like tools, like a saucepan. If you need one, you go out and get one that someone has already made. That's done through the install.packages function.
So if you're looking to do anything in R, somebody has probably already done what you're looking to do. Rewriting packages is not the best use of time. As long as a package is reliable and well maintained within the community, it should be fine to use. There's no real harm in it. It's been a few years for some of these. And as a side note, every time you want to use a package, you reference it with the library command up here. We'll go into a little more detail in a second.
I want to pause here and make sure that everybody is able to open up their instance of R. This will give me a whole new one. Okay, great. So this is what the default layout looks like. I have just a base setup here, and you can see we're missing one of the windows. If we wanted to create a new file, we would go to File and then New File, and you can see the shortcut there and all the options that you have. We have a Markdown file here, HTML, CSS, and so forth. For now, we're just going to click R Script.
If you want to follow along, that's great, feel free, but this is just going to be a small example. If I wanted to install a basic package, I would reference that up here and then type in the package name. dplyr would be an example. So I spelled it right. And then you can see, okay, probably just a spelling error. Okay, and then you can see I used Tab there for the autocomplete, which is very useful. So this is how you would install a base package. I can install that, I guess. And then to reference that package, it would look something like this. So those are just high-level references.
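The install-then-load pattern described above can be sketched as follows. dplyr is the package named in the session; the dplyr_loaded flag is a hypothetical variable added only so the script reports what happened. The install line is commented out on purpose, since it needs an internet connection and only has to run once per machine.

```r
# Install once per machine (needs an internet connection). It is commented
# out here so that re-running the script does not reinstall the package.
# install.packages("dplyr")

# Load the package at the top of every script that uses it.
# requireNamespace() checks whether it is installed without attaching it.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  dplyr_loaded <- "package:dplyr" %in% search()
} else {
  message("dplyr is not installed yet; run install.packages(\"dplyr\") first.")
  dplyr_loaded <- FALSE
}

dplyr_loaded
```

Note the quotes: install.packages takes the package name as a string, while library accepts the bare name.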
And then in the terminal window here, you have a general console where you can run R scripts or jobs that you have scheduled. And this is the file structure in the lower right-hand corner where my mouse is. You have files, plots, packages, help, and viewer. If you ever need help with a given package, this is super useful. There are shortcuts available. If you're in here, you can type a double question mark followed by dplyr, and it will bring up the help menu for dplyr. You can see that it pulled it up within this help window. So there are shortcuts to find help within R. And if we're looking for the introduction to the package itself, it will give you an introduction, and you can see there's a Star Wars example. We're going to be going through the Star Wars example here, so forgive me for keeping that in this course. I like Star Wars.
Okay, so for a brief introduction to the course, what we're going to do now is make sure that we have everything set up. I'm going to clear this out. The shortcut to clear the console is Ctrl+L. That's how I clear the console. I think it's a very quick and efficient way to do so. Be careful what you're doing. Also, if you're looking for shortcuts in the console, you can hit the up arrow and it will go through what was entered earlier, and you can see I missed an s in install.packages. So just a few tips there.
What we're going to do now is navigate to where we stored that data. Mine is on my desktop, and we want to know the path to where we're going. For this portion, I want to go to my desktop, which is already defined within my home reference. You can see I'm in Home, I'm in my OneDrive, and I'm on my desktop.
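The help shortcuts mentioned above look roughly like this when typed in the console. This is a small sketch using base R only; mean and the search phrase are arbitrary examples picked for illustration, not something from the course material.

```r
# Exact lookup: opens the help page for one function or topic.
?mean                      # shorthand for help("mean")

# Fuzzy search across the help pages of every installed package,
# useful when you do not know the exact function name.
??"working directory"      # shorthand for help.search("working directory")
```

In RStudio both forms open in the Help tab of the lower right pane. The single question mark needs an exact topic name, while the double question mark searches titles and keywords.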
And you can see here's the list of files that we have for this course. All right. Home is the default starting location for a given computer. So if I wanted to go to where my files are, I'd go to my OneDrive, go to my desktop, and then go into the folder where the data is stored. Once we're in this folder, we're going to set it as our working directory. We click this More button and then go to Set As Working Directory. You can see that what was printed in the console is our working directory. This little tilde here is a shortcut for the prefix that's defined as home. So my home in this sense would just be my home, and then we work down through the layers from the root. That's not super important. What's important is establishing your working directory within these folders.
If you ever want to see what your working directory is, you can see that, before I even completed this code, it returns the function that we need. We'll just reference that, and it will print out our working directory. These are useful tools for later if you need to reference them. Okay, I'll keep this up over here.
All right. So we'll go back to the PowerPoint. If anybody has any problems, feel free to go to the chat. I'm not very familiar with Zoom, so I'll try to keep this chat open as best I can for you.
All right. So, just getting started. At a basic level, R is a calculator. We can see it performs addition, which is pretty basic. Then there's exponentiation, multiplication, division, and the modulus operator, which essentially gives the remainder, right?
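The two things covered above, checking the working directory and using R as a calculator, can be sketched in a few lines of base R. The session sets the working directory through the Files pane; getwd and setwd are the code equivalents. The sketch switches to R's temporary folder (tempdir) only so that it runs on any machine; in practice you would pass the path to your own course folder.

```r
# Print the current working directory.
getwd()

# setwd() changes it and invisibly returns the previous one, so we can
# switch back afterwards. tempdir() stands in for your course folder here.
old_wd <- setwd(tempdir())
getwd()
setwd(old_wd)

# R as a calculator.
2 + 3        # addition: 5
2^3          # exponent: 8
4 * 5        # multiplication: 20
10 / 4       # division: 2.5
10 %% 3      # modulus (remainder): 1

# Order of operations is respected.
2 + 3 * 4    # 14
(2 + 3) * 4  # 20
```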
And then the order of operations: R will just follow that. So those are all the outputs that we would expect.
So, assigning variables. To assign variables, R uses that little arrow symbol. You're pointing at what you want to assign to. That's how I remember it. And again, these are all known as objects. X is now an object, Y is now an object, Z is now an object. What we store in each object defines what its structure is. Here, we want to make a few assignments and store some calculations in variables. We call the first one X and store 32 times 4 in it. We create Y and give it the value 3. Those are both whole-number values. Then we create a third variable, derived from a calculation on the two previous variables, and call that Z. You can see we just have X over Y, and we assign that to Z. If we output Z, which is just typing the letter Z and executing that line, it will give you a printout. This is the output you'll see in R, and it does the basic calculations that you would expect from a calculator. Assigning things is pretty intuitive. It's not super challenging, but if you need to assign anything, you need that little arrow, and the arrow always points at what you're assigning to. That's a good way to remember it.
Okay. Some more complicated data structures are known as vectors. You can think of a vector as a collection of items. It can be a bunch of strings or some numerical values, such as integers or real numbers. You use the function known as c, and whatever is inside the parentheses of c gets concatenated together.
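The assignment example walked through above looks like this in code. The values are the ones from the session (32 times 4, then 3, then the ratio); the variable names are lowercased here, which is an assumption about what the slide shows.

```r
# The arrow points at the name that receives the value.
x <- 32 * 4   # x now holds 128
y <- 3
z <- x / y    # derived from the two previous objects

# Typing the name on its own line prints the value.
z             # prints [1] 42.66667

# Numbers typed this way are stored as doubles ("numeric"), even when
# they happen to be whole numbers.
class(x)      # "numeric"
```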
So if we look at this pop variable here, what we're doing is taking the function c and combining 234567 and 10,000, all in one variable called pop. It will concatenate all those arguments and store them accordingly. And again, how do we work in R? We work in columns as opposed to rows, so all of those will be observations in that sense.
Then it's the same thing below. We have area square mile, and we're taking the arguments 83.78, 46.87, and 503. You can see the different decimal precision there. It doesn't affect how the variable is stored. R stores each value as is, based on the assignment.
Then a third example: we have city. This is an example of concatenating string arguments together. We have Seattle, San Francisco, and LA, and all we're doing is taking these three values, concatenating them into a single object called city, and storing them for later use.
Now say you want to see the output of those. For pop, everything prints as a whole number, because that's the way we defined those values. For area square mile, we notice that 503 is printed in the same style as the two previous values. That's because R takes in the argument as 503 but prints it in a format consistent with the other values. Since 83.78 has two decimal places, it prints 503 to two decimal places as well. And then there's our output from city. You'll notice a little bit of spacing here. That's based on the size of the strings that are passed through. We have different sizes here, so R pads the spacing to fit them all evenly on the printout.
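Here is a small sketch of the three vectors described above. The area and city values come from the session. The pop values are hypothetical placeholders, because the figures on the slide are not clear from the recording, and the name area_sq_mi is likewise an assumption about how the "area square mile" variable is spelled.

```r
# Population per city: HYPOTHETICAL placeholder values, not the slide's.
pop <- c(750000, 870000, 3900000)

# Area in square miles, as read out in the session. Note that 503 is
# typed without decimals.
area_sq_mi <- c(83.78, 46.87, 503)

# A character vector: strings are concatenated the same way.
city <- c("Seattle", "San Francisco", "LA")

pop
area_sq_mi   # prints [1]  83.78  46.87 503.00 (common two-decimal format)
city         # shorter strings are padded so the elements line up

# Every vector has a length and a single type.
length(city)       # 3
class(area_sq_mi)  # "numeric"
class(city)        # "character"
```

The stored value of the third area is still exactly 503. Only the printed representation is padded to match the other elements.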
So whichever string has the maximum length, R uses that and says, okay, these all need to take up the same number of characters in order to match length. So sometimes spacing occurs, and that can be fixed with a basic justify command or something like that if you're doing some reporting.
Just as a heads up, if you need references on any of the packages that are referenced in this course, they're all hosted on CRAN. They're public and available. We don't have a lot of time to cover some of the examples, but a lot of the packages provided in this course are very useful for replicating what you do in SAS. There are some further modeling examples too. There's a model sum function that you may see, which would be useful for future use as well. So I encourage you to continue to explore the packages provided here. Let me just get a quick sip.
So, some more complicated data structures. This is where we get into data frames. There are two main types of data structure that you'll see for values: data frames and tibbles. A data frame is essentially a collection of vectors that all have the same length, and it's the most useful structure for any type of data analysis. Most of the time, when you're seeing some R segments, the data will be represented via a data frame. The output of a tibble is a little more structured, because it emphasizes what's known as tidy analysis and tidy storage structure. A tibble is a specialized version of a data frame designed with that tidy analysis in mind: the same number of observations in each row, and the same number of variables across the data set.
So if you have one value within a given data set, that's not necessarily a tidy value for a variable. And you can have long printouts or shorter printouts. Essentially, it'll be a formatted data set of equal length and equal width for a given set of observations and variables. The two main differences between a data frame and a tibble are printing and subsetting. Tibbles have a refined print method that shows only the first 10 rows and only the columns that fit on the screen. So when not all the variables fit on a line in the print window, you'll see a truncated data set that prints those first 10 rows, and the remaining variables get carried down into a note that says something like 10 more variables, along with the list of the variables that aren't shown. In that sense, tidy data is very nice to have. It's kind of like a standard. But for the most part, I find myself working a lot more with data frames because I generally like to see that full output, just as a use case scenario. And again, tibbles are very strict about how you subset. If you try to access a variable that doesn't exist, a tibble will complain about it (with the dollar sign you get a warning), as opposed to a data frame, which will just quietly put out nothing. It'll say NULL. So those are the two main differences between those two data structures. You'll find yourself working more with data frames than with tibbles, but for reporting, tibbles are kind of ideal just because of how neat they are. So this is just a basic example of a data frame. In this example we have cityinfo.d, d for data frame. And you can see that how we name variables here is a little bit different too, right?
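The printing and subsetting differences described above can be sketched with the city vectors from earlier. This is a hedged illustration: the population values are placeholders, area_sq_mile and cityinfo.t are assumed names, the column elevation is deliberately made up so that it does not exist, and the tibble package is assumed to be installed.

```r
library(tibble)

pop <- c(700000, 800000, 3900000)           # placeholder values
area_sq_mile <- c(83.78, 46.87, 503)
city <- c("Seattle", "San Francisco", "LA")

# Same three vectors, two containers
cityinfo.d <- data.frame(pop, area_sq_mile, city)
cityinfo.t <- tibble(pop, area_sq_mile, city)

# Printing: the data frame shows plain columns, while the tibble adds a
# dimension header and a column type (<dbl>, <chr>) under each name
print(cityinfo.d)
print(cityinfo.t)

# Subsetting a column that does not exist:
# the data frame silently returns NULL
missing_from_df <- cityinfo.d$elevation

# the tibble also returns NULL but raises a warning; it is captured here
# so that the message can be inspected
missing_from_tbl <- tryCatch(
  cityinfo.t$elevation,
  warning = function(w) conditionMessage(w)
)
print(missing_from_tbl)

# A tibble converts back to a plain data frame when the full printout is wanted
back_to_df <- as.data.frame(cityinfo.t)
```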
So putting in a period and then d is, I wouldn't say a common practice, but it's a preferred practice in some instances. When I'm programming, I name my functions with .f and my data frames or my tibbles with .d and .t, just so users have an idea of what to expect. I've seen that before and I just picked up that practice. It's not a recommendation one way or the other, just a thought when you're naming your variables. So here we have cityinfo.d. And what are we pointing at, right? Again, we're pointing at this variable name. We're using the data.frame function to combine pop, area square mile, and city. And then we're just going to print that out, right? This final command here prints it. So you can see we have the first few observations here. Well, we have all the observations; it's a small set. We have pop, area square mile, and then city. And you can see again how area square mile printed out: 503.00. So this is just a basic example of how you make a data frame and what the output looks like. In contrast to the data frame, we have a tibble. The assignment statement is very similar, right? And you can see I did leave off the .t here. We're going to name this variable cityinfo, and we're going to assign it using the tibble command. We're pointing at the variable that we want, and we're taking those vectors that we created earlier, pop, area square mile, and city. And then we're going to output that, right? And you can see there's a bit of a difference here. The first thing is that the printout states the structure of the tibble, right? Three by three. It states the type of the data, right? So we have double here, double here, and then character here. And then what do we see for this output, right?
We have sig figs, right? In this case the values are shown to three significant figures, here and here. So how your data is outputted does matter in a sense. But for the most part, the scientific community seems to prefer tibbles, and I would say the data science, non-GMP, non-GxP environments prefer working with data frames. So that's just the quick difference you'll see there: how they print out, right? Okay. So now we're going to get our hands a little bit dirty. Let me just see, I think I have an example one here. All right. So what we're going to do now is go to our RStudio IDE and open up exercise one. We'll walk through this example. For the later examples within the course, I provided some reference code above that you can look into. For those, I'll give you a little bit more time, about five minutes, we'll say, just to look over some code and use some of the packages on very basic examples. But these ones we'll just walk through pretty quickly, so you get an idea and get your hands a little bit dirty with the keyboard. All right. So what we want to do here is create two variables called x and y, right? So we'll just name those right now. There's a keyboard shortcut to make this little assignment arrow: in RStudio it's Alt and the hyphen key, the one next to the plus button at the top of the keyboard, two to the left of backspace; on a Mac it's Option and the hyphen. If that doesn't work for you, you can just type the arrow out. So what we're going to do is create a variable called x, and what we're going to store in there is just the value of four times three. To execute a line, there are two options. We can highlight the line and hit the run button.
So you can see we have one variable now. The other option is to just put the cursor on the line and hit control enter. And then you can see it only executed a single line. So we hit control enter and you can again see it run there. So that's how you assign it. And you can see all the output is here, as well as, going back to our environment, what is this variable and what value is stored in it? So we have x here that we just assigned, and it's given the value of 12, which is the correct output for four times three. So we're going to go through that second example here. Sorry, wrong one. We're going to do three divided by one for y. And then again, control enter. So it looks like we have those two values assigned correctly. We're going to create a third variable here called z, and we're going to assign it the value of x times y. And then we're just going to run that. All right. So I can keep going back and forth to the run button. And you can see now we have the variables x, y, and z (zee, or zed). The values are 12, which we expect for x, three, which we expect for y, and then 36, which we expect from assigning x times y to z. And then if we want to print the output, we just print z, and it gives us our output within our console window here. All right. So now that we've worked on it, or at least created the most basic variables that we could, we may want to save. You can see that the file name up here is currently red. If we click file and then save, you can see this turns back to black once there are no more unsaved changes. So what we're going to do now is create our first vector. And our first vector is going to be called myfirstvector.
So we're going to go through here and assign it the values of four, three, two, and one. How we do that, again, if we want to store multiple arguments, is with the c function. So we assign it four, three, two, and one, and then hit control enter. And you can see we have our myfirstvector here. It's a numeric data type, and it shows the index range for your values: value one, value two, value three, and value four. Some languages index the first element of an array as zero; some index it as one. You can see here that R starts at one. That's important for later use. Now we're going to continue on and create my second vector. So we walk through here, assign it five, six, seven, and eight, and hit control enter. And you can see we have my second vector now. Okay. So now we're looking to create our first data frame. Recall that we used the function data.frame, and then open parentheses, to define those structures. So we're going to call this my first data frame, and we're going to use the data.frame function. We have our open parenthesis here, and then we're just going to pass in my first vector. And you can see exactly what R just did with autocomplete. So we have my first data frame, and then we have my first vector. For now this has no assignment; it's in orange. The pink value means it has a value currently assigned. So we're just going to select that, right? And we're going to keep going and let it complete. What we want to do here is pass in both vectors. And you can see we have our columns there essentially defined. To pass multiple values in as one argument, remember that we use our c command to concatenate them; that stacks the two vectors into one longer column, whereas passing them as separate arguments gives one column per vector.
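The exercise steps so far can be sketched in a few lines. The object names are assumed spellings of what is said aloud (the actual exercise file may use different names), and the two data frames at the end are included only to show what c() does to the shape.

```r
# Simple assignments with the arrow operator
x <- 4 * 3
y <- 3 / 1
z <- x * y
print(z)

# Vectors are built with c()
my_first_vector <- c(4, 3, 2, 1)
my_second_vector <- c(5, 6, 7, 8)

# R indexes from 1, so this is the first element (4), not the second
first_element <- my_first_vector[1]

# Passing the vectors as separate arguments gives one column per vector
two_columns <- data.frame(my_first_vector, my_second_vector)

# Wrapping them in c() first stacks them into a single, longer column
one_column <- data.frame(c(my_first_vector, my_second_vector))

dim(two_columns)  # rows, columns
dim(one_column)
```

Which shape is wanted depends on whether the two vectors are different variables (two columns) or more observations of the same variable (one column).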
And then we're going to print out my first data frame here. And you can see again, the printout does match. So we're going to do the same thing here. We're going to define a tibble, and we're going to call this my first tibble. And we're going to keep going. There's our tibble function. Oh, right, right, right. So there's something that needs to get done first. Sorry about this, I forgot. What we need to do is open up the package installation file. We're just going to highlight all these lines and hit control enter. And then it's going to need to install some of these for a second, so those might take a little bit to run. It's 4:22 already. Oh, man. Let me see. And we have a little bit of a while. Okay. Sorry. I intend this course to run about two hours. These should only take a few minutes to install. Once you have all of these installed, those are the packages we reference in the class. And then going forward, we need to use the library function to load them. And you can see the reason I remembered it at this point is that we need the tidyverse package for tibbles. Tidy data, tidyverse, tibble: just a basic way to remember that. And pretty much everything should be set if you install those at this point. So within exercise one, at the header of the file, what you need to do is load the tidyverse with a library call. Yeah. And then what that does is pull it in, and it shows a little bit of a startup message there. Okay, great. All right. So now we're going to go back down and continue on with our function here. We're going to define these, right? My first tibble, from my first vector. And then we want to concatenate that with my second vector. Boom. All right. And now all we're going to do is print out our final value. So this is going to be my first tibble.
And then we're going to output that. And you can see we have our double type and we have our observations here. The tibble is going to be 8 by 1. Essentially, how that works is that it concatenates the values into a single variable. These are like on like: we have our first vector here, and then we have our second vector here. So it recognizes that the values are of the same type and passes them through as a single column, right? And if we just save that, we'll close this out and hop back over to the PowerPoint a little bit. All right. So this brings us into reading in some data. Reading in data in SAS can be a little bit tedious. There are a few ways to do it. There's a newer option, in later releases of SAS 9.4, where you can use a LIBNAME statement with an Excel engine. That's a more advanced way to do it. The common way is PROC IMPORT, which you can see here. This is just a basic example of the code I used to read in this COVID data set. So you can use the PROC IMPORT statement, or you can import your data using the data step, with an infile statement and an input statement, specifying either the length or the width of the columns that you want. So yeah, this is a basic example of how the SAS code would look using PROC IMPORT, right? It sets the DBMS option to CSV, and it outputs the COVID data as a sas7bdat file, right? In R, it's a little bit different, right? You can read in data from a local file, or you can read in data from a URL. You can also load data that's already predefined in a built-in R package, which is very useful. And some of the examples that we'll be going through today will include data from built-in packages.
So it's useful if you need to quickly access data, as is the medical data repository that I pulled this COVID data from. It has a lot of freely available data, and when you connect through GitHub as well, you can pull in a lot of data that you wouldn't otherwise be able to see, I guess, which is weird to say, but it's useful if you're exploring for data that you need. It's difficult to find test data sets sometimes. All right. So for a given example, say we have a data set called MyData, right? We have a few options to look at this data directly. We can just type MyData, and that will simply print the data out into the console. And again, for tibbles, the output will be limited to the rows and columns that fit on the screen, right? So that print is where the difference is. Then there's View MyData. What that does is pull the data up in a separate tab within RStudio, your IDE, in an Excel-like spreadsheet format, right? So it'll just open up a new tab. The one I find myself using most is the head function, right? You can see two examples there. We have head of MyData, which is the example data set here. It prints the first six rows of data, plus the header with the column names. And then there's head of MyData with the number of rows specified. So similar to how you would use the FIRSTOBS and OBS options in SAS, you can define the subset of data that you want viewers to look at. It's pretty similar with the head and tail functions, I guess. And you can slice out a range of rows as well, not with the head function, but in a similar way. So this brings us to the principles of tidy data. The principles are essentially that every variable forms a column and each observation forms a row.
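The viewing options described above can be sketched with a built-in data set. Here mtcars stands in for MyData (an assumption made so the example runs without any course files), and my_data is an assumed spelling of the spoken name.

```r
# mtcars ships with base R and stands in for the course data here
my_data <- mtcars

# head() with no count returns the first six rows (column names print above them)
first_six <- head(my_data)
print(first_six)

# A second argument sets how many rows to return
first_ten <- head(my_data, 10)

# tail() works the same way from the bottom of the data set
last_three <- tail(my_data, 3)

# Row indexing pulls out an arbitrary block of rows,
# loosely comparable to combining FIRSTOBS= and OBS= in SAS
rows_five_to_eight <- my_data[5:8, ]

# View() opens a spreadsheet-style tab; it only makes sense interactively
if (interactive()) {
  View(my_data)
}
```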
Those are pretty basic standards. And then each type of observational unit forms a table. So those are the principles of tidy data. Now, why tidy data? We're looking for a little bit more consistency, right? Code written for one tidy data set analysis can be easily applied to another, right? There's no more fiddling around with the functionality between data sets. And most functions work with vectors, so storing variables as vectors and columns just makes sense for tidy data. The tidyverse packages include dplyr, ggplot2, et cetera, and those are all designed to work with tidy data, right? We saw tidyr listed as part of the tidyverse package right there. Yeah. So continuing on with some tidyverse packages: it's a collection of packages designed to work with tidy data. Some of them include dplyr, which we'll see a little bit for data wrangling, ggplot2 for data visualization, and stringr, which is essentially for substringing and manipulating string-based data. And then the basic installation that we already covered is install.packages with tidyverse, and that only needs to get done once; after that you have those binaries installed within your environment. You can then load them using the call below, library and tidyverse. But the library tidyverse part has to be run in every script or session. Okay. Let me just get a sip. Sorry, there's no AC. So, the haven package. The haven package is actually installed as part of the tidyverse, though you still load it with its own library call. So again, there are packages within packages within packages. It's pretty useful. What it does is read in your SAS files, your SPSS files, or Stata files, with a different function within R for each one.
So there's a collection of functions there, and it'll read in different data types for you. The downside is that the outputs are tibbles. Those are easily converted to data frames if you desire. Again, I prefer to work a little bit more with data frames as opposed to tibbles, but that's more of a preference about how your data is structured versus unstructured, right? So the two basic commands that come with the haven package are read_sas, which supports your basic sas7bdat files, the kind of data sets that you import using the PROC IMPORT statement, right? And it carries along the value labels that are recorded with them, so the meaning of each variable's values, I guess you can say, right? And then there's the write_sas functionality, which is pretty useful, right? The only downside is that it's a little bit experimental. The overlap between SAS and R is, I don't want to say new, but it's still in development, right? The interoperability between programming languages, right? I think that period is just beginning a little bit. But yeah, write, underscore, SAS, and then the two parentheses, will write your data set out to a SAS file if you wish to move it back to SAS. So here's a basic example of this, and it uses the COVID data set that we'll be working with. You can see I have my object here, covid.sas. Again, it's good practice to label things like that. Then we have our arrow pointing at what we want to store. Then we have read, underscore, SAS, and we just import this covid.sas7bdat file into our covid.sas object. And then we can print out that object. So we use the head function from earlier on that object. And you can see we have our tibble here.
And this is the example of the truncated printout, right? It shows up as you work with more realistic data sets. And this is just an example data set that was obviously anonymized for this use, so none of these are real people. But we have our data types again. We see our tibble, right? It's six by 17. And you can see this is what I was talking about earlier: we have dot dot dot with 10 more variables, right? That's the continuation down below the printout. And we can see result is character, demos is character, age is double, and so forth. Okay. All right. So now we're going to go through a little bit of a programming example. I think this one's example two. Okay, great. So what we want to do here is have our library reference. Forgive me for not having one in there. We're just going to load the library that we need. We can clear the output here, just so we see only what we have. All right. So what we want to do here is read in the COVID data set. Essentially it's what you saw on the slide, but we're going to read in this data set from the beginning. So we're going to call this data set COVID. We're just going to use read.csv. And then we give our file name, which sits in our working directory already, so we don't need to include the entire path that we saw above, right? Again, if we run getwd, we get our entire working directory, right? I don't need to include all of this anymore. What I need to include is just the file name, considering we're already working here, as you can see from the getwd command. In this window, you can see there's the covid.csv file. So in this case, it's just going to be called covid.csv.
And then the only option that we want to add in here is the header option. The header option essentially says whether there is a header row for this data set, right? There are other options, like sep, where we can define what the file is delimited by. But the data reads in pretty neatly, so there's no real sense in doing that for this course. So this is the command that we need, and it will assign our data set to COVID. We're just going to execute that. And you can see we have our observations here. Again, you can see all the variables we had in the previous analysis, but COVID is now imported with 15,524 observations of 17 variables. And we're just going to print out a few values from this, so we use the head function. And you can see here's what it looks like. And you can see how this data is structured, right? So how does the printout tell you how it's structured? It's a data frame, right? Because of how this printout looks: there are no data types for the variables underneath each label, right? So I guess I did add in a library reference. All right. So we can just add in the library reference there for haven. And now again, given that we're in the directory we wish to operate in, we're just going to use the write command. So we're going to write this SAS file out. And then, I don't know, we can just call it covid_example.sas7bdat. So we're going to need our path first. So just starting over here, I'm going to name this COVID example. And then we're just going to pass in the COVID data set, and then reverse. Then boom, three times. All right. So then we look at the bottom here. I have this sorted by size; we can sort it by date. And then we see we have covid_sas.sas7bdat. And now what we're going to do is just read in this data set.
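The read.csv step described above can be sketched as follows. So that it runs without the course files, this example writes a tiny stand-in CSV to a temporary location first; the column names and values are made up for illustration and are not the COVID data, and covid_demo is a hypothetical object name.

```r
# Create a small stand-in CSV (hypothetical columns and values)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("subject_id,result,age",
    "1,positive,34",
    "2,negative,51",
    "3,negative,27"),
  csv_path
)

# getwd() reports the working directory; a bare file name such as
# "covid.csv" is resolved relative to it, so no full path is needed there
print(getwd())

# header = TRUE treats the first line as column names;
# sep could also be set for files that are not comma-delimited
covid_demo <- read.csv(csv_path, header = TRUE)

# read.csv returns a plain data frame, so the printout has no type row
print(head(covid_demo))
print(class(covid_demo))

# str() gives the dimensions and column types in one compact listing
str(covid_demo)
```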
So we're going to read in our SAS file and assign that to COVID, underscore, SAS. And then we just reference the file we have here. All right. Now we're just going to print out the first few observations of our COVID SAS data set. Now you can see the printout differences here, right? Again, the read_sas function imports this as a tibble. And recall that we were working with it up here as a data frame, right? So that difference comes from what the function itself is designed to do within the package, right? The difference, say, is in how the data gets stored, right? More people are going to be using SAS for medical reporting, healthcare analytics, things like that, though mainly it's used for statistical analysis, right? Your use case for your data set matters a lot for how you store and structure your data. But for the most part, I would say more people are familiar with SAS being very nitpicky about how the data is structured, stored, and manipulated. I think our biggest pain point from SAS is formatting, at least for me. I don't think there's anybody who doesn't have a date9 story to talk about. All right. So just a quick note from Darren; he's one of the TAs here. He said that in RStudio, the broom icon also clears the stuff in the console. And in the Environment window it also deletes the objects. So if we want to just play around with that, we can see it offers to include hidden objects here. And then we're out of that. So that was a good point by Darren. Thank you. All right. So, yeah, we're doing all right on time. I think we'll take about a five minute break right now. If anybody has any questions, feel free to ask. We'll reconvene at 4:45. I'll just stop sharing there and let's call it. All right. So I hope everybody can see my screen again.
Hope you had a good quick break here. I think we're all set. Hope everybody is ready to dive into some dplyr and some pipes. All right. Let me just make sure. So from here, we're going to talk a little bit about data wrangling, so manipulating data with dplyr. Data is rarely ready to analyze, right? So wrangling data into the proper shape is critical for the task, right? Again, think of the use case for tibbles, right? Functionality carries over between similar data sets, but your data is often not very clean, so you're going to need to clean it up yourself for a lot of analysis. Some useful functions that we have are filter, select, arrange, mutate, transmute, spread, and gather. The two most commonly used are filter and select. Arrange and mutate are used pretty frequently. And the bottom three are more for reshaping and formatting things, or the bottom two at least. Yeah. So hopping into a little bit of a data set here, we'll talk about the pipe syntax. Recall that earlier we were assigning an object with this arrow indicator, right? In this case, it's a TB summary. And for this example, right, say we want to restrict a World Health Organization tuberculosis data set to just the countries that we're interested in, right? In this case, China and Afghanistan. Then we're looking to compute the incidence of tuberculosis cases, right? So annual cases per, say, 100,000 people. And finally we sort those results by decreasing incidence. The code below is an example of how to do that, right? But it's very hard to read what's going on, right? You have to read it from the inside out to see what you're looking at. And where you reference incidence, the negative sign is what gives the descending sort, so it's a little confusing. Essentially, it's not very clean code, is the gist of it.
A better way to approach this code is using the pipe syntax. In the previous case, right, we called all these functions nested within each other. Here, we're just saying, let's do something and then let's do something else, right? So the best way to think about the pipe syntax in any use case is to say to yourself "and then" at each pipe, wherever you're looking to perform the next step. The pipe is written as a percent sign, then a greater-than sign, like a little caret pointing at what you're doing next, and then a closing percent sign. It's useful for chaining commands together. And you can see this version is a lot more intuitive. Because we start from this table, it outputs the table as transformed by the code below. Previously, we actually had to call that table inside the nested functions, right? Here, we're just referencing the object and then stating what we want. In this case, we start with the WHO data set, table one, on that first line. Then we point to what we're looking to perform next. So, and then. In this case, we're looking to filter the data set. We're looking at the country variable, and we use a command that's similar across most languages, just in, written here as percent in percent. So when a value of country is within the given list, we keep those rows. The values here are Afghanistan and China. And then, right? At the end of that line, we have another pipe. We're going to mutate the data set. So we're essentially defining incidence as a new variable, and that's 100,000 times the cases, over the population. So here we're defining a variable on a data set that's already filtered for Afghanistan and China. And then we arrange that data set in descending order based on the incidence value that we just calculated.
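The nested and piped versions described above can be sketched side by side. This assumes the table1 example data that ships with the tidyr package (columns country, year, cases, population) is the "table one" on the slide; the object names tb_nested and tb_summary are hypothetical.

```r
library(dplyr)
library(tidyr)  # provides the table1 example data set

# Nested form: must be read from the innermost call outward
tb_nested <- arrange(
  mutate(
    filter(table1, country %in% c("Afghanistan", "China")),
    incidence = 100000 * cases / population
  ),
  -incidence  # negating the column gives a descending sort
)

# Piped form: read each %>% as "and then"
tb_summary <- table1 %>%
  filter(country %in% c("Afghanistan", "China")) %>%   # keep two countries
  mutate(incidence = 100000 * cases / population) %>%  # cases per 100,000
  arrange(desc(incidence))                             # largest first

print(tb_summary)
```

Both forms produce the same rows in the same order; the piped form simply lists the steps in the order they happen.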
So this is a lot neater to read. And you can see there's no real confusion about what's going on here. And this is how the sequential order of code works in most programming languages as well. I do a little bit of PySpark, and this is how I sequence my programming: understand what you're doing, define any additional variables that you need, and output what's desired, right? These are the basic programming patterns. So hopping to a different subject, we're going to talk about formatting a little bit. A common procedure for formatting in SAS is PROC FORMAT, right? There are multiple ways to do things in SAS, and there are multiple ways to do things in R as well, including similar functionality to PROC FORMAT. What is the use case for that, right? Formatting in SAS can be achieved through PROC FORMAT, and the formats that you define and reference in later data steps can be stored in a SAS catalog for later use. Otherwise, we can define the length of each variable in the data step, and then go through a lot of logical if-then statements to perform calculations on the variables that we want, right? So an easy example that I just wrote for this is a PROC FORMAT. We have values: we have the value one for a positive result in the variable class, right? And that value should be shown as positive instead. Then we have zero, and that should be shown as negative, right? So recoding variables is the use of PROC FORMAT in this basic example. And that carries over to what we're covering here in the course. So, the fmtr package, right? All lower case, f-m-t-r. This helps format your data, and it replicates functionality similar to what you would get from your PROC FORMAT statements in SAS.
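As a rough point of comparison for the PROC FORMAT recode described above, the same one-to-Positive, zero-to-Negative mapping can be sketched in base R with a named lookup vector. This is an illustrative analogue only, not the fmtr approach and not the slide code; class_code and the label spellings are assumptions.

```r
# Hypothetical coded variable: 1 = positive, 0 = negative
class_code <- c(1, 0, 0, 1)

# A named character vector acts as the lookup table:
# names are the stored codes, values are the display labels
class_labels <- c("1" = "Positive", "0" = "Negative")

# Index the lookup by the codes (converted to character so they match by name)
class_label <- unname(class_labels[as.character(class_code)])
print(class_label)

# A factor keeps the codes and labels tied together in a single object
class_factor <- factor(class_code,
                       levels = c(1, 0),
                       labels = c("Positive", "Negative"))
print(table(class_factor))
```

The lookup vector can be defined once in a shared script and reused, which is loosely similar in spirit to storing a format in a catalog.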
This package is part of the sassy package. It's actually all lowercase; I just like that name. And it provides some useful functionality for your data as well. Common functions are fdata, fapply, the formats function, fattr for attributes, the value and condition functions, fcat, and flist. What we'll be focusing on in this course is fapply and fattr. Those cover the basic functionality for the examples provided. The fapply function applies formatting to any given data vector, right? Say we wanted to shorten the decimal precision of the output, right? We can define that with fapply. And then fattr, right, will assign an attribute to a given object in R. So you can essentially assign a format attribute and then reference that format from within the object itself. If you wanted to apply that format, you would just call fapply, and the format is already there within the object itself, as twisting as that sounds. And then the bottom point is fcat. We'll look at an example of creating our own catalog for formatting and referencing it for later use as well. Let me just get a quick sip. So this is the fmtr package, and this is the fapply example. What we're doing here is creating a sample vector called sample.v, and we're concatenating those arguments here. You can see the precision of these decimal values is all different, right? They vary a little bit here. And what we're going to do is take the fapply function from the sassy package and apply it to our sample vector. We're just going to specify that we only want to see one decimal place in the output, right? It can have as many leading digits as needed.
So we have the wildcard symbol, the percent sign, and then a floating-point format specified there. And you can see the output is 6.4, 7.6, 1.1 and 5.7. So that will round the data to the appropriate decimal precision. It's useful functionality for quickly applying different formats. There are sapply and lapply in R as well, so those are other ways to approach this. But these are common ways to use packages within R that can replicate saved formatting, right? The real thing is saving your format. You can create a whole script for that. You can create a catalog that others can reference for whatever formatting they would want. Okay.
So the fmtr package again, with a second fapply example. Here's what I was talking about earlier, right? We're going to create a sample vector to format. We're going to take that same example vector that we had earlier. We're going to assign a format attribute to it. So we call the attr function, and we say we're going to take our vector here, sample.v, and assign a format to it. What we put in that format is the wildcard and then the .1 decimal precision as well. Finally, we take the fapply function and apply it to our vector, and we can see that the output is, again, what we would expect: a single decimal place and as many leading digits as needed.
And this brings us into the fattr example, for when we want to set multiple formatting attributes, or just any given attribute, I guess, on an object. They don't have to be numerically based. You can see on row two, that line of code, we have format, then width, and then justify.
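The two fapply calls described above can be sketched roughly as follows. The input values are invented so that they round to the 6.4, 7.6, 1.1 and 5.7 mentioned on the slide, and the name `sample_v` is an assumption rather than the exact name used in the course script.

```r
library(fmtr)

# Hypothetical values with differing decimal precision
sample_v <- c(6.3812, 7.6149, 1.1235, 5.7321)

# Option 1: pass the format string directly to fapply()
out_direct <- fapply(sample_v, "%.1f")

# Option 2: store the format on the vector, then call fapply() with no format
attr(sample_v, "format") <- "%.1f"
out_attr <- fapply(sample_v)

print(out_direct)
print(out_attr)
```

Both calls return character vectors, so the formatted result is meant for display rather than further arithmetic.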
So those are all different ways you would format a given data output, right? What we did here is assign a sample vector of those numeric values again. We took fattr and passed in the sample vector that we have above, along with the format, which is again a decimal precision of one decimal place. However, in this case, we're also passing in the width, so we're specifying how long each output can be. And then we're justifying that output to the right. We could justify to the center and have spaces on both sides, but for this example it goes all the way to the right. When we call the fapply function on our sample vector, we see that there's a lot of padding on the left wherever there are additional spaces within our width, right? That comes from all the attributes we defined with the fattr function. You can see we have five spaces here, and all the values are shifted over to the right in the output.
The fmtr package also comes with useful functionality for recoding. You'll recall that in the PROC FORMAT example earlier, we were recoding zero and one to negative and positive. That can also be reproduced with the fmtr package, and it's pretty easy to apply this to a whole data set if you want to as well. So the quick functionality here is useful. We're going to take another vector here. We're just going to call this sample.v2, and we're going to pass a bunch of strings to it. Then we're going to create a second vector that's going to be our lookup vector. This can be a bit more convenient on larger data sets, but for this example we're focusing on a smaller one. So we take our lookup.v2 example, and again we combine all these values using the c function.
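A minimal sketch of the fattr call with format, width, and justify described above. The values and the width of 8 are assumptions chosen so that a three-character result such as 6.4 ends up with five spaces of padding on the left; the slide's actual width is not stated in the recording.

```r
library(fmtr)

# Hypothetical numeric vector
sample_v <- c(6.3812, 7.6149, 1.1235, 5.7321)

# Attach format, width, and justification in a single call
sample_v <- fattr(sample_v, format = "%.1f", width = 8, justify = "right")

# fapply() picks up the attributes stored on the vector
padded <- fapply(sample_v)
print(padded)
```

Switching `justify` to "center" or "left" changes only where the padding goes, not the rounded values.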
We're going to define the legend that we have. In this example, we're going to say A is now equal to Group A, with a space between Group and A. Same thing for B and then C, so Group B, Group C. And now we're going to apply that lookup to the sample vector that we originally created, using the fapply function. So we can see we have fapply. We took our sample.v2 vector, and then we applied the lookup vector that we have. It'll generate the output automatically for you right there. If we wanted to assign that to a different object, we could; there are different ways to go about it. But for this example, you can see that the output for the sample vector uses the names, matches them up with the recoding values, and outputs the appropriate group for each element with the new recoded values. Okay.
So this one gets a little bit more complicated. We're not going to touch on it too much in the course, but just for a programmatic example: fcat. The fcat function reproduces similar functionality to the SAS format catalog. So if you're looking to store formats, and there are different ways to format your data, I guess you can say, this is how you would approach it. For this example, we're going to create a sample format catalog. You can see I have it named date.format, but it also has a numeric format in there as well. We reference the fcat function. We have our number format, which you can see here. This is the single decimal precision with as many leading digits as needed, just excluding the one there, but it gives the same output. And then we have the date format, which is just the standard way you would define date formats here. For more date formatting, lubridate is a good package to use in R. There are a lot of cheat sheets about how to format your data.
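The lookup recode walked through above can be sketched as a named character vector passed to fapply. The object names and the exact sample values are assumptions for illustration.

```r
library(fmtr)

# Hypothetical vector of short codes
sample_v2 <- c("A", "B", "C", "B", "A")

# Named lookup vector: names are the codes, values are the labels
lookup_v2 <- c(A = "Group A", B = "Group B", C = "Group C")

# fapply() matches each element against the names of the lookup
recoded <- fapply(sample_v2, lookup_v2)
print(recoded)
```

The same lookup vector could be reused on a data frame column, which is where this approach tends to pay off on larger data sets.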
But for this example, I formatted the date in the most common way that I've seen in the pharmaceutical industry, which is the output at the bottom of your screen, where we have day, day, month, month, month, and then year, year, year, year, which is this code here. Then we're going to write the catalog that we just defined out to our directory. You can see we have write.fcat. So we've created this object with our formats using the fcat function, and now we have a catalog defined within the date.format object. Then we write that out to our directory. That's all this code says: we use the write.fcat function after we define the catalog, and we output it to our directory. If we want to use those formats later, there's the read.fcat function, just a basic line of code. It'll read in the catalog, and you can see there are two formats within it. We have the single decimal precision, and then we have the date format as well. And these are the values for the given entries here. They're named num_format and date_format.
So just a quick example here. If you call the Sys.Date function, it'll just print out the date, so this is when this code was written. And then we can format that using fapply. We've read in our catalog, we take our system date, and we apply the date format that we wanted, which is again day, day, month, month, month, and a four-digit year. And now the output comes back as a string. That's useful when you have larger data sets as well and you just want to quickly apply a format to a given column, and have it available for later use by your work colleagues as well. Okay. So I think we're still okay on time. Now we're going to open up the example. I think it's called format. Okay, great. Yep. Okay.
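The catalog round trip described above might look roughly like this. The entry names follow the ones read out in the session (`num_format`, `date_format`); the object name `fmt_cat`, the use of `tempdir()` as the output directory, and the `"%d%b%Y"` pattern for the day-month-year layout are assumptions.

```r
library(fmtr)

# Build a small format catalog with one numeric and one date format
fmt_cat <- fcat(num_format  = "%.1f",
                date_format = "%d%b%Y")

# Write the catalog to a directory; tempdir() is used here only as a
# stand-in for the course working directory
cat_path <- write.fcat(fmt_cat, dir_path = tempdir())

# Later (or in another script), read the catalog back in
fmt_cat2 <- read.fcat(cat_path)

# Apply the stored date format to today's date; the result is a string
today_str <- fapply(Sys.Date(), fmt_cat2$date_format)
print(today_str)
```

Note that the month abbreviation produced by `%b` depends on the locale of the R session.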
So again, we just cleared our environment up here, so we have a nice clean window to work with. I'm just going to expand this out a little bit so we have more room to see the code. And again, let's make sure that we have our sassy package installed from the installation package R script. For this example, we only need to reference the fmtr package, so we're just going to reference that now. I'm just going to go ahead and execute some of these vectors. Oh, I've already defined it. Oh, we can just scroll down then. We just need the reference for the library.
So we're going to read in the exercise.v vector, and you can see the precision here. There's a large difference between some of the values. What we want to do here is assign this a given format, right? So we want to call the attr command. We call attr, we pass in our exercise.v, and then we pass in format. It's a similar example to what we have above. We can see our attr command up here, and we're going to be replicating that code down here. So we call that, and then we point to it with the assignment arrow, and then we want our percent sign. Actually, we want to open up some quotes quick. We want our percent sign, then one, then period, then two, and then F. Then we hit Control-Enter there and make sure we're all good on the execution. And now we apply that using fapply. All we call is fapply, and we pass in our vector. You can see the output below: we just have two decimal places in our output. And there are more examples here. This is the entire example from the slides, with your format catalog as well.
So if you have any questions about the coding there, the references are right above. And then we're just going to save that again, just because. All right. Now we're going to clear these objects and clear our screen. And if you guys have any trouble executing some of the code, feel free to reach out. I'm fairly confident that my TAs can help too. All right.
So this will bring us into a little bit of data exploration. Starting off with data exploration, all I have here is a box plot. You can see there's a lot of variability between the values here. Not the best visual, but it works as an example. PROC SGPLOT can be used to create a variety of plots. It can be used for histograms. I've also used it with ranks: there's PROC RANK, essentially, and then I use SGPLOT to plot my visuals after I get those out of there. So the main thing that I used to plot data in SAS, I would say, is PROC SGPLOT. In the basic example that you can see to the right, we have our PROC statement, we have the data that we're referencing, and then we just have the VBOX statement over our category. Then we run, or execute, that command.
Visualizing data in R is a little bit more intuitive, I would say. The visuals in SAS are a bit limited as far as the desired output goes, and the ODS statement can only do so much sometimes. Visualizing data in R is a little bit more free-flowing. You're allowed to manipulate things a little more easily, although there is a bit of structure here. The main package that gets used most often for visualizing data is called ggplot2. There was an earlier ggplot as well, but the standard now is ggplot2. And for visualizing data, we talk about something called the grammar of graphics. What that is is just a layered framework for building visualizations.
So you'll create your output, say your scatter plot. You'll apply a legend to the scatter plot. You'll apply labels to the scatter plot. And finally you'll define a title for the scatter plot. Those are all added line by line, in sequence, as opposed to SAS, where you have your header and your footer and your title up top. So it's similar functionality, but things like fonts are a little more free-flowing in nature, I would say. There are no limitations for the most part. And this is implemented in R's ggplot2. So we go through the layering process when visualizing: starting with the data set, things like the axes to use, labels, points, and groupings are added one at a time. It may not seem intuitive, but I think I take a similar approach when I'm creating outputs in SAS. I want to make sure it looks good first, and then I'll edit the labels and references where I need to. Generally, the default labels in SAS are the names of your axis variables, and R follows that same behavior until you overwrite them with a label statement in SAS or a label function in R. Okay.
So for a basic example of this, we're going to be looking at the starwars data set. This is an example data set that's provided. You can just reference it by name, starwars. All those packages that you've installed, or most of them, come with data within the package that you can use later. For this example, we're going to be going through some of the pipe syntax and the filtering there. So we have human_droid as our object, and we're looking at our starwars data set. Then we're filtering that down to species that contain the values Human and Droid, right?
And then we're going to drop all NAs from height and mass. To perform the plot, in the example below we first specify the data set and the x and y variables that we want. You can see we have our starwars plot now. We take our ggplot function and reference our data, which is the human_droid object now, so we have a subset of the data set. Then we apply some aesthetics to it. aes is short for aesthetics. For this example, the aesthetics are our x variable, which is defined as height, and our y variable, which is defined as mass. Then we can continue adding titles and axis labels and other things like that. We have our geom_point here. This references the actual scatter plot. Without the type of plot that you reference, we'd only have our data; with it, we have the plot that we're creating from there, right? We saw a VBOX statement in SAS. This plays the same role as the VBOX statement, but geom_point means a scatter plot. And the aesthetic we're applying to that is color, right? So we're going to color by each individual species, and that'll be the legend that we have. Then we simply apply a title, a ggplot title: mass versus height for humans and droids. Then we apply our labels to our x and y axes, which are height and mass. And finally we print out our output.
So you can see this is what the visualization looks like. There's not a lot going on for droids. We have a good amount of clustering going on for humans, and some extreme outliers. But for the most part, this visual is pretty sufficient for what you're looking at. You can take a deeper dive into visualizing in R, but for the most part, I provided the ggplot script for you.
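The filtering and layering steps described above can be sketched as follows. This assumes the `starwars` data set that ships with dplyr; the object names `human_droid` and `sw_plot` and the exact title and label text are illustrative, and the course's own ggplot script may differ.

```r
library(dplyr)
library(ggplot2)

# Keep humans and droids, and drop rows with missing height or mass
human_droid <- starwars %>%
  filter(species %in% c("Human", "Droid"),
         !is.na(height), !is.na(mass))

# Build the plot one layer at a time: data and axes, points, title, labels
sw_plot <- ggplot(human_droid, aes(x = height, y = mass)) +
  geom_point(aes(color = species)) +   # scatter points, colored by species
  ggtitle("Mass vs. height for humans and droids") +
  xlab("Height") +
  ylab("Mass")

print(sw_plot)
```

Because each layer is added with `+`, a line can be commented out or swapped (for example `geom_point` for `geom_boxplot`) without touching the rest of the plot.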
So if you're interested in seeing how to perform any type of visualization, you can reference that code. All right.
So, PROC FREQ. What can it do, and what are the limitations, right? It has an abundance of output statistics that you could possibly get, and it can serve a variety of functions. PROC FREQ is most commonly used to summarize data, right? You get counts and cumulative frequencies, and you can run a chi-square out of there from your observed versus expected. But essentially, it'll just provide a summary table with counts, frequencies, and cumulative frequencies of a given categorical variable. For this example, I've just written this little bit of PROC FREQ code. We're referencing our COVID data set, and we're ordering by the order of the data. The TABLE and TABLES statements are used pretty much interchangeably. For this example, we're just doing gender by payer group, and then we execute that. This is what the output would look like without any title or anything like that. You can see we have our frequency, our percent, the row percentage, and our column percentage for each payer group relative to gender. And you can see we have totals on the bottom as well.
So that brings us to the arsenal package. This is another package that's very useful when doing a lot of SAS-to-R programming, and it replicates a lot of the functionality that SAS provides. There are six main functions, and those are listed below. We have our tableby function, which summarizes a set of independent variables by one or more categorical variables. We have our paired function, which summarizes a set of independent variables across two time points.
We have our modelsum function, which is used to fit and summarize models for each independent variable with one or more response variables. So that's the multiple response output you might be looking for. Then we have freqlist, which is used to approximate the output from SAS's PROC FREQ. We have comparedf, which compares two data frames and reports the differences between them. That's similar to PROC COMPARE. I haven't found myself using PROC COMPARE except at a very, very basic level, so I don't know how useful that one would be, but it's there if you need it. But one that's very useful is write2. The asterisk isn't part of the syntax; it's just a wildcard symbol here. That provides a family of functions that let you output the tables you've created. So that replicates the ODS deliverable that you're looking for, in R relative to SAS. The ones we're going to focus on for these examples are tableby and freqlist, and then we'll output those and see how it looks.
So this is the arsenal package, and this is the freqlist example. Again, we just add our library reference there. We read in the data set there. We're checking our data set, so this is what it looks like. We've seen the COVID data set a little bit, but mainly what we're focusing on here are the categorical variables that we could potentially use. So, continuing the freqlist example, we're going to use the freqlist function on a table. You can see up here we have the table function, so we're creating a table in the background and assigning it to an object. Then we can call freqlist to get the summaries of that table. You can see we have the summary function in R.
But the summary function is generally used to give summaries of data frames and tibbles, right? Here you use it a little bit differently. So, just walking through the code, we assign a table. What we do here is reference the data frame that we're using, in this case COVID. Then we reference the comparator, I guess you want to call it, which in this case is result, so COVID positive or COVID negative. And then we compare those values against demo group and gender. These are both categorical variables as well, and we're comparing them against the result variable. What this does is help you find whether there are any odd distributions within your data set, right? When you're doing preliminary work before fitting models, it's very useful to see how the data is distributed across multiple levels of your variables. I think this is a great way to approach it. Maybe you will too.
So again, we're going to assign the output of the frequency list to a new object. We call freqlist, and we pass in the table we just created above. Then we have the option here to choose how we approach NAs, right? Those are values that just aren't reported. For this example, we're just going to include those. And then we print out the output. You can see this is what the output looks like. We have result, we have our demo group and the levels of that group, and we have our gender and the levels of that group as well. We have the frequency of every level and sublevel of all these categorical variables in the frequency column. Then we have the cumulative frequency for each row, based on count. As you can see, we have 17, we have 25. And then we have the percentage of the total values in the data frame itself.
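The table-then-freqlist sequence walked through above can be sketched as follows. The course's COVID file is not reproduced here, so `covid_demo` is a small made-up data frame, and the column names `result`, `demo_group`, and `gender` are assumptions based on how the variables are named in the session.

```r
library(arsenal)

# Small made-up stand-in for the COVID data set (includes one missing gender)
covid_demo <- data.frame(
  result     = c("negative", "negative", "positive", "negative",
                 "positive", "negative", "negative", "positive"),
  demo_group = c("patient", "patient", "patient", "client",
                 "client", "patient", "client", "patient"),
  gender     = c("female", "male", "female", "male",
                 "female", NA, "male", "female")
)

# Cross-tabulate the three categorical variables, keeping NAs as a level
tab_demo <- table(covid_demo[, c("result", "demo_group", "gender")],
                  useNA = "ifany")

# Turn the table into a PROC FREQ-style listing; NAs are included here
freq_demo <- freqlist(tab_demo, na.options = "include")

# The summary prints frequency, cumulative frequency, and percentages
summary(freq_demo)
```

Setting `na.options = "remove"` instead drops the combinations with missing values from the listing.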
So you can see there are large clusters going on for patient, for female and male within the gender group, and for negative. And then we have the cumulative percent going on, which is like a running total, to the right. Very useful function. Okay.
So we're going to go through a little bit of the programming example here, and then we'll touch on modeling some data. I intended this course to run about two hours. We might go a little bit over, but hopefully you learn a little something here. All right. So let's close this. The next example I have is the freqlist examples file. This one. We're clear here. We just want to make sure we have our library reference. We want to read in the COVID data set again. We've cleared our variables. We'll just check the structure of it. And if we just wanted to check a printout of the values from the previous example, we can see how it would look within our window here. All right.
So, on to the exercise. We're going to create a table. We're going to call this my first table. We're going to be referencing our COVID data set, and from there we're going to open up some brackets and reference what we want, which is result. So it's essentially what you're looking at up here, and you can use that for reference as well. What we want is the result, and we want the responses based on payer group, so payer group, and we're going to use gender as the other category. Then we just submit that, and everything looks good there. For the second step, we're going to define our freqlist. We're going to call that function now. We'll call this my first freqlist, we call freqlist, and then we pass in my first table.
And you can see it will autocomplete this for you. We're just going to leave out the NA options, and then we submit that. Now we're just going to print out a summary of the frequency list that we created. We execute that, and this is an example of how you would create this output. So essentially what we did with our table was define a table of result, payer group, and gender, and then we defined a frequency list of that. For this example, we've removed the NA options, so it's a little bit different from what we saw above. But we can see the output is very convenient, neat, and useful, right? All right. So how'd that go over here?
So this is the last segment I have for the course. We're going to go through one more example, and then I'll walk through how you would approach a generalized linear model in R. There are different ways to approach modeling as far as the procedures you're going to be using. I think in SAS there's PROC LOGISTIC. There's also a PROC GLM that can be used for things like logistic regression and stepwise. We're going to go through the glm function in R at the end, and we're going to go through some stepwise selection using the AIC as the criterion for moving each parameter in and out of the model, which is pretty common in this case.
So, coming back to the arsenal package, again, it's very useful. It has a lot of functionality. Let's take a look at the COVID data set again and say we wanted to test the levels of an explanatory variable against the response variable, result. The following would work for that, right? This will tell you, relative to the distribution of the variable you're comparing, whether there's any significance for given levels or for the variables themselves. It's a little bit more detailed, but it'll essentially fit a test for you.
And what you're looking at with this tableby function is that we define an output object. We have tableby. We have the result that we're interested in, which in this case is the response variable. And then this little tilde here is generally what you use when building a model formula in R. To throw everything at it, we would just remove patient class and gender and put a period, and that would throw in all of the variables in the data set. But for this example, with tableby, we're just looking at a categorical variable comparison relative to the distribution of each categorical variable. So we're looking at result, and we're comparing it against the patient class and gender values within the COVID data set. Then we're just going to print out the summary as well.
This is what the output looks like. You can see it's a little bit different from the freqlist example that we saw. However, excuse me, something you can notice is that we do have a measure of statistical significance for these variables here. The question is, why would this be significant? We can see why gender is insignificant. The distributions are fairly even throughout; they're just level across the entire variable. But here, we have some subsets. This little subset here, inpatient, could be driving the significance relative to the entire distribution, or we have the 15.8% here for emergency. So the values you want to highlight are the differences between the levels of each individual group, and that will indicate whether or not the variable is likely to matter when you're defining a model, as far as coefficient weights go. So for me, I think this is very useful for exploratory work and for looking at the significance of the levels of the categorical variables that you see. Yeah.
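A minimal sketch of the tableby call described above, on made-up data. The data frame `covid_demo`, its values, and the column names `result`, `patient_class`, and `gender` are assumptions; the counts are constructed so that patient class differs between the result groups while gender does not.

```r
library(arsenal)

# Made-up stand-in for the COVID data: 40 rows, two predictors
covid_demo <- data.frame(
  result        = factor(rep(c("positive", "negative"), each = 20)),
  patient_class = factor(c(rep("inpatient", 12), rep("outpatient", 8),
                           rep("inpatient", 4),  rep("outpatient", 16))),
  gender        = factor(rep(c("female", "male"), times = 20))
)

# Left of the tilde: the grouping (response) variable
# Right of the tilde: the variables to summarize, joined with +
tab_by <- tableby(result ~ patient_class + gender, data = covid_demo)

# text = TRUE prints a plain-text version suited to the console
summary(tab_by, text = TRUE)
```

For categorical variables like these, the p-value column comes from a chi-square test by default, so it describes association in the cross-tabulation rather than a fitted regression coefficient.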
And then there are also ways to define the threshold, the confidence level as far as the p-value goes. Those are within the package. If you're interested in defining your own alpha, I recommend you look into that as well.
So the final thing from the arsenal package that I wanted to touch on in this class is something that's pretty useful. It's similar to the ODS output in SAS. It's the write2 family of functions. The write2 family includes HTML, Word, and PDF for output tables. So again, only for output tables. If you're looking to create a full output from your R code, a better solution in most cases is going to be R Markdown. But if you're looking for quick outputs of tables and you're already in the arsenal package, this is just easy to reference and output for anybody else who wants to look at your data analysis. So let's look at an example of the write2pdf function. It's a very simple command. We saw our output table from above, so we have the summary of the output for this table right here. All we're going to do is call write, then the number two, then pdf. We pass in the output table that we specified before, and then the path that we want to write it to. And don't mind the parentheses here. It's the naming convention. You can name it whatever you'd like. If we look here, all the way at the bottom, we can see the output table PDF. So that's what it would look like when it shows up in your directory.
And now we're just going to go through a quick walkthrough of tableby. So if you want to just open up that code. You can see I have a reference here for the rmarkdown library. Just make sure it's there. The rmarkdown package is needed for the write2 family of functions to work here.
So what we want to do is reference both of these libraries. Then we'll read in our COVID data set, and we can print that out just to make sure everything looks good. All right. What we're going to do here is, again, replicate what we have going on up here. So if you need any code to reference, it's there; we're not going to be changing too much. We're going to call this output2. We point to that, and then we call the tableby function. Within tableby, we put in result. We put in the tilde. And then we're interested in patient class and demo group. So we pass in patient class, and we add a plus sign to reference additional variables when we're modeling, similar to how you would write a formula with coefficients. Just think of the tilde as the equals sign. Then we go to demo group. And then we just want to make sure that we reference the COVID data set. That should be good to go. You can see we do have our output there, which is a list of three.
Now we're just going to check the summary of output2, and we're going to set text equal to TRUE. You can see we have the output that we want. You can see the significance in both of these, which is great. That's what you like to see. And now all we're going to do is use write2html. The reason for that is that we want to see what the output looks like outside of the RStudio console, right? Or IDE. So we use write, then the number two, then html. We pass in our output2, and then we name the file something like output2_html.
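The export step described above might look roughly like this. The data frame is made up, the output path points at `tempdir()` rather than the course directory, and the pandoc check is an added safeguard that is not part of the session's script; write2html renders through rmarkdown and so needs pandoc to be available.

```r
library(arsenal)
library(rmarkdown)

# Made-up stand-in for the COVID data
covid_demo <- data.frame(
  result        = factor(rep(c("positive", "negative"), each = 20)),
  patient_class = factor(c(rep("inpatient", 12), rep("outpatient", 8),
                           rep("inpatient", 4),  rep("outpatient", 16))),
  demo_group    = factor(rep(c("patient", "client"), times = 20))
)

# Same tableby pattern as in the walkthrough
output2 <- tableby(result ~ patient_class + demo_group, data = covid_demo)

# Hypothetical output location; any writable path works
out_file <- file.path(tempdir(), "output2_html.html")

# Only attempt the export when pandoc can be found
if (pandoc_available()) {
  write2html(output2, out_file)
}
```

The resulting HTML file can be opened in any browser, which is what makes it convenient for sharing with colleagues who do not use R.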
And once this executes, you can see when it's called that we have the rmarkdown reference here. So packages built on top of other packages add maybe a little bit of niche functionality, and this does provide that a little bit. When you're moving over from writing SAS code, something that was almost missing a little bit, I guess, was that quick, same-package, in-and-out functionality. The two packages that I've seen that do the most are the sassy package, which is a little bit more advanced as far as functionality goes, and the arsenal package, which again is very, very useful for providing some quick insight on your data sets, with similar functionality as far as categorical data is concerned.
So what we're going to do is look at the output2 HTML here. You can see it has two .html extensions here. We're just going to hit view in web browser. And this is what the output looks like, right? So that's pretty neat. If you wanted to send this file to somebody, they could just open it. It's a quick reference for somebody else looking at your data, and it allows for pretty easy sharing, I would say. Let me just clear this. I don't know how teachers do it. Sipping water all day.
So for a basic example of logistic regression, we're not going to go through training and test sets. We're just going to go through how you would fit a basic model to your data using the glm procedure, or function, sorry. So let's look at the breast cancer data set. I provided that data set for you, but you can just look on the screen to see what it looks like. So Ann Reiner asked, and everybody can see that in the chat. Sorry for calling out your name, but it's about the write2html function. These functions work specifically with tables.
If you're creating any type of table, as long as you use that table function, so tableby and then the parentheses with whatever goes inside, you can use the write2 family of functions: write2html for this example, or write2word or write2pdf. As long as it's a table, it can be used. If it's not, you're going to need different functionality for those outputs. For pretty much anything you're looking to do, I recommend using R Markdown first. That gets a little more complicated as far as saving your results, but if you're not looking to format your output too much, there are some simple commands that are pretty basic to use, and there are some nice resources on CRAN for that. If you're looking to create a full-on report, I'd probably recommend the sassy package called reporter. That allows you to really nitpick how you justify tables and things like that. There's a lot of functionality for outputting your data; this is just the most convenient for tables only. Thanks.
Going back to the logistic regression, this is the final example we have for the class. We're going to look at the breast cancer data set, and we want to perform a binary fit to the response variable called class, which in this case is benign or malignant. We can see that on the bottom here. Then we have a bunch of variables: ID, thickness, cell size, cell shape, adhesion, and so on. Looking at it right now from a broad standpoint, we don't really need ID. That's a variable we're not interested in. Maybe even mitosis; I can't tell too much about the layering of that variable. If there are enough levels, we could dummy that variable a little bit.
That is, add levels to hopefully get a somewhat better analysis out of it. But for the most part, we really don't need ID. For binary response modeling in R, you're going to need to recode variables. Class is all strings, benign and malignant, and we need to recode it to the numeric values zero and one. Something that's useful when you're modeling is to print out the data and then look at its summary statistics. If you want more descriptive summary statistics for a data set, I use the psych package when I need something more detailed. Instead of the summary function, you use a function called describe and put your data set inside it. It creates a similar output, not better, I would say, but with some more advanced statistics such as kurtosis and skewness.
So this is a basic example of how you would model your data. We're going to create a model object called glm_fit. We use the assignment arrow to point to what we're assigning, and we assign it using the glm function, so the generalized linear model function. I know you're thinking linear model, while this is a binary response and we're essentially building a categorical model. What dictates how the model is approached is the family argument. We can use a binomial family, we can use a Poisson distribution, and there are a number of other modeling options you can approach it with. For the most part, the generalized linear model can take in a lot of different types of data for a single given response. It's the same type of deal in SAS, although there the closest analogue is PROC GENMOD rather than PROC GLM, which fits general linear models.
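To make the role of the family argument concrete, here is a minimal sketch using mtcars, a data set that ships with R, rather than the course's breast cancer file. The choice of variables (am as a 0/1 response, carb as a small count) is only for illustration.

```r
# Binary response: am is already coded 0/1, so family = binomial gives logistic regression.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Count response: the same function with family = poisson gives Poisson regression.
pois_fit <- glm(carb ~ wt, data = mtcars, family = poisson)

# The family object records the distribution and the link function that glm used.
logit_fit$family$family  # "binomial"
logit_fit$family$link    # "logit"
pois_fit$family$link     # "log"
```

The fitting call is identical in both cases; only the family changes, which is the point made above about one function covering several kinds of response.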
What we're going to do here is pass in class first, because we want to start with our response. Recall from earlier that we have our tilde, and then we have a period here. The period says we're going to throw everything into the model and see how it works out. It's not the best approach, but I know we've all done it at least one time, just to get a basic fit. Then we supply our data argument, so data equals breast cancer for this example. After that we check the summary of the fit, to see what statistics come out of this model at a basic level. This is what the output of that summary looks like. Something useful you'll see is the deviance residuals here. We have a large spread between the min and the max. It's not horrible, but it's not great, especially for a zero-or-one classification. Then we can look at our individual parameters, the coefficients. Here are their weights, the estimates, then the standard error, the z value, and the p-value for that z, which tells you how significant each estimate is. On the bottom we have the significance codes. Zero to 0.001 is given three asterisks, so that's like 99.9%, and then 99, 95, and 90 for the remaining levels, and then one. That's the scale you'll see. We also have some additional statistics: the null deviance, the residual deviance, and our degrees of freedom. We also have 16 observations that were deleted because they were missing.
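The pieces of the summary output described above can also be pulled out one at a time. The sketch below uses a simulated data frame (toy, with invented columns thickness and cell_size) instead of the course's breast cancer file, and blanks out five values on purpose to show how glm drops incomplete rows.

```r
# Simulated stand-in data; the column names are assumptions for illustration.
set.seed(42)
n <- 200
toy <- data.frame(
  thickness = sample(1:10, n, replace = TRUE),
  cell_size = sample(1:10, n, replace = TRUE)
)
toy$class <- rbinom(n, 1, plogis(-4 + 0.4 * toy$thickness + 0.3 * toy$cell_size))
toy$cell_size[1:5] <- NA  # five incomplete rows, dropped by glm's default na.action

# "class ~ ." means: model class using every other column in the data frame.
glm_fit <- glm(class ~ ., data = toy, family = binomial)
fit_summary <- summary(glm_fit)

# Coefficient table: Estimate, Std. Error, z value, Pr(>|z|).
fit_summary$coefficients

# Null deviance, residual deviance, and AIC, as printed at the bottom of summary().
c(null = glm_fit$null.deviance, residual = glm_fit$deviance, aic = glm_fit$aic)

# Number of observations deleted due to missingness.
length(glm_fit$na.action)

# Just the estimated weights.
coef(glm_fit)
```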
Then we see our final output right here. Generally when you're modeling, you have your AIC score, which is used for measuring how effective the model is. A lower score is a better score; that's the gist of it. Below it is the number of Fisher scoring iterations the fitting algorithm took. So this just built a basic model. It didn't go through removing any of the parameters. This is the best fit relative to the weights for the model that we've defined, if that makes sense. Generally we want to see low standard errors and confidence in our estimates. The intercept better be pretty good, but we want to have confidence in most of our parameters, or else we can remove them. We also want to check the coefficients of our variables. Something that's easy to reference here is the command coef. We pass in the model that we just fit, glm_fit. You can see we have an intercept here, our negative 10, and then the weights of each individual variable. Maybe thickness and mitosis are related; maybe they share some covariation, because the weights are relatively close and in the same direction as well. So you can gauge how your model is performing here in terms of relative fit. And that's all I had for this course.
For the basic example I provided in the logistic regression file, let's just run through that quickly. Oh, there's a useful command in here called colnames. Sometimes when you're iterating through columns, it's useful to run that. We have our head function here. Here are the summary statistics that we created. Here's what I was talking about with removing variables that were not really needed in the data set. I went through it, looked at it, and said we definitely don't need ID, because that's not going to do anything for our model.
It's just a value that's creating variance that isn't useful, so it can get removed. If we highlight this and run it, we can see that ID is no longer there. Then there's something you'll notice. Again, we're just fitting a basic model here. Recall that earlier I said we need the response variable to be binary and numeric in nature, or else it won't be classified accordingly. So when I execute this code with this data set, just minus the ID variable, I'm given this error. As you would in SAS, you read the messages to understand your issue and how the program executed. Sometimes you're given a bunch of warnings while it's iterating, especially in model building. You can see that the y values must be greater than or equal to zero and less than or equal to one. So for binary classification, especially with a binomial family, the response needs to be numeric, and it's useful to see how we would change and recode these values. Let's go through some iterations here. Say I want to reference the class variable in the breast cancer data set. I call the breast cancer data set, and to call a column you use this dollar sign. You can see it lists all the columns that I have. If I want to look at class, I can just select that, and it prints out all the values of class. If I only want to print out the first few, this is what it looks like. So to reference a column, you use that dollar sign: the breast cancer data set, then the dollar sign, then whatever column you're interested in. In this line of code I have the data set and I have the column reference. However, I have this additional code segment here.
This is within brackets because when you're going to reference a value within a column, you first reference the column and then you reference the value within it. You can think of it like a sandwich. I have bread, and then I also have bread. If I'm looking at the top of the sandwich, I have tomato or something, and within that I also have a seed of the tomato. Say I want the seed. I need to reference this data set here, and then I need to reference what's within that. If I wanted to recode seed to no seed or something instead, I would need to access that element. So in order to access that element and recode these values, we need to add the second layer. Same thing: we reference the data set and the column, and then where that value is equal to something, using the double equals there. For benign it's zero, and for malignant it's one. So I go through recoding the values. After we recode, this class column is still a string, or character, so it needs to be converted into a numeric data type. What we do here is take the column with the recoded values from above and apply as.numeric, and that converts the entire column as long as we reference it appropriately. We can do that here, then use head to check it, and then we fit our model. So now we have our model fit, and we have our coefficients from here. To run an additional stepwise selection, we take the fit, use the piping syntax again, and call the function stepAIC. We don't necessarily need to know how it got there, so we turn that trace output off.
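Put together, the recoding and stepwise steps described above might look like the sketch below. The breast_cancer data frame here is simulated, with invented columns (thickness, cell_size, noise), so the numbers will not match the session. The session piped the fit into stepAIC; this sketch calls it directly, and falls back to the base step function if the MASS package is unavailable, which is an addition for portability rather than something shown in the course.

```r
# Simulated stand-in for the course data; column names are assumptions.
set.seed(7)
n <- 300
breast_cancer <- data.frame(
  id        = seq_len(n),
  thickness = sample(1:10, n, replace = TRUE),
  cell_size = sample(1:10, n, replace = TRUE),
  noise     = rnorm(n)
)
p <- plogis(-5 + 0.5 * breast_cancer$thickness + 0.4 * breast_cancer$cell_size)
breast_cancer$class <- ifelse(rbinom(n, 1, p) == 1, "malignant", "benign")

# Drop the identifier; it carries no information about the response.
breast_cancer$id <- NULL

# data set, then $column, then [condition] selects the values to overwrite.
breast_cancer$class[breast_cancer$class == "benign"] <- 0
breast_cancer$class[breast_cancer$class == "malignant"] <- 1

# The column is still character ("0"/"1") at this point, so convert it.
breast_cancer$class <- as.numeric(breast_cancer$class)
head(breast_cancer$class)

# Fit with every remaining column as a predictor.
glm_fit <- glm(class ~ ., data = breast_cancer, family = binomial)

# Stepwise selection by AIC; trace = FALSE hides the step-by-step log.
step_fun <- if (requireNamespace("MASS", quietly = TRUE)) MASS::stepAIC else stats::step
step_fit <- step_fun(glm_fit, trace = FALSE)

# Compare the two models; lower AIC is better.
c(full = AIC(glm_fit), stepwise = AIC(step_fit))
coef(step_fit)
```

One caution: as.numeric on a character column of "0" and "1" gives 0 and 1, but on a factor it returns the internal level codes, so it is worth checking the column type with str first.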
If we execute that, we can print this out, and we can see we did improve our AIC score by three. So not great, but we have variables that are significant now, and that's the important thing. You can't always go by a single given score. You want to see the effectiveness of the variables and the weights of the coefficients, which we can see here. There is a little bit of similarity between these weights, but just eyeballing it, maybe for thickness and mitosis, I wouldn't say there's a significant amount of covariation between the residuals. Obviously a deeper analysis would need to be performed for that, but weights that are in accord would be nice, and it seems like we have an appropriate fit here. Oops, sorry.
We may want to check our residuals too. Something that's useful: if I go to the glm fit object and hit the dollar sign, that fit object has a bunch of values you may find useful. We have weights, prior.weights, df.residual, df.null, y, converged, boundary, na.action, and contrasts. There are just a bunch of values available. Then we have a basic ANOVA here, so you can see that as well. The statistics you're looking for are generally available right after you fit a model. Whether it's a good model is a little bit up to you, but for the most part people rely on similar statistics. For this model that we've already built, I've assigned the residuals from the fitted model to an object. Now I'm just going to look at the summary, and you can see there's something weird going on here. This minimum value is minus 496.
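For reference, the components listed above can be inspected directly on any glm fit. This is a small sketch on mtcars, which ships with R, not the course model, so the residuals here will look nothing like the minus 496 seen in the session.

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Everything reachable with fit$..., including weights, prior.weights,
# df.residual, df.null, y, converged, and boundary.
names(fit)

fit$converged    # did the fitting algorithm converge?
fit$df.residual  # residual degrees of freedom: 32 rows minus 2 coefficients

# Analysis-of-deviance table for the fitted terms.
anova(fit, test = "Chisq")

# residuals() returns deviance residuals by default for a glm.
res <- residuals(fit)
summary(res)

# Quick look for outliers: residual value against observation index.
plot(res)
```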
But compare that with the rest of the values. That minimum, and maybe the max, is significantly different from where we are in the center of this distribution. So maybe we just want to see that. For a quick visual, I always throw out a quick plot function: plot, and then the residuals you just saved. You can see that for the most part this data lies on a decent fit. There's no title here and no index labels; we just have the observations and the fit of the residuals. We only have one real outlier. I wouldn't be so concerned with that as a final output: one outlier in this entire data set, which is a relatively large data set of 15,000. Okay. That's all I had. Does anybody have any questions? Okay. All right. Well, I want to thank you all for coming to this course. I hope it was informative.
I have a question. Thank you, Joe, for the presentation. We're in the process of transitioning from mainly SAS to R, and there are some days where I just want to do a quick, let's say, proc, let me think, just a quick DATA step or function in SAS. I'm trying to replicate that in R, but I just can't figure it out. So what resources do you have for someone who needs to replicate some of their SAS skills in R?
Yeah. So the two best, that's the right one. These two books are Advanced R. This book is free. There's also an R solutions book. Both of these will cover the basic analysis that you need, maybe more so the advanced book, and I can put this in the chat. There's more on this GitHub page.
If you explore this GitHub page, and if you explore these other books, this will help you with your analysis (maybe that link doesn't even work) if you're looking to replicate anything in R. Okay. That's relative to SAS, because it's not going to be a one-to-one thing. I think there's a lot of false perception about replicating an analysis directly from SAS to R. Do you have an idea of what you were looking to do?
Some things are not just analysis. For example, today I was trying to do a regular if-else statement in R, and I had to Google for a couple of minutes to figure out the exact way to do that. So I don't know if there's a book out there that says: you already know how to do this in SAS, and this is how you would do it in R. Just something like that.
Yeah. So there are the function operators in this book, which might be useful. I think you really don't need to read the introductory one. You can see I've covered some of the material that's in this book, like the functional programming material and if; these are just basic functions that are in the advanced book. I would say you can reference this book quickly, and it's freely available online. For the most part, you're probably going to use this a lot if you're looking for any R books. Is it this book? Yeah. I think this book is super popular, and you'll probably use it a little more. There's also a book I would recommend. I can just show it to the camera, I guess. Okay. The book is called An Introduction to Statistical Learning with Applications in R. This is more for the statisticians, but it has been one of the most useful books I've ever had. Let me just put the name of that in the chat too.
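Since the if-else question came up, here is a minimal sketch of the two forms most often needed in R. The variable names and cutoffs are invented for illustration. The if/else statement tests a single condition, while ifelse works element by element over a whole vector, which is usually the closer match to an IF-THEN/ELSE inside a SAS DATA step that runs once per row.

```r
# Scalar control flow: if/else evaluates one TRUE/FALSE condition.
x <- 7
if (x > 5) {
  size <- "large"
} else {
  size <- "small"
}
size

# Vectorised version: ifelse() tests every element and returns a vector.
scores <- c(2, 9, 5, NA)
labels <- ifelse(scores > 5, "large", "small")
labels  # missing inputs stay missing in the result
```

In practice ifelse is typically used to build a new column, for example assigning its result to a column of a data frame with the dollar sign.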
The reason I recommend Hadley's book, obviously he wrote R, but it's also accessible on a computer relatively quickly. Whenever I find myself not understanding some type of functionality, well, you're going to run into errors. I run into errors when I'm working in SAS, and I run into errors when I'm working in R. It happens. If you need to know why, your ability to debug is your main tool. But there's not a one-to-one direct conversion, I would say. I would find the things you use the most and try to get a little more in control of those. If you're creating a lot of looping functions in R too, I would create a templated script that you can reuse for the most part, because rewriting it each time is going to create problems for you, I think, especially if you're a new learner.
Thank you so much.
So those are the authors. Oh, right, yeah, sorry, Darren's right. Hadley was one of the main contributors to R. He wasn't the one who originally wrote it. R derived from the S language, I believe, so I can't attribute that to him. Thank you, Darren. I like Darren a lot here. I hope you all had some fun. Are there any more questions for the course? I hope this was insightful for you as well. All right. With that, we'll call it a day. Thank you all for coming to the course, and enjoy the rest of your week. Thank you to all my TAs; I appreciate it.