So, welcome to the R workshop. Today is an introduction to R. The term R refers to both the programming language and the software that interprets the scripts you write in it. And although the learning curve may be a little steeper than something like Excel or other statistical software, the results you create in R don't rely on remembering what order you clicked on things or what you typed. Instead, your commands in R are completely documented and reproducible. What we're using today for this workshop is called an Rmd file, an R Markdown file. You'll be able to document your code in here as well as run it. I'll tell you a little more about it in a minute, but this is all in order to support reproducible science. So today we're using RStudio, which is a free desktop app that gives you access to R. You've downloaded both RStudio and R, and they work in concert in today's workshop. There's information here: if you haven't yet installed RStudio or R on your computer, you can use these links to do so. To use RStudio, you need a computer running Linux, Windows, or macOS. You can't use a Chromebook or an iPad. Okay, so once you've gotten to RStudio, let's take a look and get you oriented. Once you open the Rmd file, you will have four panes. You have the editor here on the top left. Below it you have the console. This is where you'll see code running; this is where the actual work happens. Here at the top right is the environment. Once you create objects and variables, you'll see them pop up here. And then at the bottom right is where help is available, along with packages, sometimes plots, and your files.
And right now my RStudio is in the default color mode, but if you like, you can go to Tools, Global Options, and then Appearance, and change the way your RStudio looks. There are various color themes, so pick whatever is comfortable for you as you're working, and you can make the font size bigger. Great, so I'm going to cancel that actually, but I do want to make the font size bigger. Global Options, Appearance, and let's zoom in to 175. Oops, too much. Let's go 125. There we go. So hopefully this is a bit easier for you to read. You can also use Command plus or Control plus to make the text bigger. That's a shortcut. Great, so the document here was provided to you in the zip file that we sent you before the workshop. This is an R Markdown doc, and it allows you to work in that reproducible way. You have the code, which sits in what are called code chunks, and you also have descriptions of the results in the same file. The gray sections that start and end with three backticks delineate a code chunk. So it looks like this, but I'm actually going to scroll down, and you can see that when you create a code chunk, it will be this gray color. That means it's a code chunk. And you can see we've also written text outside of the code chunk, just to describe what we're doing. You can do the same thing: comment on your code to help future users, you or others, understand what you're doing by adding these explanations. You can add a code chunk to an Rmd file using Code, Insert Chunk, or you can type Option Command I on a Mac or Control Alt I on Windows. And another cool thing about the Rmd file is that once you're finished, you can do what's called knitting it. So you can click this Knit button, click that down arrow, and you can knit to HTML, knit to PDF, or knit to Word.
And that will create a document that you can easily share with others, which includes your code chunks and your explanations of your code. I'll demo that at the end of the workshop. But the knitting only works if your code doesn't have errors in it, so we'll make sure that you haven't left any errors in there. Another thing about these R Markdown files is that they're spell checked: you get a little red squiggly line to let you know if something is misspelled. RStudio also lets you execute commands directly from a code chunk in your document using Control Enter, or Command Return on a Mac. If you place your cursor on the line in the code chunk you want to run and hit Command Return, it will execute just that line. I'll show you. Or you can execute code in the console. So here, we can do one plus two, and that will tell you the answer. But if you type code here, it doesn't get saved into your document. It's lost once you close the R session. If you type code into the code chunks in your Rmd file, it will be saved when you close the session. So that's another way to promote reproducibility. While we're down in the console: if R is ready to accept a command, you'll see here on the bottom left a little blue greater-than sign as the prompt. As we saw, it will execute a command and then go back to that prompt to show you that it's ready to take another command. If R is still waiting for you to give it more instructions, there will be a plus sign in the console instead. That tells you that you didn't finish your command. It may be that you forgot to close your parentheses or a quotation mark. And if you're unsure what went wrong, you can click inside the console and press Escape. That way you can start over and figure out where you went wrong.
Okay, so now that we're sort of oriented in the system, let's start with some really simple calculator practice. We have our code chunk here, and we can either click the green arrow to run the entire chunk at once or, as I'm going to do, put the cursor next to one of the equations and press Command Return or Control Enter. It will run just that line and tell me one plus two equals three. Now let's try the green arrow. We'll click there, and it will run all the code in this chunk. So it's saying 16 times nine, using that asterisk, is 144. It can do more advanced functions like the square root of two, and 20 divided by five using the slash. And you can also use decimals: 18.5 minus 7.21, which equals 11.29. So we did our calculator practice, and next we're going to create objects. To do that, you write a variable name and then you assign it a value using this assignment arrow, which is this backwards arrow. You can type that in manually with a less-than sign and a minus sign. There's also a shortcut, which is mentioned below: pressing Alt and the minus key together makes an assignment arrow, or on the Mac it's Option and the minus key. Cool. Once we execute a line of code like this, you'll notice that a new object appears in your environment. So let's give it a try. Okay. And we want to mention that you can also use the equals sign for assignment, but it's a little more complicated. There are slight differences in the syntax, so it's best to use the arrow. That's the best practice nowadays. No, that's an assignment arrow. Yeah. Yeah, you're saying we're assigning x the value of six. Right. As we get used to the syntax of R, these are different ways to make it work. So another thing is that R is case sensitive.
So if your variable is cat with a lowercase c, and you try to run Cat with an uppercase C, you'll get an error saying that Cat doesn't exist. You also want to name your objects in a way that's explanatory, that you'll remember, but that's not too long. So current_temperature versus current_temp: do you really want to type out temperature every time? And then finally, you can't begin an object's name with a number. You can end it with a number, so you could write clean_data2, but think about reproducibility for yourself and others. What is the meaning of clean_data2 as opposed to clean_data? Do you mean clean data version two, made on a certain date? So as you're creating variables, think about what is going to be easy for you and others to understand. The name can't contain punctuation except for the underscore or a period, although periods are not recommended. And you shouldn't give your object the same name as any common function that you might use, for example mean, or sd, which is standard deviation. Using a consistent style will make your code clearer to read for your future self and your collaborators. So let's try this in our code chunk here. Let's assign x the value six. And you can see up in the right-hand corner, you now have x equal to six in your environment. Once the object is created, you can use it. So try on your computers running these lines of code below, 2.2 times x and four plus six. You could either do Command Return or Control Enter, or you could click that green arrow. Are you able to do this? Here. Is there a laptop in the library that lets you download RStudio? Okay. Posit Cloud. It's like a cloud version of R that you can use on the library laptops. Okay. And then we can also overwrite an object's value to give it a new value.
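As a rough sketch of the assignment and naming points just covered: the lines below assign x the value six, use it in a calculation, and check case sensitivity. The object name current_temp is borrowed from the naming example, and its value and the `4 + x` line are assumptions for illustration, not lines from the workshop file.

```r
# Assign the value 6 to an object called x using the assignment arrow.
x <- 6

# Once the object exists, it can be used in calculations.
2.2 * x
4 + x

# Names are case sensitive: current_temp and Current_Temp are different names.
current_temp <- 68            # illustrative value only
exists("current_temp")        # TRUE
exists("Current_Temp")        # FALSE, because no object with this capitalization was created
```

Checking with `exists()` avoids triggering the "object not found" error while still showing that only the exact spelling is known to R.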
So in the next chunk, we create a new object y and give it the value of x plus six, right? Since we know that x, as it shows here, is six, we expect x plus six to equal 12. So now y has the value of 12 up there. But then what happens when you run this line assigning x a new value of two? Here we're reassigning the variable x the value of two. So this leads us to exercise one. What is the current value of y? Is it 12 or is it eight? Create a code chunk, well, actually, we've already done that for you below, and insert some code to display the value of the object y. Or no, it's on mine; maybe it's not on yours. I'll give you just one minute to create a code chunk. Yeah, it's line 194. Oh, thank you. That's true, there are multiple ways to find out what it is. But what you're saying is that it's not going to update. Exactly. If you want to update the value of y, you need to run the assignment for y again. And when you do that, you'll see that with the new value of x, two, it now equals eight. So yeah, you need to run things in the right order to get the results you're looking for. Perfect. Now we'll talk about working with different types of data. A vector is the basic data type in R. It's a series of values that can be either numbers or characters. Oh, let me show you one thing. If you want to make your RStudio bigger, as I have done to make it more readable, you can press Control plus, or Command plus on a Mac, and that will zoom the entire thing in. It might make it easier for you to read. Okay, so let's go on to our vectors. You tell R that you're going to build a vector by using this concatenate function, the c with the parentheses. So we want to assign the variable temps this vector: 50, 55, 60, 65.
So I'll run that, or you can use this little arrow. And then when I type temps, it'll print it out. So here we have the vector 50, 55, 60, 65. And you can follow along here. If we want to make a vector of characters, you need to use quotation marks to tell R that the value you're using is not an object you already created, it's just a string of characters. So here we're making animals: cat, dog, bird, fish. And then we'll run animals and see, okay, now it's in R. You can see it in our environment up on the top right. An important feature of a vector is the type of data that it stores. So let's run these two lines of code and decide what type of data the vectors contain. By using this class function, you can see that temps is numeric, because we had just numbers in there. And animals, where we used those character strings of cat, dog, etc., is character. So this is a quick way to check on your data if you're curious about it. Let's do exercise two: create a vector called dec that contains three decimal numbers, and then check what type of data that vector contains. So first, we create the variable dec. Then we use the concatenate function to create a vector of decimal values, such as 1.2 and 3.4. And then we'll print it out. Okay, good, the vector is there. Then we'll use that class function to see, yep, numeric, perfect. Another possible data type, in addition to numeric and character, is logical, or Boolean. That's TRUE or FALSE. But we also said that vectors can only be numbers or characters. So if TRUE and FALSE don't have quotation marks around them, they aren't characters, so they must be numbers. What numbers do you think TRUE and FALSE would be? Heck, yeah, yes, exactly. So we've made this logical vector, logic, and let's look at the class: logical. But yeah, under the hood, R reads TRUE as one and FALSE as zero.
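Here is a minimal sketch of the vector and class checks described above. The temps and animals values come from the session; the four values in logic are an assumption, since they were not read aloud.

```r
# Build vectors with c(); strings need quotation marks.
temps <- c(50, 55, 60, 65)
animals <- c("cat", "dog", "bird", "fish")

class(temps)     # "numeric"
class(animals)   # "character"

# Logical values are typed in uppercase and without quotation marks.
logic <- c(TRUE, FALSE, FALSE, TRUE)   # assumed values
class(logic)     # "logical"

# Under the hood TRUE behaves as 1 and FALSE as 0.
as.numeric(logic)   # 1 0 0 1
sum(logic)          # 2
```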
And with TRUE and FALSE, you have to type them in all uppercase like that whenever you're trying to do an equation with them. In exercise three, let's see what happens if we try to mix different data types in one vector. So let's run this entire chunk using the green arrow. Now we're getting some hints up here in the global environment. And let's take a look at the class. Yes. Okay, so it says character, even though three of those are numbers and only the four, which is written in quotation marks, is a character. So you can see that R has actually coerced all of the other values to be characters. And then let's try char logic. Okay, that one's also character. Interesting. Let's try num logic. Okay, that one's numeric. That one makes sense to me, because we have the three numbers and then FALSE, which we said is a number under the hood. And finally, let's do num char. That one also says character. So we can see that there is a hierarchy, and some data types get preference over others. Which ones got preference? Did you note what R tried to coerce everything into? Right. For any of these vectors that had characters in it, R wanted everything to be a character. And with the logical mixed in with numbers, it coerced it to numeric. So this is just something to be aware of as you're creating your vectors. Make sure that everything in a vector is the same data type, and know that R will try to make sense of the values in its own way if they're not. Cool. Greta, you're up. All right. So we've got vectors, we've got values, we've got variables. But now we want to combine things. We're working towards being able to import data from a spreadsheet, but before we get there, we need to build up to it. The next thing in the hierarchy is what's called a list. A list is a set of vectors, but it doesn't have very many requirements on it. The elements of a list can be characters, numbers, vectors, or matrices, and they can be of any size and shape.
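Going back to exercise three for a moment, the coercion hierarchy can be sketched like this. The vector names and most of the values are assumptions modeled on the description (three numbers plus a quoted four, three numbers plus FALSE, and so on); they are not copied from the workshop file.

```r
# Mixing types in one vector forces R to coerce everything to a single type.
tricky     <- c(1, 2, 3, "4")        # numbers plus a quoted number
char_logic <- c("a", "b", "c", TRUE) # characters plus a logical
num_logic  <- c(1, 2, 3, FALSE)      # numbers plus a logical
num_char   <- c(1, 2, 3, "a")        # numbers plus a character

class(tricky)      # "character"
class(char_logic)  # "character"
class(num_logic)   # "numeric"
class(num_char)    # "character"

# The logical became a number: FALSE is stored as 0.
num_logic          # 1 2 3 0
```

The pattern is that character wins over numeric, and numeric wins over logical.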
But for right now, we've got some vectors that all have the same size, and that's okay. So we're going to make a list of three of them: animals, temps, and logic. And then we're going to actually make the list. I use Windows and not a Mac, so excuse me if I hit the wrong button. So we made our list of vectors, and now we'll see over in our environment pane that we have a new kind of object. It has a blue circle with a white arrow in it, because you can click on that, look inside the list, and see the first few items in it. That section is called Data, and it sits above Values. We can also look at the list in our Markdown file. And there we go, I hit the wrong button again. That will print out the items in the list. Again, each vector has four items in it, and we have three vectors. So the first vector gets a label of one, the second gets a label of two, and so on. If we want just the first vector in our list, we can use double square brackets with a one, and that will give us just the first vector in that list. There's actually a nice little graphic, which we did not make and got from somebody else, probably Data Carpentry, that shows how we can dig into lists and how lists can look different. So let's think about a bookshelf, or maybe your garage shelf, because in garages we have items of lots of different sizes and lots of different shelves to store them on. So we've got some sort of shelving system, and we've got stuff stored in it. Let's think of each shelf as a vector. We're going to name things because we're really super organized. So our system of shelves is called a, and we've got four shelves in it. On the top shelf, we have three items: one, two, and three.
On the second shelf, we just have one item, which is a string. On the third shelf, we just have one item, which is the numerical approximation of pi. And then on the bottom shelf, we actually have another little shelving container that holds two items, negative one and negative five. So if we want to look at just the top two shelves, and keep in mind that they are part of a larger shelving system, we can use single brackets. So a, single bracket, one colon two gets us the first shelf and the second shelf, but it still knows that they're part of a bigger whole. If we want the bottom shelf, which is itself a container, we can use a single square bracket with four. And again, it knows that it's part of a bigger shelving system, and it keeps the structure the same. But if we don't really care about that and we just want the bottom shelf, we use double square brackets, and that completely gets rid of the idea that it came from something bigger. So you see in the graphic, now it just looks like one shelving system that has two numbers in it. How do we get the values out of that? We can add another set of square brackets. So we have the double square brackets with four, which gets us the bottom shelf, and after that we have a single square bracket with one, which gets the first item of the bottom shelf. But again, it knows it came from a shelf. It doesn't know that it's the bottom shelf anymore, but it's only one item. And if we only want the number negative one, and we just want a number without any of this list structure, then we again use double square brackets. So it's double square brackets four, double square brackets one, and then we finally get the negative one. Most of the time, you don't have to worry about R and all these brackets; it should work just fine. Even if it knows that it came from a bigger list, it just might look a little bit different.
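The shelf picture above translates into code roughly as follows. This is a sketch under the assumption that the list a holds exactly the items described (the string's text is made up); the graphic itself is not reproduced here.

```r
# A list with four "shelves": a vector, a string, a number, and a nested list.
a <- list(1:3, "a string", pi, list(-1, -5))

# Single brackets keep the list structure (a smaller shelving system).
a[1:2]        # a list containing the first two shelves
a[4]          # a list of length 1 that holds the nested list

# Double brackets pull out the contents of one shelf.
a[[4]]        # the nested list itself, with two items

# Brackets can be chained to dig further in.
a[[4]][1]     # a list of length 1 containing -1
a[[4]][[1]]   # just the number -1
```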
In the output, you might see some extra things printed, such as the double square bracket three above this list of logicals. Any questions about that? All right. Let's give our shelves some names, because we really want to know what we're storing on each shelf, right? A shelf has to have all the same type of content; otherwise, it's going to be coerced into a different type. So we can name the shelves when we create them. You're going to encounter named lists, and eventually you're going to just stop thinking about them as lists. But let's create some names, and notice that now we're going to create a list whose shelves have different numbers of items on each shelf. A typical list might be something that has a title, so we're going to give it a title of statistics. We separate all these items with commas. The next item in our list is going to be some numbers, and it's one colon 10. Does anybody have any idea of what that might do? Let's just see. We can highlight it and press Command or Control Enter. So one colon 10 gives us the numbers one through 10. And then we're going to create a logical called data, and it's going to say TRUE. When we run this, we create a named list where all of the shelves in our garage have names, so we know exactly what's stored on each shelf. Now notice that instead of the square brackets with numbers, we have a dollar sign and the name that we gave it. That's going to make accessing the information in our list more convenient and more intuitive, because it has a name, so we know what to call it. Any questions about that? All right, let's get to some fun stuff. Let's import some data. There are lots of different ways that we can import data. There is a GUI, a graphical interface, way to do that for a CSV file.
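Before going further with the import, here is a small sketch of the named list built a moment ago. The list's own name (stats_list) and the element name numbers are assumptions; the title, the one colon 10 sequence, and the logical called data follow the description.

```r
# A named list: each "shelf" gets a name and can hold a different number of items.
stats_list <- list(
  title   = "statistics",
  numbers = 1:10,
  data    = TRUE
)

# The printed output now shows $title, $numbers, $data instead of [[1]], [[2]], [[3]].
stats_list

# Access a shelf by name with the dollar sign, or with double brackets and quotes.
stats_list$numbers
stats_list[["title"]]
names(stats_list)
```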
There are two different ways to import the data. Today we're going to be talking about the one method that doesn't require installing any additional packages. So that's the top option, From Text (base). There is another option that we will talk about later in a different workshop, called From Text (readr). We can also do it by running some code. Again, for reproducibility, anything that we do that we want to be able to redo, we want to have in code. But a lot of times when I'm importing data, if I don't know exactly where it's stored and I want to browse to it, or if I'm not exactly sure of the parameters that I want to use when I import the data, I will go ahead and get the graphical interface up and select my data. A CSV file is a comma-separated file. The same option will work for tab-delimited or whitespace-delimited files. So we'll eventually get to this dialog. If you're on Windows, you get to this dialog faster. You can set some different parameters and play around with what you want the result to look like, and you can preview the file. But what we really want to do, if everybody has their zip file extracted, is just hit the play button on this line here where it says BlackfootFish <- read.csv and then, in quotes, BlackfootFish.csv. Again, if the file is saved in the same place as the Rmd file, you shouldn't have to specify the whole path, which makes sharing documents with other people more convenient, because they don't have to know your whole file structure. Did anybody have any problems importing the data? Okay. Sarah, maybe Sally can take over for you just in case there's anybody online that has questions. Okay. Is everybody online good? Okay. If you don't have the CSV file saved in the same place, you can run this really long import code that will get it directly from GitHub. But it sounds like everybody has it, so we don't need to worry about that. So once BlackfootFish is imported, let's take a look at it.
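The import step can be sketched without the workshop file by writing a tiny stand-in CSV first. The column names follow the ones listed in the session; every value, the file name BlackfootFish_demo.csv, and the object name fish_demo are hypothetical.

```r
# Write a small stand-in CSV to a temporary folder (hypothetical values).
csv_path <- file.path(tempdir(), "BlackfootFish_demo.csv")
writeLines(
  c(
    "trip,mark,length,weight,year,section,species",
    "1,0,288,175,1989,upper,rainbow",
    "1,0,310,190,1989,upper,brown",
    "2,1,245,120,1990,lower,rainbow"
  ),
  csv_path
)

# read.csv() is part of base R, so no extra packages are needed.
fish_demo <- read.csv(csv_path)
fish_demo
```

In the workshop the file sits next to the Rmd file, so the quoted file name alone is enough; here a full path is used only because the stand-in lives in a temporary folder.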
So the first thing we want to do is check the class of BlackfootFish. It is a data frame. Okay. So it is not just a vector, it's not just a list, it's a specific type of list called a data frame. And because it's a data frame, it has dimensions. So what do you think these numbers mean? Rows and columns. So which is which? The first one is rows, the second one is columns. Great. Okay. So the rows represent the number of observations that we have, and the columns represent the data that we have collected on those observations. And let's look at the names. Each column has a name. I'm kind of giving away some of the answers to these questions. So on these fish, we have what trip they were collected on, the mark, length, weight, year, section, and species. This is not data that I have worked with. I mean, I've worked with it in these workshops, but I was not part of the data collection process. We could imagine that they were out on a boat, doing a transect, electrofishing perhaps, and collecting fish. So trip just keeps track of which trip they were on, mark might record whether they marked the fish in some way, and then they collected this data. So str gives us information about the structure of the data. It's not just the number of observations by the number of variables, although it gives us that. It gives us the column names, some information about the type, and a preview of the data in each column. The number of things that it previews just depends on the length of each item in that list. So let me, is this more complicated than, are you still working in the zip file, or are you out of the zip file? It's extracted. And you're in the Rmd file that's extracted. Okay. Did you try the GitHub link? You said that it's working, it's just giving you an error every time. It's not registering the working directory, and then you try to access the working directory. Okay. All right.
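For anyone following along without the data, the first checks on a data frame look roughly like this. The small fish_demo data frame is a hypothetical stand-in with a few of the same column names, not the real BlackfootFish data.

```r
# A tiny stand-in data frame (hypothetical values).
fish_demo <- data.frame(
  trip    = c(1, 1, 2),
  length  = c(288, 310, 245),
  year    = c(1989L, 1989L, 1990L),
  species = c("rainbow", "brown", "rainbow")
)

class(fish_demo)   # "data.frame"
dim(fish_demo)     # rows first, then columns: 3 4
names(fish_demo)   # the column names
str(fish_demo)     # dimensions, column names, types, and a preview of each column
```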
I don't want you to get too far behind. I'm sorry you're still having problems, but hopefully we can keep moving forward. Are you good with that? Okay. All right. So we looked at the structure of the data, but we can actually get a summary of all of the data in our data set by just running one function. The function is summary, then an open parenthesis, and then what we want a summary of. So we're going to get a summary of the entire data set, BlackfootFish. And what do we see here? What is this giving us? I told a few people that we're not going to really get into a lot of statistics here, but we're going to do a little bit. If you don't have a big statistics background, that's okay, but just, what's going on here? Mm-hmm. What's going on with section and species, though? You can't take a mean of a character variable, right? So it's just saying that it's a character variable, and it tells you how many values there are. Okay. Let's look at the typeof BlackfootFish and see how it's stored. And it is stored as a list, right? So I said a data frame is a very specific case of a list, in that it requires each object in that list to have the same length. Lists in general can have items that are different lengths, but in a data frame everything has to have the same length so that we get a nice rectangular frame. Any questions about that? Yes? Yep. So it might give you a preview of just a few of them, but if we only want a few of them, we can definitely just use a single square bracket with, let's say, one colon three. Let's see what that does. It's pulling out just the first three columns, so it only gives you a summary of the first three columns. That's a great question. Yeah. We'll get there. No, no, no. That's a great question. We'll get there shortly.
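A short sketch of the summary and typeof steps above, again on a hypothetical stand-in data frame rather than the real BlackfootFish data:

```r
# A tiny stand-in data frame (hypothetical values).
fish_demo <- data.frame(
  trip    = c(1, 1, 2),
  length  = c(288, 310, 245),
  year    = c(1989L, 1989L, 1990L),
  species = c("rainbow", "brown", "rainbow")
)

# Summary statistics for numeric columns; character columns only report length and type.
summary(fish_demo)

# A data frame is stored as a list whose elements all have the same length.
typeof(fish_demo)        # "list"

# Single brackets with 1:3 keep only the first three columns before summarizing.
summary(fish_demo[1:3])
```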
So there are other functions for data frames, some of which we've already run. dim, for dimension, gives the number of rows and columns. If we just want the number of rows, because that tells us the number of observations, nrow gives you the number of rows, and ncol is the number of columns. length is meant for the length of a vector. If you give it a data frame, it'll give you the number of columns rather than the number of rows. So if you want to know how long a vector is, it's length. For those of you that know Python, or are interested in attending our intro to Python workshop in October: unless Python has changed, Python has len for length instead of length written out, or at least it used to. So you have to be careful about making sure that you use the right function for the right programming language, but in R, length written out will give you the length of a vector. So that is size information. For content, head will give you the first six rows, tail will give you the last six rows of the data frame, and View with a capital V will open the data set in a viewer window. So actually, let's run that one. All right, let's do View with the capital V, BlackfootFish. And now we have a new tab in our editor window, and we can see all or some of our data. The other way to get this is over in the environment. You can click on any of these objects, and it will also create a tab in our editor pane. colnames will give you the names of the columns. Sometimes data frames have row names, so rownames will work. If you use data frames with row names, sometimes some code will complain. It's not really something that is recommended these days, but a lot of times old code or old ways of doing things will name rows. So rownames might have information in it, or it might just be the index number. str is the structure of the data frame.
glimpse is another way to get the structure, but it requires a special package, so we'll use that in a different workshop. And summary, which we talked about, gives summary statistics for each column. A lot of these will actually work with other types of data, not just data frames. So a data frame is a structure for tabular data, like an Excel spreadsheet, but again, it has the assumption that we have the same number of rows for each column. We could create a data frame by hand, but usually we import it. We're using read.csv right now, but there are other ways to import data. You can import data directly from Excel .xlsx files. You can also import from other stats packages, which might require you to install a package so that you can import that data. Tab-delimited or whitespace-delimited data can also be imported. The data within a column need to be of the same type, but each column can have a different type of data. Any questions about data frames? All right. Now let's extract data out of our data frame. Before we get into extracting rows, we'll first extract some columns, or variables. And then we'll actually get into extracting rows or particular values. So if we are interested in getting a particular variable, or column, from our data frame, the variables are named vectors. So now, instead of using all the square bracket stuff, we can use the name, and we access a variable or a vector by name using the dollar sign. So BlackfootFish$year will extract the year column. We're actually going to save it in a new variable, years, plural, so the assignment arrow will store it in there. And then we'll look at the structure of years. So now we have a vector that's an integer, and it goes from one to 18,352 because that's the number of observations. And we can see that the first few observations are all from 1989.
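The size, content, and column-extraction functions mentioned above can be tried on a small stand-in. The data frame below is hypothetical; only the function names and the years variable come from the session.

```r
# A tiny stand-in data frame (hypothetical values).
fish_demo <- data.frame(
  trip    = c(1, 1, 2),
  length  = c(288, 310, 245),
  year    = c(1989L, 1989L, 1990L),
  species = c("rainbow", "brown", "rainbow")
)

# Size information.
nrow(fish_demo)     # number of rows (observations)
ncol(fish_demo)     # number of columns (variables)
length(fish_demo)   # for a data frame this is the number of columns

# Content: head() and tail() show six rows by default; here we ask for two.
head(fish_demo, 2)
tail(fish_demo, 2)
colnames(fish_demo)

# Extract one column by name with the dollar sign and store it.
years <- fish_demo$year
str(years)
length(years)       # for a vector, length() is the number of values
```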
And so we can look at this output to know how long the vector is. Or, what was the command to tell us the length of a vector? Maybe it'll be important for us to actually have that as a specific value and not just something that we have to read off the output. Okay. And we can also look at the environment tab to see how many observations are in a particular data set. But again, we might need to use that value: if we're going to create our own average function, we need to know how many observations are in there. The other way to access a particular column or part of the data is using matrix notation. In matrix notation, the row number comes first and the column number comes second, in square brackets, separated with a comma. We could look in the environment to get the dimensions, but we already know those. So instead of specifying year by name, we could just know that it's the fifth item in our list, or our frame. So one, trip, is first, then mark, length, weight, and then year. Then we could put nothing, which will give us all rows, comma, five. That'll give us all rows for the fifth column. Let's run that by itself. And so years down here should still be the same as what we had before, and we do see 1989 a few times. If we want the first observation in the fifth column, we could do one comma five, and that will just give us 1989. And we should have the same structure for years, because it's just a different way of accessing the data. So instead of working with something that's 18,000 observations long, let's create a smaller data frame so that we can really understand how to work with this. The new data frame that we're going to create has H, N, T, W, V, and then some months and some years. We can look at the entire data frame because it's small. We can see that it took our names and made them columns, and the items in each of our vectors become the items in the rows.
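A sketch of that small data frame and of matrix notation follows. The letters H, N, T, W, V, the third row (T, March, 2018), and the second year (2015) are taken from the session; the remaining months and years, and the abbreviation of March to "Mar", are assumptions filled in so the example runs.

```r
# Small practice data frame; several values are assumed (see note above).
df <- data.frame(
  x = c("H", "N", "T", "W", "V"),
  y = c("May", "Oct", "Mar", "Aug", "Feb"),
  z = c(2010, 2015, 2018, 2017, 2019)
)
df

# Matrix notation is [row, column]; leaving a slot empty means "all".
df[, 3]    # all rows of the third column, returned as a vector
df[1, 3]   # first row, third column

# The same column by name gives the same vector.
identical(df[, 3], df$z)   # TRUE
```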
So we can say, okay, what would df[3, ] give us? What do you think that would do? Third row. Let's give it a try. I'm going to run both of these so that it'll print both out. So of the whole data frame, the third row should be T, March, 2018. And let's see. That's what we got. This 3 here is the row index number. What do you think df$z[2] would give you? Second row of column z. So should it give you a whole row or a value? Okay, a value. And the value would be 2015. And that's what we got. So what about [2, 3]? Second row, third column should be 2015 again. Right. So we already did exercises four, five and six. Oh, sorry, four and five. I will let you do exercises six and seven. How would you pull off only columns x and y? What about pulling off only columns x and z? Start with that first one. So go back up to this code chunk here and think about how you would get just columns x and y, and then just columns x and z. So it's like here, that I think had more of them. Yeah, so it's like, yep. Yeah. So you could, if it was part of a larger thing. So the question is, how do you just get a particular row? Yep. So let's say you have some sort of situation that gives you a row number, but you don't really know what it is. Let's just say that it's h. We're going to store 4 in there. So I'm going to do Command-Return on Mac, or Control-Enter on Windows. And then, we don't really know that that's a 4, it's just some number. We can do df[h, ]. And as long as we don't put that h in quotes, it will print out the fourth row. So if you have some sort of condition that finds a row, that will get it for you. If it's a named row and you put the name in quotes, then it'll pull out that particular row by name. So how did we get columns x and y? What's one way? There's multiple ways. Yes. All right.
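The row lookups above can be sketched like this. The data frame is the same illustrative one (months and most years are assumed values), and the row names r1 to r5 are hypothetical, added only to show lookup by name.

```r
# Same illustrative data frame; months and most years are assumptions.
df <- data.frame(
  x = c("H", "N", "T", "W", "V"),
  y = c("May", "Oct", "Mar", "Aug", "Feb"),
  z = c(2010, 2015, 2018, 2017, 2019)
)

df[3, ]    # third row, all columns
df$z[2]    # second element of column z: a single value
df[2, 3]   # second row, third column: the same value

h <- 4     # a row number that some earlier step handed us
df[h, ]    # no quotes around h, so R uses the number stored in it

# Hypothetical row names, to show lookup by name (quotes required).
rownames(df) <- c("r1", "r2", "r3", "r4", "r5")
df["r4", ]
```

With quotes, R looks for a row called "h"; without them, it looks up the value stored in h.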
So df$x, df$y. That will give us the two vectors, but they're not connected together, so it's not a data frame. Let's say we want to have them together in a data frame. We could do df[, 1:2]; that's one way. Another way is df[, c("x", "y")]. We have to put those in quotes because they're names. And we can also go by subtraction. We haven't talked about subtraction yet, so I will save that for later. How would you get just x and z? We could concatenate 1 and 3, df[, c(1, 3)], or we could do it by name, df[, c("x", "z")]. And again, we could do it by saying we don't want y, and we'll talk about that here in just a little bit. Any questions about that? Yeah, how come you have to put the c and the parentheses? Because we're skipping 2. Let's just see what happens if we do df[, 1, 3]. It just gives us the first column, because it now looks like it's going into three dimensions and it doesn't know how to handle the comma three. So it has to know that everything that we're putting in that second spot in that call is just column information. So we're combining the two columns that we want. But that's a good question. All right, so let's go look at just a vector. Let's say we have a vector of six values, s. I've saved that; we can see it in our environment over here. We want to get an output of 22 and 24 from s. What would we do? Again, there's more than one way to do this. 22 and 24 are the first two, so we could just do s[1:2], and we would get those out. Or we could say we don't want to rely on the fact that they're adjacent. We could do s[c(1, 2)], and that will also pull out the first two observations. Let's see what happens if we do s[3]: we get 14. What happens if we accidentally put a comma after this 3? Go ahead and try running that.
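A short sketch of the column and vector subsetting options mentioned here. The data frame values are illustrative, and only the first three entries of s (22, 24, 14) come from the workshop; the last three are made up. The tryCatch wrapper is just a way to try the stray-comma version without stopping the script.

```r
# Illustrative data frame; months and most years are assumptions.
df <- data.frame(
  x = c("H", "N", "T", "W", "V"),
  y = c("May", "Oct", "Mar", "Aug", "Feb"),
  z = c(2010, 2015, 2018, 2017, 2019)
)

xy_by_position <- df[, 1:2]           # columns x and y by position
xy_by_name     <- df[, c("x", "y")]   # the same columns by name
xz_by_position <- df[, c(1, 3)]       # skipping column 2 needs c()
xz_by_name     <- df[, c("x", "z")]

# Vector of six values; the last three are hypothetical.
s <- c(22, 24, 14, 31, 27, 19)
s[1:2]       # first two values, using a range
s[c(1, 2)]   # first two values, listing positions explicitly
s[3]         # third value

# A vector has one dimension, so a stray comma is an error.
# tryCatch captures the error message instead of halting.
msg <- tryCatch(s[3, ], error = function(e) conditionMessage(e))
msg
```

Both forms of each selection return the same data frame; choosing names over positions just makes the code less fragile if the column order ever changes.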
Sometimes things in red in RStudio are not meant to be scary. And sometimes they're legitimate errors that are trying to be helpful. So it says incorrect number of dimensions. What are we trying to do here when we have a comma after the 3? Get the third row or the third column. So we're trying to treat it like a matrix, but it's a vector. And even though vectors are matrices with just one column, R doesn't know that. Okay, not that smart. So we have to make sure that we know whether we're working with a vector or with a matrix. And I'll remove that so that eventually we'll be able to knit this. All right, let's say that we import data. And when we import data.
All right, let's go ahead and get started again. During our short little break, we did have one question about cheat sheets and a better way to learn more. Good news: Posit, the company that makes RStudio, has tons of cheat sheets, and now they make it easy to find them. If you go to the Help menu, there is Cheat Sheets, and they have all sorts of different kinds: the IDE cheat sheet, so all the different things that you can do in RStudio; how to do data transformation; data visualization. A lot of these are for specific packages. But then there is R Markdown, so writing these documents: the R Markdown cheat sheet and the R Markdown reference guide. The next workshop is going to be data visualization, and so we will specifically talk about the data visualization cheat sheet. The other thing here is that there are keyboard shortcuts that you can click on, and it'll give you the different shortcuts, and it changes depending on whether you're on a Windows or a Mac computer. And the other thing that we didn't really talk about that you can do is changing how your interface looks. So, I'm sorry for having some trouble speaking.
But if you are somebody that likes to work in dark mode, and it makes visualization easier for you, you can go to, let's go back here again, Tools, Global Options, Appearance. You can change the size of the font here directly instead of using Command-plus and Command-minus, or Control-plus and Control-minus. But then you can change the theme. Let's just look at Dracula, Dreamweaver. I don't really see much difference there. God, that would actually be really hard for me to look at. But maybe that works for you. There's lots of different kinds. And the other thing that is nice is that for code, somewhere in here, there is rainbow parentheses. I really like rainbow parentheses. And I can't remember where that's at. It's got to be in Code. Thank you. Yeah, rainbow parentheses, down at the bottom. Like in Excel, it'll help you keep track to make sure your open and closed parentheses are balanced. I'm not going to change Sarah's settings; I'm just going to cancel that. But, great question about the cheat sheets.
I'm not going to really go into detail about the different ways that data is imported when we talk about importing it via read.csv and read.table and read_csv. A lot of the differences have been removed by changing default options. For right now, we're just going to focus on what we do if our data comes in as a character and we don't like that; we want it to actually be numbers. We'll talk about how we want to handle that. And then if it comes in as character, maybe we want to numerically code what those different character strings are. That will be a factor variable. We will want to do that deliberately and not just have it done automatically. So what we're going to do is go back to our Blackfoot fish data set. And let's just look at our two categorical variables now instead of the numeric variables.
And let's see what values we have. What are the unique values that we have in each of those columns? That function is unique. So unique, open parenthesis, BlackfootFish for the data frame, dollar sign, species for the particular variable. And we see that we have RBT for rainbow trout, WCT for westslope cutthroat trout, bull and brown. Those are different species of trout that exist in the Blackfoot River. And let's look at section. We have two sections there, Johnsrud and Scotty Brown. And if I'm pronouncing the first one wrong, I apologize. Again, this is not my data. Now let's say, okay, we know that there's only four types of trout and only two types of section. We want to numerically code them in order to be able to do specific visualization or analysis. That is converting to a factor variable. Back in the old days, like when I was in college, computers couldn't really handle long character strings, and storage was better if you could store a character variable as a factor, because numbers are easier to store than characters. As long as you had fewer than 256 unique levels, it was better for storage to store that as a factor variable rather than as a character string. There is still some storage savings if you save it that way, even now, especially when you're working with large data sets. You just store the labels once, say there's 256 of them, and then you just have the integers from 1 to 256. That saves you a lot of storage space. But there are still other reasons that we would want to create a factor variable, especially in social science, where we do math on factor variables, i.e. we average Likert variables. If you don't know what that means, don't worry about it. But sometimes there are reasons to do math on strings. So we can create a factor variable by using the factor function on our character vector.
And we're going to save this in a new variable within our BlackfootFish data frame. We're not overwriting our original variable, because we might want to look at both of them, or we might realize later that we made a mistake and want to go back. If we overwrite it, we would have to start over by re-importing the data. So we're not changing the CSV file, but we are mutating the data frame. So we do want to make sure that we have a way to go back that doesn't require re-importing the data. We're going to change both species and section into factor variables. And then we'll look at them. So now, let's look at the head of BlackfootFish$species, and I'm using tab complete so that I don't have to worry about spelling mistakes. Let's look at the head of species itself. And we just see a bunch of RBTs. Let's see what happens when we replace that with speciesF. We still see RBT. But now it gives us this other information, where we've got levels of brown, bull, RBT and WCT. So it knows that it's characters, but it's stored as a factor, and it knows that there's only four different possibilities for the species of trout. But maybe we don't like how the levels are ordered: brown, bull, RBT, WCT. R likes to do things alphanumerically. So it'll go alphabetically or by numbers, whichever comes first, although it doesn't like things that start with numbers for variable names. But if, for instance, we wanted to use a particular type of fish as our control that we compare everything to, we might want to change the order. If we want to change the order of the levels of the factor variable, we use the levels parameter. So first we have the same thing that we had before: factor is our function, BlackfootFish is our data set, dollar sign, species. And now we want to give it additional information in the function.
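As a small sketch of what converting to a factor does, here is a made-up stand-in for the species and section columns. The data frame name fish, the exact spellings of the labels, and all the values are assumptions; the point is only to show the new F columns sitting next to the originals.

```r
# Hypothetical stand-in data; labels and values are made up.
fish <- data.frame(
  species = c("RBT", "WCT", "RBT", "Bull", "Brown", "RBT"),
  section = c("Johnsrud", "ScottyBrown", "Johnsrud",
              "ScottyBrown", "Johnsrud", "Johnsrud")
)

unique(fish$species)   # the distinct values, in order of appearance

# Store the factor versions in NEW columns so the originals survive.
fish$speciesF <- factor(fish$species)
fish$sectionF <- factor(fish$section)

head(fish$species)          # plain character strings
head(fish$speciesF)         # same labels, plus a Levels line
levels(fish$speciesF)       # labels are stored once, sorted alphabetically
as.integer(fish$speciesF)   # the integer codes stored for each row
```

The last two lines show the storage idea from the discussion: one short table of labels, and then just an integer per observation.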
So we have a comma there. And we're going to specify the parameter levels. Then we're going to concatenate the levels and put them in the specific order that we want. So we're going to put bull in front of brown. We run this; it doesn't print anything out, but we can use the levels function. So levels is a parameter in the factor function, but it's also a function itself. So we can say levels of BlackfootFish, tab complete to get the dollar sign, and I'm just going to scroll down; I want speciesF. And now we can see: before, it was brown, bull, RBT, WCT. Now it's bull, brown, RBT, WCT. So it just puts it in the order that we specified. It doesn't change the data. It just changes the coding that's used for those levels. And I already used that code. Okay. Yes? It's to suppress some sort of output. It's just to give it a cleaner look in the knitted output. Each code chunk has a few different options we can change, and we can suppress any warnings. Sometimes when you load packages, it'll put out a long message, and you can turn those off with chunk options. We could learn more about the different types of parameters there. It's not something that I typically use a lot, except when we're just trying to clean up how things look in the knitted output. And now I have to remember how to get back to RStudio. Okay. So how do we work with factor variables now? Well, we converted character variables to factor variables. What happens if we convert a numeric variable to a factor variable? Years could be treated as categorical, if we have a few different years and we want to treat them as categories, or they could be treated as numeric. But let's see what happens when we convert year to a factor variable. And I'm just going to add head in here so we can see what happens.
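A minimal sketch of reordering levels, using a made-up species vector (the label spellings and values are assumptions). It shows levels both as an argument to factor and as a function in its own right.

```r
# Hypothetical species values; spellings are assumed.
species <- c("RBT", "WCT", "RBT", "Bull", "Brown", "RBT")

default_order <- factor(species)
levels(default_order)   # alphabetical: Brown comes before Bull

# Same data, but with the level order spelled out: Bull first.
speciesF <- factor(species, levels = c("Bull", "Brown", "RBT", "WCT"))
levels(speciesF)

# The observations themselves are unchanged; only the coding differs.
as.character(speciesF)
as.integer(speciesF)
```

One thing to watch: any value in the data that is missing from the levels you list becomes NA, so it is worth checking the spelling against unique() first.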
And we'll compare the head of BlackfootFish year to the head of year as a factor variable. All right. They look the same. So it looks like we've got the output printed twice. But the first one is for just the year. The second one is the year as a factor variable. But now we have a sense that we only have a limited number of levels, 1989 through 2006. And we can verify that yearF is viewed as a categorical variable with the same levels as year. I'm just going to go ahead and run these for the sake of time. So the different levels of BlackfootFish yearF are the levels that were displayed below. And they're the same as the unique values. They just don't have quotes around them, because they're values and not names. Okay. Now let's say we want to go back. We regret changing year to a factor variable, and we want to go back to treating it as a number. So let's see what happens here. First, we're going to recreate yearF as a factor variable. And now let's try to go back and make it a number again. We're going to say as.numeric: we want this factor variable to be treated as a numeric vector. And so we're going to create this vector, year recover. Let's see what happens here. I'm going to create a new data frame that has the original factor variable, BlackfootFish$yearF, and then a new variable, recovered, which takes on the values that we stored in year recover. And let's look at the head of that. So we can see that the original column is identified as a factor variable underneath there; we can see fctr for factor. And the recovered column says dbl, and it has values of 1. Why does it have a value of 1? Let's look at the tail. 1991 has a value of 3. Yeah, exactly. This data frame is not sorted in a particular order, right? So the head just happens to contain 1989.
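The surprise with as.numeric can be reproduced on a few made-up years. The variable names here are hypothetical. The last step, going through as.character first, is not something shown in the workshop; it is the usual base R idiom for getting the original numbers back, included here as an aside.

```r
# A few made-up years standing in for the year column.
year  <- c(1989, 1989, 1990, 1991, 1991)
yearF <- factor(year)
levels(yearF)   # "1989" "1990" "1991", stored as labels

# Converting the factor directly returns the integer codes, not the years.
year_codes <- as.numeric(yearF)
year_codes      # 1 1 2 3 3

# Assumed fix (standard idiom): convert labels to text, then to numbers.
year_back <- as.numeric(as.character(yearF))
year_back       # 1989 1989 1990 1991 1991

data.frame(original = yearF, recovered = year_codes, via_labels = year_back)
```

Keeping the untouched year column around, as recommended, means you never have to rely on this round trip at all.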
And the tail just happens to contain 1991. 1991 is the third year in the list, starting from 1989. Remember, factor variables take all of the strings and store them as integers. So the first one in the list gets a value of 1, the second one gets a value of 2, and so on. And maybe we want to know how many years after the initial year, but if we're trying to recover the original year, we probably don't want that. We probably want 1 to be 1989 and 3 to be 1991. So you have to be really careful about converting back from factor variables to numbers. And that's why we recommend creating a new variable instead of overwriting, so that we can actually get back to what we started with. There are some functions that need factors. And like I said, sometimes we want to do math on something that looks like strongly disagree to strongly agree. In that particular case, we want it to be a value from zero to five in social science. But if you don't need to have them as a factor variable, it's recommended that you don't. But it's important to talk about them, because if on import your default parameters somehow get messed up, your character variables might come in looking like characters, but actually be stored as numbers.
All right. So we're going to briefly talk about packages. We're not going to really get into that a lot. We will use packages in the next workshop, so we want to make sure that you know how to install them, because some of them take a while to install. And if you're here because you're using RStudio in a class, you're probably going to need packages before we cover them. So some common packages that you might need, and that you will need for the data visualization workshop, are the remotes package and tidyverse. We can install both at the same time by going to the Packages tab and going to Install. And we can put multiple packages in here at the same time.
So remotes, tidyverse, and we could keep going. We don't need the space between the packages, but we do need the comma. And I love your explanation, as a librarian, of what packages are. So packages are like books in a library. If you want to know certain things or use certain functions, you have to check out a book. But before you can check out a book, you used to have to get a library card. You probably still need a library card. So you have to be able to get access to the library, or access to the book. And so you have to install it first. I'm not going to do that now. Make sure you keep Install dependencies checked. That way, if you need a stack of books in order to use one book, it will bring them all out. And I'm paraphrasing, because I can't quote you directly. When you install packages, it doesn't mean that you actually have access to them; you have to open the book in order to know what's inside of it. So after you install them, you need to load them by using the library function. Occasionally, if you are looking at somebody else's code from a long time ago, they might use require instead of library. Library is friendlier and require seems more demanding, but they give you the same thing. And so you would just run those. This is where you might see a lot of red scary things. Most of that can be ignored. Sometimes a package will have a new understanding of a particular function, and so it'll tell you that there's a conflict. But don't worry about that. It just might be something that you need to be aware of. For example, the catstats package that's developed here at MSU might have a different understanding of how to calculate the mean of a vector. So it might have a new command called mean that overrides the base R way of calculating a mean. The conflicts message is just making you aware that there might be some new interpretations of certain functions.
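For anyone who prefers typing to clicking, here is a hedged sketch of the same install-then-load steps from the console. The install and library lines are left commented out so nothing is downloaded by accident; the check with requireNamespace is an extra, not something shown in the workshop.

```r
# Packages mentioned in the workshop.
pkgs <- c("remotes", "tidyverse")

# Check which ones are already installed (TRUE means installed).
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Install whatever is missing; the quotes and the comma are required.
# Uncomment to run. This can take a while for tidyverse.
# install.packages(pkgs[!installed], dependencies = TRUE)

# Installing is not loading: open the "book" each session with library().
# library(tidyverse)
```

If a loaded package does mask a function you want, you can still ask for a specific version by prefixing the package name, for example base::mean(x).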
Again, installing the tidyverse package comes with a lot of dependencies, so it takes a long time. Same thing with mosaic. So if those are things that you need, or you're told that you need, make sure that you do it before class or before a workshop. Don't try to get it done in a couple of minutes.
All right, finding help. This is really overwhelming, and there's lots of questions. I still need to look up things all the time. So I still access help for different functions or packages, or just to learn new things. There are a few different ways of doing this. You could put it in your markdown file. If you do that, every time you run your code it's going to try to pop up a new window. So it's generally better not to put requests for help in markdown, but to do it either in the console or in the Help pane. Let me try to make this bigger. So, Help over here. You can search. Let's look up mean. And then it will pull it up. If there's only one thing that goes by that name in the packages you have loaded, it'll just automatically give it. Otherwise it might give a list of different functions that have that same name. You can also find in topic. So if you need to look for a particular parameter or something, you can type that in there and it'll search within that particular page. The nice thing at the end is that there are references, if you want to learn more about a function. There are also examples. If you click Run examples, it will run them in the Help pane. But what's also nice is that you can copy. Oops. Sorry. Not Windows. You can Command-C to copy and put it in the console. And then you can modify this. Let's say that we don't want the numbers from zero to 10. We want the numbers from zero to 100, and 50. And then we can run them over here and get different values out. So we can modify the examples that they have. You can also use question marks in the console.
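A quick sketch of the console routes to help, followed by a modified version of the example on the mean help page. The help calls are commented out so they don't pop open windows when the file is run or knit; type them in the console instead. The exact numbers used in the workshop were read off the screen, so treat c(0:100, 50) as an approximation of what was typed.

```r
# Ways to ask for help from the console (uncomment to try):
# ?mean            # open the help page for mean
# help("mean")     # the same thing, written as a function call
# ??"trimmed"      # search all installed help pages for a word
# example(mean)    # run the examples from the help page

# Modified copy of the help-page example: 0 to 100, plus an extra 50.
x  <- c(0:100, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))
```

Copying an example and changing one piece at a time is a low-risk way to learn what each argument does.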
And sometimes if you start typing a function, it will pop up a preview of the help, and then you can get additional help. So if you're not really sure which mean you want, you can scroll down here and get the preview for some of them, if they have it, and then look at the help there. I know we haven't really talked about ChatGPT. I've avoided using it, just because I don't. But I know that other people have, and that it can write R code. But you have to be able to read R code well enough to know if it's doing the right thing. So I recommend not going to ChatGPT for help, but maybe you could ask it for help on a particular R function and it would give you help. I think that the built-in help functions are just as good, or probably better and more reliable, than what you can get there.
And at Statistical Consulting and Research Services, SCRS, we are having drop-in office hours every day but Friday. So there's two hours on Thursdays and then an hour each other day. You can drop in and get quick help if you have any coding questions. So we are available, and this is for classes that are not covered by the Math and Stat Center. So anybody, anywhere can come in and get help. If you're attending online, we can do virtual office hours as well. We don't really have anything set up, but you can email and we can set up a virtual meeting.
We talked about functions a little bit. We'll go more in depth into functions briefly now, but really we'll talk about them more in Intermediate R, which will run in the spring, or you can watch the video that we have. We've already used some functions. We've used summary and class. We looked up the help for the mean function. If we want to look at a new function, let's look at rep, R-E-P, and that will repeat a particular value a certain number of times. Functions generally have parameters. I'm going to put the help for the rep function over here, and so we have some usage.
We need an x, and then there's times and length, so we can look at the arguments. We need at least an x; times and length are optional. It's going to repeat a particular value a certain number of times, so we can repeat zero ten times, and we get ten zeros. If we name our parameters, we can change the order that we have them in the function, so if we have rep(times = 10, 0), you get the same output. If we don't give any names, we just need to be careful about the order that we put them in. So rep(0, 10) repeats zero ten times. But what happens if we have rep(10, 0), no names? We get an empty vector, because we're repeating ten zero times, which is probably not what you want unless you're initializing a vector, creating a place for something that you'll store in there later. Quickly, some other functions that are commonly used. mean takes a numeric vector, but we get an NA out. The reason why we have an NA is because we have some fish that, for some reason, they did not collect a weight on. So if we want to remove the NAs so that we can actually get an average of all the other values, we have a parameter called na.rm, for remove the NAs. We're going to set that equal to TRUE, and now we can get a mean weight.
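The rep and mean behavior just described, as a short sketch. The weight values are made up, with one NA standing in for a fish that was never weighed.

```r
# Positional arguments: x first, then times.
ten_zeros <- rep(0, 10)
ten_zeros

# Naming an argument lets us change the order.
same_thing <- rep(times = 10, 0)
same_thing

# Swapping the unnamed arguments repeats 10 zero times: an empty vector.
empty <- rep(10, 0)
empty   # numeric(0)

# Hypothetical weights with one missing value.
weight <- c(175, 190, NA, 275)
mean(weight)                 # NA, because one value is missing
mean(weight, na.rm = TRUE)   # average of the values that are present
```

Naming arguments costs a few keystrokes but removes the guesswork about which number means what.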
Let's see what happens if we try to run the median function. Argument is not numeric or logical, returning NA. Why can't we get a median of that vector? Yeah, what's the middle species? Right. We don't have any sense of order, so we don't really know what the middle one would be. So we have to make sure that we are using the functions on the correct data type, which is another good reason to look at the typeof, which is another function that we did, or the structure of the data. cor will give us the correlation between two variables. Okay, correlation is the strength of the linear relationship between two quantitative variables. If you're in a stats class and you haven't heard that before, it's key: the strength of the linear relationship between two quantitative variables. All right, length and weight are quantitative, so we're good. So this should work, right? Let's see. Ah, NA. It doesn't work because we have missing values in weight, right? So let's see what happens. na.rm worked before, so it should work again, right? Unused argument, na.rm = TRUE. All right, well, let's go over here for help on cor. How do I get rid of missing values? Well, it says na.rm = FALSE, so we should be able to use it. Oh, but that's only for the variance function. Correlation doesn't have an na.rm argument. This is why we talk about this one, because it's confusing. Okay, so, oh, that's covariance. Correlation: use, everything. Okay, so in order to get rid of missing values for correlation, we have to use the use parameter. And this is actually pretty complicated. If we go through and read down here, use can take on everything, all observations, complete observations. Generally we recommend using pairwise complete observations, and you should be able to abbreviate it with just pairwise. So it'll compute the correlation between all observations that have a complete pair. So sometimes the help is a little bit confusing, and so
being able to read that, or to go and seek out other resources, is also good. We talked about dimension already. If we want to get rid of all missing values, just completely wipe out all the observations that have any missing values, na.omit will do that. It removes any row that has at least one missing value, which is probably too strong, but it's possible. And so we went from 18,000 observations down to 16,000, but we kept the same number of columns, because we're not removing columns. Let's say that we want to actually do that and only keep complete observations. We can save that in a clean data frame now, so that we're starting from a place where we know that it doesn't have any NAs in it at all, and then we can work from there. So we're running way behind. Sarah, do you want to just wrap up? Okay, I think we can. So you've kind of seen how we work through all of this material in the Rmd file, so you can continue to work through it on your own. We also have videos online. Let's see, I have it here: montana.edu/datascience/training. We have recordings, I think, all the way through last fall, so you can look at some of the additional material there. But the one thing I want to do is exiting RStudio. When you're done with all your work and you want to exit out, it will ask if you want to save your workspace, and we don't recommend doing that. So instead, go to Tools, Global Options, as Gretta did, and make sure that, let's see, Save workspace to .RData on exit, you want that to be Never, which it is for me. So do that. And then, how do I get out of here? There we go. And so save your R code as a .R or an R Markdown file, which ours already is. You can see that my title is written in red, but if you just do Command-S, that will save it, and it will be written in black to let you know that you're good to go. So that will just, it's a
cleaner way to exit R. Then you can exit R, and your whole workspace won't be saved, but your Rmd file will be.
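Going back to the missing-value steps from a few minutes ago, here is a small sketch of cor with the use parameter and of na.omit. The data frame fish and its values are made up, with one missing weight; only the function names and the "pairwise.complete.obs" option come from the workshop.

```r
# Hypothetical lengths and weights, with one weight missing.
fish <- data.frame(
  length = c(288, 288, 285, 322, 312),
  weight = c(175, 190, NA, 275, 250)
)

# With a missing value and the default settings, the result is NA.
cor(fish$length, fish$weight)

# cor has no na.rm argument; the use parameter handles missing values.
r <- cor(fish$length, fish$weight, use = "pairwise.complete.obs")
r

# Drop every row that has at least one NA and keep the result separately.
fish_clean <- na.omit(fish)
dim(fish)         # 5 rows, 2 columns
dim(fish_clean)   # 4 rows, 2 columns: rows removed, columns kept
```

Saving the cleaned version under a new name keeps the original around in case dropping whole rows turns out to be too aggressive.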