Hey folks, over the past dozen episodes I have been trying to lead you through what I think are the most important components of base R that everyone should learn, even if you're doing everything within the tidyverse. Yes, the tidyverse is great, but there will come a point where you need to know how to write a function. Perhaps you need to know how to write a for loop. Perhaps you need to make an if-else block. Perhaps you need to know what the different types of data are and what their attributes are. That's what we've really been focusing on here in Code Club with base R. Today, we are going to start to pivot towards working within the tidyverse. That involves using a type of data frame called a tibble. What is a tibble, you ask? Well, I hope you watch through to the end of this episode, because you will find out.

I've got my analysis.R script open. This lives within the code directory that's within the distances project directory. We source in code/read_matrix.R, which gets us the function read_matrix, and that allows us to read in a distance matrix that is stored in my data directory. If I run all this, the simple Bray-Curtis .dist file gets us a 10 by 10 matrix, and I'm going to call this dist_matrix. Now if I use the str function that we talked about a few episodes ago on this matrix, we see that it is a numeric vector that's got 100 values in it. It's got two dimensions, 10 rows and 10 columns, and we have the names of those rows and columns.

Say we wanted to get the first row out of this matrix. You do dist_matrix, square brace, and then we need to give two index values. So I might do row 1, then a comma, and then nothing else to get all columns: dist_matrix[1, ]. This returns a named vector, where each element in the vector has a different name. Alternatively, we could do dist_matrix, square brace, and then "F3D11" in quotes. Again, we need the comma to indicate that this is the row we want, not the column. This returns another named vector. We could also combine things and do, say, dist_matrix["F3D11", "F3D125"]. This should return 0.44, and sure enough, that's what we get. So this is what is contained within a distance matrix, with a little bit about what's under the hood in terms of attributes: it really is a vector that has two dimensions.

Another type of data that we sometimes want to use for a distance matrix is a dist object. A dist object is a lot like a matrix, but it has some special attributes. So what I could do would be as.dist(dist_matrix). This outputs something that looks very much like the input file, in that it is a lower triangular distance matrix. We still have the row and column names. You'll notice there isn't a row F3D0. Because the matrix is symmetric and has a diagonal of zeros, the distance between F3D0 and F3D0 would be zero, so it doesn't get stored. Let me go ahead and take this as.dist call and save it to dist_dist, and then we can look at the structure of dist_dist. We see that it is again a vector. It's of class dist, it has 45 values, and it has additional attributes that we perhaps hadn't seen before. We have labels for the different samples in the distance matrix, the size (the number of rows and columns), information about the call that generated it, and whether the diagonal and the upper triangle get displayed. Those last two values are both FALSE, because it's being shown as a lower triangular distance matrix. The only time I ever use a dist-formatted distance matrix is when I am using something like vegan, or a clustering algorithm that requires a dist object as input. So know that it is there, but it's a little bit different.
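To make the matrix indexing and the as.dist conversion concrete, here is a minimal sketch. It assumes a made-up 4-sample matrix rather than the 10-sample file read in the episode, so the sample subset and every distance value are hypothetical stand-ins chosen only for illustration.

```r
# Hypothetical 4-sample distance matrix; the values are invented for illustration
samples <- c("F3D0", "F3D1", "F3D11", "F3D125")
dist_matrix <- matrix(
  c(0.00, 0.10, 0.20, 0.30,
    0.10, 0.00, 0.40, 0.50,
    0.20, 0.40, 0.00, 0.44,
    0.30, 0.50, 0.44, 0.00),
  nrow = 4,
  dimnames = list(samples, samples)
)

# A matrix is a numeric vector with dim and dimnames attributes
str(dist_matrix)

dist_matrix[1, ]                 # first row, all columns: a named vector
dist_matrix["F3D11", ]           # same idea, selecting the row by name
dist_matrix["F3D11", "F3D125"]   # a single cell, selected by row and column name

# Convert to a dist object, which stores only the lower triangle
dist_dist <- as.dist(dist_matrix)
dist_dist                        # prints as a lower triangular matrix
str(dist_dist)                   # class 'dist' with Labels, Size, Diag, Upper attributes
length(dist_dist)                # 4 * 3 / 2 = 6 stored values
```

With 10 samples the same arithmetic gives 10 * 9 / 2 = 45 stored values, which matches the 45 values reported by str in the episode.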
Something else to point out: if we did dist_dist with "F3D11" and a comma, this is going to complain, because that's an incorrect number of dimensions. So you might think, well, maybe I remove the comma. Run that and it gets you an NA value, because the values in a dist object aren't named. What you really need to do is give it the position of a cell. If I did dist_dist[5], I get out the distance that's in the fifth slot, which is about 0.339. What it's doing is going column-wise down the lower triangle: one, two, three, four, five. So you'd have to know which element it is, column-wise, in the distance matrix to get it out. That's really painful. It's so much easier to work with a matrix-formatted distance matrix than a dist-formatted distance matrix.

The next way to represent the data that I want to work on with you is as a data frame. You do as.data.frame on dist_matrix, and this gives you output that looks a lot like what we had with the matrix. Let's go ahead and assign that to dist_df, and then do str on dist_df. You'll notice this is quite a bit different than when we looked at the structure of dist_matrix. With dist_matrix we had a vector and then the dim and dimnames attributes, whereas the structure of dist_df indicates that this is a data frame with 10 observations of 10 variables, and we see this dollar sign. As we've talked about in previous episodes, the dollar sign is an indicator that the object is a list or contains list elements. So I could do dist_df$F3D11, and this returns the contents of that column. Again, the data frame is a list of vectors where all the vectors are the same length, and the same position in each vector corresponds to the same sample. So here I've shown how we can use the dollar sign with F3D11.
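Here is a small sketch of why the dist object is awkward to index and how the data frame version behaves. It again assumes a made-up 4-sample matrix, so the values differ from the 0.339 seen in the episode.

```r
# Hypothetical 4-sample distance matrix; the values are invented for illustration
samples <- c("F3D0", "F3D1", "F3D11", "F3D125")
dist_matrix <- matrix(
  c(0.00, 0.10, 0.20, 0.30,
    0.10, 0.00, 0.40, 0.50,
    0.20, 0.40, 0.00, 0.44,
    0.30, 0.50, 0.44, 0.00),
  nrow = 4,
  dimnames = list(samples, samples)
)
dist_dist <- as.dist(dist_matrix)

# Two-dimensional indexing is an error on a dist object; capture the message
two_dim <- tryCatch(dist_dist["F3D11", ], error = function(e) conditionMessage(e))
two_dim

# Indexing by name gives NA because the stored values have no names
dist_dist["F3D11"]

# Indexing by position walks column-wise down the lower triangle:
# (F3D1,F3D0), (F3D11,F3D0), (F3D125,F3D0), (F3D11,F3D1), (F3D125,F3D1), ...
dist_dist[5]   # the F3D125 vs F3D1 distance, 0.5 in this made-up matrix

# The data frame is a list of equal-length column vectors
dist_df <- as.data.frame(dist_matrix)
str(dist_df)     # each column is listed with a leading $
dist_df$F3D11    # one column pulled out as a numeric vector
```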
You might be wondering, could we use that dollar sign with the matrix format? No, you can't. If I do dist_matrix$F3D11, it says the $ operator is invalid for atomic vectors, which is what that matrix is. So again, we can use dist_df$F3D11. We could also use square braces with the name in quotes after a comma, dist_df[, "F3D11"], which says I want all of the rows and the column F3D11. Alternatively, the rows also have names in dist_df, and F3D11 is a named row, so I can do dist_df["F3D11", ]. Here we get the same numerical values, but the output is a data frame. It's a data frame with one row, that row is named F3D11, and we have all the different columns.

Because the data frame is a list of vectors, I can use that dollar sign, but I can also use double square brace notation. I can do dist_df[["F3D11"]], and that's going to give me the variable F3D11 from dist_df. Remember, those columns are also thought of as different variables. So this gives me a vector of the values in F3D11. If I only used single square braces, dist_df["F3D11"], this returns another data frame. Again, it's a list with one vector in it, that one vector is called F3D11, and the rows keep their names.

Again, dist_df is a data frame. If we look at it, we see it has row names and column names. We know that these are row names for a couple of reasons. We could look at the structure of dist_df, but I can also see that they are row names because there's no column heading above F3D0. Well, I would prefer not to have row names. A while back, when we were talking about data frames, I commented on how names are nice because we can do lookups like we just did on F3D11. But say I had numbered these samples by the day, so 0, 1, 11, 125, 13, without the F3D. I would get confused over whether 11 means the 11th row or the row named 11. So it's generally considered good practice not to use row names. Instead, I'd like to have a column for my samples.

What we can do is get rid of these row names in dist_df and put them in a column instead. I'm going to create a vector called samples, and this is going to be rownames on dist_df. If I look at samples (samples, not the function sample), I get all those names. I can create a column in my data frame by assigning samples to dist_df$sample1. I'm calling it sample1 because I'm going to do this a few times. Now if I look at dist_df, I see a column called sample1 at the very end that has all my sample names. So I could get rid of those row names and work with this sample1 column instead.

That was with a dollar sign. I could also do this with square brace notation, dist_df[, "sample2"], and set that equal to samples. Now dist_df has two columns at the end that are identical to each other, but we created them by two different approaches: one with the dollar sign, and one with the square braces, using the comma to indicate the new column we wanted to make. In fact, I can do the same thing with a single square brace without the comma, dist_df["sample3"], set equal to samples. And let's go crazy and do it with two square braces to see what happens, and name this one sample4. Now we can see that we have these four columns. The point here is that all of the approaches we used to get values out of the data frame can also be used to add values to the data frame. That's a pretty nice trick for adding information.
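The four ways of pulling values out of the data frame, and the same four ways of adding a column, can be sketched as follows. The 4-sample matrix is a hypothetical stand-in for the episode's data.

```r
# Hypothetical 4-sample distance matrix; the values are invented for illustration
labels <- c("F3D0", "F3D1", "F3D11", "F3D125")
dist_matrix <- matrix(
  c(0.00, 0.10, 0.20, 0.30,
    0.10, 0.00, 0.40, 0.50,
    0.20, 0.40, 0.00, 0.44,
    0.30, 0.50, 0.44, 0.00),
  nrow = 4,
  dimnames = list(labels, labels)
)
dist_df <- as.data.frame(dist_matrix)

# Getting values out
col_dollar  <- dist_df$F3D11        # numeric vector
col_comma   <- dist_df[, "F3D11"]   # numeric vector (all rows, one column)
col_double  <- dist_df[["F3D11"]]   # numeric vector
col_single  <- dist_df["F3D11"]     # one-column data frame
row_by_name <- dist_df["F3D11", ]   # one-row data frame

# Putting values in: the same four notations create new columns at the end
samples <- rownames(dist_df)
dist_df$sample1      <- samples
dist_df[, "sample2"] <- samples
dist_df["sample3"]   <- samples
dist_df[["sample4"]] <- samples

colnames(dist_df)
```

All four new columns hold the same sample names; only the notation used to create them differs.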
If you're used to working in the tidyverse, this is a lot like the mutate function, which you've probably seen before. One problem with this, and it's really a small problem, is that I would prefer to have my samples column be the first column in the data frame, not the last. When I create a column like this, it gets put at the end rather than at the beginning. So how could we add a column to the very beginning? Well, there's a function called cbind, which binds columns together. I could do cbind with samples and then dist_df. What I notice now is that I have a column called samples as the first column of my data frame, which is pretty cool.

Let me do this a different way. I could instead give it rownames(dist_df) directly. This takes the values of the row names, which are all those sample names, and puts them first, followed by the rest of the columns of dist_df. I should also point out that with cbind it's important that the things we're binding together have the same number of rows. Otherwise, it doesn't really make sense to combine them. When I run this, I notice that my column name is now rownames(dist_df) rather than samples. So how could you set or change the column name? We could do samples = rownames(dist_df) inside cbind, and sure enough, we see that we now have samples as the name of that first column. I can name that first column anything I want. I could do cbind with sample5 = samples and then dist_df, and this gives me sample5 as that first column. Just because the variable I'm adding is called one thing doesn't mean I have to keep that name.

To clean this up a bit, I can go back to line 13, where dist_df is as.data.frame of dist_matrix, and follow it with a line that assigns to dist_df the result of cbind with samples = rownames(dist_df), bound to the rest of the values from dist_df. Running these two lines again and looking at dist_df, I now see I've got my data frame with samples first, but I still have those row names. So how can I get rid of them? I showed you this a few episodes back when I was talking about attributes and names and row names and column names. I can take rownames on dist_df and set that equal to NULL. By setting the names to NULL, I get rid of all of the names. Now if I look at dist_df, I no longer have those row names, just the default row numbers. I have my samples in their very own column, it's the first column of the data frame, and life is pretty good.

So, the moment we've all been waiting for: let's go ahead and run library(tidyverse), so we load all those great tools from the tidyverse. The package that we're going to be using today is tibble. Because we're making a tibble, we can call as_tibble on dist_matrix. What we get out looks a lot like what we've been seeing all along, and a lot like what we saw up here with dist_df. We have column names, we don't have row names, and then we have the distance matrix that we've grown to expect. We notice that the formatting of the numbers is a bit more compact. The zeros are zeros without any trailing values after a decimal point, and there are three significant digits for all the other values, whereas back up here we have something like six digits to the right of the decimal point. So it's a much more compact format. We see that it's a tibble that's 10 by 10, so 10 rows and 10 columns, as we'd expect. Below the column names, we also see the type of data in each of those columns. That's pretty convenient: at a glance, you can look at the data and see what's being represented. However, we lost all of our row names, and we'd like to get those back.
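Stepping back to the base R cleanup for a moment, the cbind and rownames steps can be sketched like this. The 4-sample matrix is a hypothetical stand-in, and the two cleanup lines mirror the ones described for the script rather than reproducing it exactly.

```r
# Hypothetical 4-sample distance matrix; the values are invented for illustration
labels <- c("F3D0", "F3D1", "F3D11", "F3D125")
dist_matrix <- matrix(
  c(0.00, 0.10, 0.20, 0.30,
    0.10, 0.00, 0.40, 0.50,
    0.20, 0.40, 0.00, 0.44,
    0.30, 0.50, 0.44, 0.00),
  nrow = 4,
  dimnames = list(labels, labels)
)

dist_df <- as.data.frame(dist_matrix)

# Without a name, the new first column is labeled with the expression itself
colnames(cbind(rownames(dist_df), dist_df))

# Naming the argument inside cbind sets the column name and puts it first
dist_df <- cbind(samples = rownames(dist_df), dist_df)

# Drop the sample names from the row names; R falls back to row numbers
rownames(dist_df) <- NULL

dist_df
```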
How do we get the row names back? Well, we can repeat as_tibble, and this time say rownames = "samples". This will put our row names into a new column called samples. Sure enough, we now see we have a column called samples, and we now have a 10 by 11 tibble, so 10 rows and 11 columns. The first column is of type character and holds the sample names. So it put the samples in the very first column for us, which is just wonderful. I'll go ahead and take this, put it up here in the script, and call it dist_tbl.

Let's go ahead and look at the structure of this with str on dist_tbl. We see it looks a lot like what we had for a data frame. In fact, right here we see data.frame, but it's a tibble, and it's 10 by 11. The class information in the output here is a little bit different than what we've seen before: tbl_df, tbl, and data.frame. The tbl is short for tibble, you get the idea. And again, very much like a data frame, it's a list of these 11 different vectors.

If we look at dist_tbl and dist_df, we see that the output looks very similar, but there are important differences. There are some little differences that I don't think really matter, like the character column in the tibble being left-justified, whereas down here in the data frame it's right-justified. Some other things that are nice to see are the dimensions of the tibble, which we don't see for the data frame, and the type of data in each column, which we also don't see for the data frame. Something that we do see for the data frame is that wide output wraps around onto multiple lines. To show that, let me go ahead and rerun all of this, but instead of the simple file, let's use the full mice Bray-Curtis file. This has all of the values that we had in our original distance matrix before I simplified things.
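The conversion to a tibble, with and without the row names kept, can be sketched as below. This assumes the tibble package is installed and uses a made-up 4-sample matrix in place of the episode's data.

```r
library(tibble)

# Hypothetical 4-sample distance matrix; the values are invented for illustration
labels <- c("F3D0", "F3D1", "F3D11", "F3D125")
dist_matrix <- matrix(
  c(0.00, 0.10, 0.20, 0.30,
    0.10, 0.00, 0.40, 0.50,
    0.20, 0.40, 0.00, 0.44,
    0.30, 0.50, 0.44, 0.00),
  nrow = 4,
  dimnames = list(labels, labels)
)

# Default conversion: column names survive, row names are dropped
no_names_tbl <- as_tibble(dist_matrix)
dim(no_names_tbl)

# rownames = "samples" moves the row names into a first column called samples
dist_tbl <- as_tibble(dist_matrix, rownames = "samples")
dist_tbl

class(dist_tbl)   # tbl_df, tbl, and data.frame
str(dist_tbl)     # still a list of equal-length vectors underneath
```

Because data.frame is one of the classes, functions written for data frames generally accept a tibble as well.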
So again, if I look at dist_tbl and dist_df, we're going to notice a few more differences. Scrolling back up here, you'll already notice one difference: with the tibble, we now have 348 by 349. It shows us the first 10 rows and nothing more. It says there are 338 more rows and 338 more variables, that is, 338 more columns that it couldn't fit on the screen. It is a much more compact layout that's easier to look at and read than what we see below with dist_df, which only shows us two rows, but shows us all of the columns for those rows, and you can't possibly look at all those columns. Then it tells us that it output as much as it's allowed to output and that it omitted the other 346 rows. I personally prefer the output of a tibble, where you get what you can see in one screen and nothing more.

If you want something more than what's shown in this first screen, there are a few other approaches that are probably much better than what we saw down below with dist_df. You could use the select function to get the columns you want. You could perhaps do mutates or a group_by. You could make the data tidy, refashioning the data into a format that's more pleasant to look at than 348 rows and 349 columns. But if you just wanted to get more than what fit on one screen, you could use print. We could do print on dist_tbl with width = 20, and this gives us output that's 20 characters wide. We could do, say, width = 200, and this gives us 200 characters of width. You can see it gets really funky. If you want everything, you could do width = Inf, and this gives us all of the columns, but only the first 10 rows of all of those columns. If you wanted more of the rows, you could add n equals, let's say, 20, and this gives us 20 rows. You could do n = Inf as well to get all 348 rows.
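The print options just mentioned can be tried on any tibble that is too big for one screen. This sketch assumes the tibble package is installed and builds a hypothetical 30-sample table with placeholder distances, since the 348-sample file from the episode is not reproduced here.

```r
library(tibble)

# Hypothetical 30-sample matrix: every off-diagonal distance is a placeholder 0.5
n_samples <- 30
labels <- paste0("S", seq_len(n_samples))
wide_matrix <- matrix(0.5, nrow = n_samples, ncol = n_samples,
                      dimnames = list(labels, labels))
diag(wide_matrix) <- 0

wide_tbl <- as_tibble(wide_matrix, rownames = "samples")

print(wide_tbl)                         # default: first 10 rows, columns that fit
print(wide_tbl, width = Inf)            # all columns, still the first 10 rows
print(wide_tbl, n = 20)                 # first 20 rows, columns that fit
print(wide_tbl, n = Inf)                # all rows, columns that fit
print(wide_tbl, n = Inf, width = Inf)   # everything
```

The n and width arguments only change what is displayed; the tibble itself is untouched.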
In that case we're getting the first 11 columns, I believe it was, or whatever fit on our screen. So again, I think the tibble is a much easier to work with and easier to look at version of the same data represented as a data frame. It's the same data being represented slightly differently, with a few more bells and whistles that make the tibble just a little bit more attractive to work with. So go ahead and practice with this. Look at the data frames that you have in your projects, see if you can think about how you would convert those data frames into tibbles, and consider what you think the advantages and disadvantages are. We'll see you next time for another episode of Code