Hey folks! If you've been paying attention to any of my episodes of Code Club, you'll know that the first line of basically every script we've written has been library(tidyverse). Well, today and for the next several episodes, I'm going to show you what happens when you leave the tidyverse. And that's because I have a special application that just won't work in the tidyverse. What is that application? Well, I've got data that is not rectangular. Rectangular data frames are very common across all different fields. But there are also a lot of other types of data out there that we need to learn how to work with, bring into R, and eventually get into a position where we can work with them within the tidyverse.

The application that I'm thinking of is something called a distance matrix. Distance matrices are fairly common within the field of ecology. You can think of one as representing the distance between any two entities, any two samples, even any two cities, right? You can imagine that there's a distance between Detroit and Chicago, a distance between Detroit and Nashville, and a distance between Nashville and New York City. And you can go on and on with these pairwise distances. Well, in ecology, we have similar types of distances, but they're not geographic. They measure how similar different communities are to each other. We don't really need to go down the road of describing how we calculate those distances. But the output that we frequently get is what's called a lower triangular distance matrix. Because the distance between Chicago and Detroit is the same as the distance between Detroit and Chicago, we generally only represent one of those distances in these matrices. So what I want to do is show you how we can read in that distance matrix. Now, we will take step one today.
But this is going to take us several episodes. Along the way, we are not going to be using the tidyverse, we are going to be using base R. And so we are not going to use the library function to load the tidyverse. I really want to expose you to elements of base R, because there are times like this when you need to know base R. Even when you're using the tidyverse, it really pays to know a lot of base R. So in each of the next episodes, I'm going to show you a different aspect of base R and how we can use it to solve this problem of reading in non-rectangular data like a distance matrix and getting it into a format that will play nicely with tools from the tidyverse.

We'll head over to RStudio here. I have created a repository up on GitHub where you can find all the data and code that I'll be generating. Down below in the description, I'll give you a link to a blog post where you can get the starting condition and the ending condition of the repository for each of these episodes. So go ahead and check that out. Even if you don't want to use GitHub, which is fine, you can find the raw data that I'm going to be working with there, that is, the distance matrices that we will be trying to read in.

The data in the mice Bray-Curtis .dist file was generated in a study that my lab published almost 10 years ago now, looking at the temporal variation in the gut microbiota of mice over the course of a year. I think we looked at maybe 15 or so different mice over that year and got lots of different data. So we now have this distance matrix that tells us the distance between all time points and all mice across the study. And that's what I want to read in here. Also, I'll put a link to that study down below in the description if you want to read up on what we found.
Anyway, the mice simple Bray-Curtis .dist file shows the first 10 samples from that larger distance matrix. If I click on that and open up this file, you will see the lower triangular format of this distance matrix. Again, this is the PHYLIP format of a distance matrix. The first line, this 10, indicates the number of samples or entities that are being compared. We then have the names of those 10 samples in the first column. And then we have the distances, right? F3D0 doesn't have any distances on its line. That's because F3D0 against F3D1, which is the next one, would be 0.392, and that value would sit up here, kind of where I've got my cursor. Instead, it appears down here, where the distance between F3D1 and F3D0 is 0.392. So we see we have this lower triangle matrix. And again, this works because the full matrix is symmetric, right? If I looked at the upper triangle, it'd be the transpose of this lower triangle.

Anyway, it should be obvious that this is not rectangular data. If I tried to read this in with the tidyverse, I could go ahead and load the tidyverse. And then I could do something like read_tsv, because this is a tab-separated values file, and give it the mice simple Bray-Curtis .dist file. This reads in. And of course, it's gagging because it doesn't know what to make of things, right? It thinks that the column name is 10. And then it's seeing that there are tabs on all these other lines and thinking that the tab should be the delimiter to separate out the different columns. We could try to do some tricks, like skip = 1 to skip that 10. But now it thinks F3D0 is a column name. And really, there's no way to do this. So I don't want you to think that I am shunning the tidyverse for the sake of shunning the tidyverse. This is really an application where we need to use base R.
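To make the "not rectangular" point concrete, here is a minimal sketch in base R. It does not use the real file from the study: it writes a made-up four-sample file in the same lower triangular layout (the sample names and every distance except the 0.392 mentioned above are invented for illustration) and counts the tab-separated fields on each line.

```r
# Toy lower triangular file in the PHYLIP-style layout described above.
# Assumption: sample names and distances here are made up for illustration.
toy_lines <- c(
  "4",
  "F3D0",
  "F3D1\t0.392",
  "F3D2\t0.500\t0.610",
  "F3D3\t0.550\t0.720\t0.200"
)
toy_file <- tempfile(fileext = ".dist")
writeLines(toy_lines, toy_file)

# Count the tab-separated fields on each line. A rectangular file would
# have the same number of fields on every line; this one does not.
fields_per_line <- count.fields(toy_file, sep = "\t")
print(fields_per_line)
```

Each line has one more field than the line before it (apart from the leading count), which is why a reader that expects a fixed set of columns struggles with this layout.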
Also, I know, I know, I know there are other packages out there that will help you to read in a lower triangle matrix. Yes, you could go out and get one of those packages and use it. You could skip the next seven or eight episodes. But what would you learn about base R? Come on. So we're going to ignore those other packages. And we are going to use base R because it is the appropriate tool for this type of problem.

So I have created an R script called read LT matrix dot R. And what we are going to use to read in this kind of oddly shaped data is a function called scan. Scan is a function that again is in base R. And we can give it the name of our file. So I'm going to give it the mice simple Bray-Curtis .dist file. And again, I'm using this simple version because it's easier to work through the issues with a smaller file than with a file that has close to 400 different rows in it. So we'll work with the smaller version of the data set. And when we get to the end of the process of building up a function to read in a distance matrix, we'll then apply it to the more complete distance matrix.

So again, we've got scan. And scan gives us an error. Scan will read in the entire contents of a file. And by default, it'll delimit it by white space, right? So all those tabs and line breaks, it'll use those to separate apart the data. What I mean by that will make a little more sense as we go along. But by default it expects that the data is a double, a numerical value. And it's finding things in there, like the sample names we saw in the first column, like F3D0, and it's saying, that's not a number. So what we need to do instead is give it an argument, which would be what = character(). We give it character as a function to tell scan that what it's reading in is character data. We no longer get that error message, and it read in 56 items, which I think makes sense, right?
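The default-versus-character behavior of scan can be sketched like this. This is an illustration on a made-up four-sample file rather than the real data; the file contents and the variable names are assumptions.

```r
# Toy lower triangular file; names and distances are made up for illustration.
toy_file <- tempfile(fileext = ".dist")
writeLines(
  c("4",
    "F3D0",
    "F3D1\t0.392",
    "F3D2\t0.500\t0.610",
    "F3D3\t0.550\t0.720\t0.200"),
  toy_file
)

# With the default what = double(), scan stops with an error as soon as it
# reaches a sample name. tryCatch captures the message so the script keeps going.
default_result <- tryCatch(
  scan(toy_file),
  error = function(e) conditionMessage(e)
)
print(default_result)

# Telling scan to expect character data reads every whitespace-separated item.
as_text <- scan(toy_file, what = character())
print(as_text)
```

Everything comes back as a character string, including the leading count and the distances, so the numbers will need converting later on.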
So if we think about having 10 rows times 10 minus one columns, divided by two to get the lower triangle, that gives us 45 distances. Plus the 10 names, that's 55. Plus the one value on the first line, the 10, which makes 56. So it read in everything we'd expect it to read in.

Scan is reading the data in as a vector. And in a vector, all of the data in that object are of the same type. So everything is either all character or all numeric or all logical; you can't have mixed types. And so it's forcing everything to be of the same type. That's why we had to put in what = character().

One of the other things that I don't really like about this function is that it tells me how many things were read in. I kind of like my output to be simple, and I don't really like getting this red text back. So something I could do would be quiet = TRUE. That tells scan to be quiet. So we read that in and we no longer get that statement saying how many things it read in.

By default, scan is separating all the values in our file by white space. Perhaps we want to be a little bit more specific, right? So we could do sep = " ", a space. And then this will try to separate everything by a space. I think it's also using the line break as a separator, because it's not separating things within a line, right? If I wanted to separate everything on a tab, we could do backslash t. Now it separates everything into its own cell by splitting on a tab or on a line break. And if I wanted to separate everything by a line break, I could do backslash n. And again, this gives us basically what we had before when we gave it the space. In this case, each line is a different element of the vector. So the first element in the vector is 10, the second is F3D0, and so forth, right?
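Here is a short sketch of the arithmetic and of the separator options just described. The count formula follows directly from the layout; the toy file, its values, and the variable names are assumptions made for illustration.

```r
# Expected number of items in a 10-sample lower triangular file.
n_samples <- 10
n_distances <- n_samples * (n_samples - 1) / 2  # 45 pairwise distances
n_items <- n_distances + n_samples + 1          # plus 10 names and the leading count
print(n_items)

# Toy four-sample file; names and distances are made up for illustration.
toy_file <- tempfile(fileext = ".dist")
writeLines(
  c("4",
    "F3D0",
    "F3D1\t0.392",
    "F3D2\t0.500\t0.610",
    "F3D3\t0.550\t0.720\t0.200"),
  toy_file
)

# quiet = TRUE suppresses the "Read N items" message.
# sep = "\t" splits on tabs and line breaks: one element per value.
by_tab <- scan(toy_file, what = character(), sep = "\t", quiet = TRUE)

# sep = "\n" splits on line breaks only: one element per line.
by_line <- scan(toy_file, what = character(), sep = "\n", quiet = TRUE)

# sep = " " finds no spaces in this file, so it also yields one element per line.
by_space <- scan(toy_file, what = character(), sep = " ", quiet = TRUE)

print(length(by_tab))
print(length(by_line))
print(length(by_space))
```

For the toy file, the same formula gives 4 * 3 / 2 + 4 + 1 = 11 items, which is what the tab-separated read should return.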
What I'm going to use is the backslash t, which on this file gives us the same result as the default behavior, where it's separating on the tabs and the line breaks that it comes across. I'm going to go ahead and save this as a variable called distances. And let's go ahead and put each argument on its own line so it's easier to see what's going on. We'll run that and then look at distances.

This distances vector is what we are going to be working with in future episodes to try to extract out the distance matrix that we had in that file. But again, we need to convert what was in the file into something that R can understand and that we can ultimately convert into something that will work with tools from the tidyverse, so that we could eventually do some type of visualization, right? Say I wanted to look at the day-to-day differences of a given mouse. Well, I need to get the data into a different format so that we can make that type of visualization. So we're going to take this vector and we're going to do some manipulations on it. But before we can do that, we need to learn how to create vectors and how to access values from vectors, so that we can convert this vector into a distance matrix that we can work with here in R.

So that you don't miss any of the future episodes, please make sure that you subscribe to the channel. I'll go ahead and put a link to the rest of the playlist up here. Be sure to check that out so you can see the full progression of this story. I think this can be a really exciting dive into base R.
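Put together, the call with one argument per line might look like the sketch below. It runs against a made-up four-sample file, not the real one from the repository, so the path, the sample names, and the distances are all assumptions; only the scan arguments come from the walkthrough above.

```r
# Toy four-sample file; names and distances are made up for illustration.
toy_file <- tempfile(fileext = ".dist")
writeLines(
  c("4",
    "F3D0",
    "F3D1\t0.392",
    "F3D2\t0.500\t0.610",
    "F3D3\t0.550\t0.720\t0.200"),
  toy_file
)

# One argument per line makes it easier to see what each one does.
distances <- scan(
  file = toy_file,       # stand-in for the real .dist file
  what = character(),    # read everything as character data
  sep = "\t",            # split on tabs (and line breaks)
  quiet = TRUE           # do not report how many items were read
)

# The first element is the number of samples, stored as a string.
n_samples <- as.integer(distances[1])

# Sanity check: leading count + names + lower triangle distances.
expected_length <- 1 + n_samples + n_samples * (n_samples - 1) / 2
print(length(distances) == expected_length)
```

The length check is a cheap way to confirm that the whole file came in before doing any further manipulation of the vector.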