If you watched the last episode, you know we started with about 75 lines of code, most of which was the same four or five lines repeated 10 times. We replaced that with a for loop. It worked efficiently, it got the answer that we were hoping for, and we were able to generalize it to other distance matrices. Well, in this episode, I want to go about removing that for loop and seeing if we can do it a different way. Perhaps it might be a little bit more efficient, but ultimately the goal isn't to do it a different way just for the sake of it, but to help us learn to program better and to learn more about programming in R. That's the point of the current series, where we are trying to read a lower triangular distance matrix, also called a PHYLIP-formatted distance matrix, into R. This is giving us an opportunity to really dig into the nitty-gritty of functionality within base R. Sure, we've got the tidyverse and a lot of great packages out there that make using R easy, but I find that there are these nooks and crannies of R, things like vectors, matrices, and for loops. Today, we'll learn about lists, how we can work with them, and how we can apply functions over multiple elements of a list. And you know what? Lists show up all over the place, you just don't realize it. So let's head over to RStudio and we'll get going on today's episode, where we try to remove a for loop from a perfectly good chunk of code.
So as I mentioned, this is all base R. There are no packages being loaded in here, no tidyverse, nothing else going on. We read in this lower triangular distance matrix, and it comes out as a vector. In the last episode, we went ahead and used the full distance matrix with 348 samples. I'm going to go ahead and turn this back to the simple one.
The smaller file makes it easier to work through issues and understand what's going on than getting the output for all 348 samples. From the vector that is produced by scan, we looked at the number of samples and removed that from the vector. As you recall, we then chomped through the vector, and that's really what's going on in all of these different for loops. What we find is that distances is a vector that's got 56 different values in it. We've done the accounting before, and that made sense. But anyway, when we ran scan, we separated all the values by a tab, right? If you think about this distance matrix, anywhere there's a tab or a line break, that demarcated where a different value of the vector was created. And so there are 56 values here. Our vector, as we saw down here, has 56 different values in it.
Again, I want to get away from what we have here with the for loop. So I'm going to go ahead and comment out all of that code, because I don't want to use it. Perhaps I will use it for inspiration. I'm going to start back at the beginning, and instead of separating on tabs, I'm going to separate on line breaks. Each line of my distance matrix is now a different value in my vector. The reason we can't read these data in using, say, read_tsv is that there's a different number of values in each row. Again, the data are not rectangular, they're triangular, right? Lower triangular. So a thought that I had was, can we treat each line, each seat, each value of this vector separately and pull it apart by the tab? Again, we're working on each line separately, and we want to separate by the tab. Then perhaps we add values so that all of the lines have the same number of distances in them. That's what I'm going to do. That might not totally make sense yet, and we haven't talked about the tools that we need to achieve it.
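As a minimal sketch of reading the file one line per value, the code below inlines a hypothetical four-sample matrix instead of reading the real file. The sample names, the distances, and the underscore spellings of the variable names are assumptions made for illustration, not taken from the episode's data.

```r
# Hypothetical 4-sample lower triangular matrix; the first line is the sample count.
lt_text <- paste(
  "4",
  "sampleA",
  "sampleB\t0.2",
  "sampleC\t0.5\t0.3",
  "sampleD\t0.7\t0.6\t0.4",
  sep = "\n"
)

# sep = "\n" makes each line of the input one value of the character vector.
# With a real file, the first argument would be the file path instead of text =.
file <- scan(text = lt_text, what = character(), sep = "\n", quiet = TRUE)

n_samples <- as.numeric(file[1])  # first value is the number of samples
distances <- file[-1]             # remaining values: one line per sample

length(distances)
distances[3]
```

Each value of distances still contains its tabs, which is what lets the next step split every line separately.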
And so that's what I'm going to go through with you today. We can use a function called strsplit. So strsplit will take, say, distances, and let's give it row four, or value four. I think of these as rows, because I have one value per line here on my screen, but it's really seat four. It's a vector, not a data frame. We can then separate by a tab character. The output we get, as we see, is value four for F3D125, along with its three distances. Again, what I'm thinking is, it'd be really cool to then add on seven additional distances, all zeros or all blank characters, so that every value in our vector has the same amount of data.
Now, something that you might notice is a little bit fishy with this output is that we have this double square bracket around a one. That is the telltale sign that you are working with a list. So strsplit outputs the splits as a list. Each value in the vector becomes a different element in the list, and each element in the list is the vector of split values. To have that make a little bit more sense, let's go ahead and run strsplit on all of the distances, splitting each string on a tab. What we then see is that, again, each value of the vector is a different element in the list. Within that element of the list, we have a vector which has the sample name as well as the different distances that correspond to it. So I'm going to go ahead and assign the output of this strsplit to file_split. And you know what, I'm going to go ahead and rename the read-in file as file, and we'll update that everywhere else. Let's run this and make sure it all works. Now if we look at file_split, we see that we again have 10 elements in this list, and it looks just like what we had up above.
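To make the list output of strsplit concrete, here is a small sketch on hypothetical data. The sample names and distances are invented for illustration, and the underscore in file_split is an assumed spelling.

```r
# Hypothetical per-line values, as produced by scanning with sep = "\n".
distances <- c(
  "sampleA",
  "sampleB\t0.2",
  "sampleC\t0.5\t0.3",
  "sampleD\t0.7\t0.6\t0.4"
)

# Splitting a single value still returns a list (here, a list of length one).
one_row <- strsplit(distances[4], split = "\t")
one_row

# Splitting every value returns one list element per input string;
# each element is a character vector: sample name, then its distances.
file_split <- strsplit(distances, split = "\t")
length(file_split)
file_split[[3]]
```

The printed output of one_row starts with [[1]], the double-bracket marker that signals a list.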
So we've spent a fair amount of time in recent episodes talking about ways to get values out of a vector or out of a matrix. I'm going to spend just a brief moment talking about how to get values out of a list. If we do file_split with a single square bracket and a one, we get the first element of the list. If we did, say, four, we would get the fourth sample, F3D125. But the output of this is still a list, right? It's a list with one element in it. If we had done, say, four colon five, we would get two elements out of it. Well, if we do file_split with two square brackets and, say, two, we no longer see that double bracket one, or in this case double bracket two, that tells us it's outputting a list. So with single brackets you get back a list, and with double brackets you get back a vector.
Again, if we look at file_split, the names of the different elements are numerical, right? They're not names, they're numerical indices, like we saw when we were defining vectors without names. Well, I can assign names to the different elements in my list. This is a bit silly, so bear with me. I can call names on file_split, and those are the names for the list seats, so to speak, in file_split. Now, I don't have any names, but I can assign them. Let's try with letters, and we'll take one through n_samples. Now if I call names on file_split, I get those letters of the alphabet. And if I look at file_split, I see that instead of the double bracket one, I get a dollar sign a, and a dollar sign b for seat two. If I do file_split double bracket three, I still get that third row. But another way I could have done that would have been file_split dollar sign c. So that is the way to get a named element out of a list: you can use a dollar sign and the name.
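The indexing rules just described can be sketched as follows, again on a hypothetical four-sample list (names and values are illustrative assumptions).

```r
# Hypothetical list shaped like the output of strsplit().
file_split <- list(
  c("sampleA"),
  c("sampleB", "0.2"),
  c("sampleC", "0.5", "0.3"),
  c("sampleD", "0.7", "0.6", "0.4")
)

single <- file_split[4]    # single brackets: a list holding one element
pair   <- file_split[3:4]  # single brackets with a range: a list of two elements
vec    <- file_split[[2]]  # double brackets: the character vector itself

# Assign letters as names, one per sample, then pull an element by name.
n_samples <- length(file_split)
names(file_split) <- letters[1:n_samples]
names(file_split)
file_split$c               # same element as file_split[[3]]
```

Naming by letter is only a demonstration; the rest of the walkthrough goes back to the unnamed, numerically indexed list.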
Now, some of you may be familiar with working with data frames and know that you can get a column out of a data frame using a dollar sign. It's the exact same thing. A data frame is actually a list of vectors, where each column is a vector and those columns are linked together as a list. For now, I'm going to go back to what we had up here with file_split from strsplit, with the different elements of the list numbered consecutively.
Each element of our list has a different number of entries in it. The first element has one in it, and I believe element 10 has 10 in it, which again includes the name of the sample for sample 10. I'd like to create a function that will tack on extra values so that everything has 10 elements in it. Actually, I would like them all to have 11 elements, because the last one doesn't have a zero for the final column. So let's start by creating a function that we can test, and then see how we can apply it across every element of our list. We'll create a function called fill_in. I think we've talked about this syntax in previous episodes: fill_in is the name of the function, just like strsplit is the name of that function. We have the function keyword, which takes the arguments that we're going to give our function. Then we use the curly braces, like we saw in the last episode for the for loop, to define what happens within this function. The arguments are going to be x, which is the vector that we're feeding it, and length, which is the length that we want to pad out to. Let me show you a little bit of my thinking here. If I do file_split double bracket six, I'm going to get that vector back. I can wrap that as the argument to the length function, right? So that's six.
So if I want to get out to 11 spots, that is going to be 11, or let's do this, n_samples plus one, because I have the sample name that I want to keep, minus the length of the vector. That means I need to add five extra spots. This is getting a little bit complicated, but bear with me. What we can do is make a vector with the c function that starts with file_split double bracket six. Then I want to use rep to create a vector that repeats the same empty character, so double quotes, and the number of repeats is going to be this difference. Now when we run that, we get 11 seats in our vector. We've got the new ones at seven, eight, nine, 10, and 11. So that's cool, right? That works. This is going to be my special sauce. That's going to be what my fill_in function does. Now, we are going to take length, which is what we pass in here, the number of samples plus one, and x is this vector. So we'll place that in the body of fill_in. To double check that it works, we can call fill_in on file_split double bracket six, with a length of 11. And sure enough, that does exactly what we want.
We could always make this a for loop, right? But I don't want it to be a for loop, because we've already done a for loop, and my goal here is to stop using one. So what we're going to do is use a function called lapply. This is a base R function that allows us to take a function and apply it over the elements of a list. There's a variety of other apply functions out there that allow you to do the same type of thing, perhaps to apply something over different values of a vector, or over different columns or rows of a data frame. This is very similar to what you might be familiar with from the map functions in the purrr package.
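A minimal sketch of the padding function described above, assuming the underscore spelling fill_in and a hypothetical three-value input vector:

```r
# Pad a character vector x with empty strings until it has `length` values.
# Inside the function, `length` is the numeric argument, while length(x)
# still calls base R's length() function.
fill_in <- function(x, length) {
  c(x, rep("", length - length(x)))
}

# Hypothetical test vector: a sample name followed by two distances.
test_row <- c("sampleC", "0.5", "0.3")

padded <- fill_in(test_row, length = 5)
padded
length(padded)
```

A vector that is already at the target length comes back unchanged, because rep("", 0) contributes nothing.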
Now, the purrr package makes things a lot easier, and perhaps a little bit more comfortable to work with within the tidyverse. But again, we're doing this all in base R, so I want to show you how to use the lapply function. What we can do with lapply is apply the fill_in function over all elements of file_split. So we'll take lapply, give it file_split, and then give it fill_in. When we run that, we get an error that the argument length is missing. That is because fill_in needs a value for length. We can give fill_in that argument within lapply by saying length equals n_samples plus one. We used 11 up here on line 18 because we had 10 samples plus a column for the name of the sample. So now when we run this, what we see is that we get 10 elements within our list, each of those being a vector with 11 values. These now all have the same length. I'll go ahead and call this filled. That's great. We'll also go ahead and remove this test code here.
There are a couple of different approaches that we might use to concatenate those elements together to form a matrix. My go-to instinct would be to do something like rbind on filled. So rbind binds things together as rows. What we'll find is that this doesn't work. It basically makes each list element a separate column. I'm not totally sure why that is, but that was my instinct. To get this to work, what we can do is use a function called do.call, and give it rbind and filled. This is kind of like lapply, except that do.call calls rbind once with each element of filled as a separate argument, rather than on filled all together. And what we get, sure enough, is a matrix like we'd expect, where the first column is the sample name, followed by the 10 by 10 distance matrix. So that's one approach we could use.
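The lapply and do.call steps can be sketched like this on hypothetical four-sample data. The data and the underscore spellings are assumptions. The sketch also shows what plain rbind does: it receives the whole list as a single argument, so it returns a one-row matrix whose cells are the list elements.

```r
# Hypothetical split lines and padding function, as in the earlier steps.
file_split <- list(
  c("sampleA"),
  c("sampleB", "0.2"),
  c("sampleC", "0.5", "0.3"),
  c("sampleD", "0.7", "0.6", "0.4")
)
n_samples <- length(file_split)

fill_in <- function(x, length) {
  c(x, rep("", length - length(x)))
}

# Extra named arguments to lapply() are passed on to fill_in().
filled <- lapply(file_split, fill_in, length = n_samples + 1)
lengths(filled)  # every element now has n_samples + 1 values

# rbind(filled) sees one argument (the list) and yields a 1-row list-matrix.
wrong <- rbind(filled)
dim(wrong)

# do.call() calls rbind once, passing each list element as its own argument.
samples_distance_matrix <- do.call(rbind, filled)
dim(samples_distance_matrix)
```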
I think that's a lot simpler syntactically. But do.call is a function that I only ever use in this situation, so I have to remember it for just a pretty esoteric case. Alternatively, we could leverage some tools that we learned previously in the episode on matrices. To do that, we can call unlist on filled. What that does is unlist the data. It takes those 10 elements of our list and concatenates them all together to make a really long vector of 110 values. We can then use that as input to the matrix function, with nrow = 10 and ncol = 11, and then byrow = TRUE. And we get back the same thing. Instead of nrow = 10 and ncol = 11, I could of course use n_samples, so I'm not hard-coding the number of samples in there. Both of these approaches work perfectly fine. I think I'm going to run with do.call, because that's a little bit more in the spirit of working with lists. But know that both approaches work.
I'll call the result samples_distance_matrix. Now if we look at samples_distance_matrix, we again get that output. I can then say samples is samples_distance_matrix with comma one in the brackets, because it's the first column. And I'll define dist_matrix as samples_distance_matrix with comma minus one, so that's going to be everything but the first column. Now if I look at dist_matrix, I get my distance matrix. It's 10 rows by 10 columns, and it doesn't have the samples. And if I look at samples, I see those 10 sample names. What I'm working towards is taking this dist_matrix object and making it a numeric distance matrix. If I ran as.numeric on this right now, these double quotes would become NA values. So it occurs to me that instead of repping double quotes, I should have repped a zero. We can come back up here to fix that.
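As a sketch of the unlist alternative and the column split, on hypothetical four-sample data already padded with empty strings (data and underscore names are assumptions):

```r
# Hypothetical padded list, as produced by lapply() with fill_in().
filled <- list(
  c("sampleA", "",    "",    "",    ""),
  c("sampleB", "0.2", "",    "",    ""),
  c("sampleC", "0.5", "0.3", "",    ""),
  c("sampleD", "0.7", "0.6", "0.4", "")
)
n_samples <- length(filled)

# unlist() concatenates the elements into one long vector;
# byrow = TRUE lays it back out one sample per row.
flat <- unlist(filled)
samples_distance_matrix <- matrix(flat, nrow = n_samples,
                                  ncol = n_samples + 1, byrow = TRUE)

# Same values as the do.call(rbind, ...) approach.
same_as_rbind <- all(samples_distance_matrix == do.call(rbind, filled))

samples     <- samples_distance_matrix[, 1]   # first column: sample names
dist_matrix <- samples_distance_matrix[, -1]  # everything but the first column

# Empty strings cannot be parsed as numbers, so they turn into NA.
has_na <- any(is.na(as.numeric(dist_matrix)))
```

The NA values are the reason for padding with "0" rather than "" before converting to numeric.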
And inside of the rep function, we can put our zero. We can rerun all this good stuff, because we're saving our code and making things reproducible. We can look at dist_matrix, and now we see we've got those zero values. Now, if I want to make this numeric, I can call as.numeric on dist_matrix. I get numerical data, but it's a vector, right? It has completely removed all of the structure that I had up above from the matrix. To reform this as a matrix, we can of course feed it into the matrix function, as we discussed earlier, with nrow equal to n_samples. Now we see that we have a lower triangular distance matrix, which is great. I'll call this dist_matrix, basically writing back over dist_matrix, which is fine.
But we want to get a full square matrix, not only the lower triangle. To get the upper triangle, we can use the t function, which is the transpose. If we call t on dist_matrix, we get the upper triangle. So what would happen if we added dist_matrix and the transpose of dist_matrix? Sure enough, we get the square distance matrix. I can then save this back over my dist_matrix. And now I have read in and formatted my distance matrix without a single for loop. It did require us to learn a little bit about lists and to parse things apart a little bit differently than what we saw before. But I think this was a fun challenge, seeing how we could read in a lower triangular distance matrix without using a for loop. And yes, it can be done. Is it necessary? No, it's not necessary. We had a perfectly good result before with the for loop. But trying to do things differently, or imposing constraints on yourself, is a great way to stretch your mental muscles and to learn different parts of the programming language. If you always do things the same way with the same strategy, then when you need a different strategy, you won't have those skills.
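Pulling the steps of this episode together, the whole loop-free pipeline can be sketched as below. The four-sample data are hypothetical, the underscore variable names are assumed spellings, and this is an illustrative reconstruction rather than the exact code from the screen.

```r
# Hypothetical per-line values (sample count line already removed).
distances <- c(
  "sampleA",
  "sampleB\t0.2",
  "sampleC\t0.5\t0.3",
  "sampleD\t0.7\t0.6\t0.4"
)
n_samples <- length(distances)

# Pad with "0" (not "") so the values survive as.numeric().
fill_in <- function(x, length) {
  c(x, rep("0", length - length(x)))
}

file_split <- strsplit(distances, split = "\t")
filled <- lapply(file_split, fill_in, length = n_samples + 1)
samples_distance_matrix <- do.call(rbind, filled)

samples     <- samples_distance_matrix[, 1]
dist_matrix <- samples_distance_matrix[, -1]

# as.numeric() drops the matrix structure, so rebuild it with matrix().
# Both steps work column by column, so the layout is preserved.
dist_matrix <- matrix(as.numeric(dist_matrix), nrow = n_samples)

# Lower triangle + its transpose (the upper triangle) = full square matrix.
dist_matrix <- dist_matrix + t(dist_matrix)
dist_matrix
```

Adding the transpose works because the diagonal and the padded upper triangle are all zeros, so no distance is counted twice.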
And so again, I find it really useful as an exercise to impose these types of constraints on myself and then see if I can solve the problem a different way. And who knows, you might learn something, right? So again, we have done this completely in base R without a lick of dplyr or any other package. I'm pretty proud of that, and I think that's pretty cool. We are learning a lot about base R as we go through this. The next episode actually won't be released until early January. I'm going to take the week off to spend it with my family as we celebrate Christmas. I really wish you the best of the season, whatever holiday you're celebrating, hopefully with your family and friends. And I really look forward to digging into more R and reproducible research practices in the coming year. So have a great New Year's and we'll see you in early 2022.