Hey folks, we're in the process of learning more about base R so that we can use those skills to enhance what we do with packages and functions from the tidyverse. We are reading in a non-rectangular data type. Specifically, we're reading in a lower triangular distance matrix, often called a PHYLIP-formatted distance matrix. We can't read this distance matrix in with something like read_tsv or read_csv from the readr package, which automatically import things as tibbles or data frames. Because it's got this odd shape and it's not rectangular, we have to learn some base R. And that's all good, because learning base R will certainly enhance our skills across the R programming environment.
Now, in the last episode we learned about matrices, and we saw how we could eat through a vector and convert that vector into a matrix. We had a couple of problems. Problem one is that it was effectively hard coded that the matrix we were reading in had data for 10 samples, and we read it in using the same chunk of code repeated multiple times. So if we wanted to read in a matrix that had, say, 300 samples, I'd have to repeat that code 290 more times. And that would be really painful. It's what's called not DRY. DRY is an acronym for "don't repeat yourself." If you will, I guess this is WET code, right? We're repeating the same chunk over and over with subtle differences.
So in this episode, I'm going to show you an approach that we can use to dry out our code. We're going to do that by recognizing patterns in the code and using a for loop. I know you've probably heard that for loops are just the worst possible thing ever in R and that you should work to remove all possible for loops. And while that's certainly true in some cases, in this case it's actually not that big of a problem.
So I'm going to show you how we can write a for loop and how we can use that for loop to dry out our code and make it a lot simpler. A for loop has a fairly simple syntax. We're going to say for (variable in vector). And then we are going to use curly braces. The curly braces define the functionality that's going to get repeated over all of the values of your vector. You might think of this as your special sauce to be repeated. Again, it's going to repeat what's in the braces over all of the values of your vector. Let me go ahead and comment this out, because this is kind of a template for what we're going to do.
So I could say for (x in 1:100), right? So for my variable x in my vector 1:100. That vector is one, two, three, four, all the way up to 100. I'm then going to use those curly braces. Those curly braces are to the right of the P on your keyboard. If you're like me, you probably never use those curly braces except for programming. But curly braces are helpful for defining a body of code, really helpful within a for loop and with functions in R. I can then do print(x). And so print is going to print the value of x for every element from one to 100. Now when I run this, I see as the output that it is printing every value of the vector 1:100. It's printing x, and x is iterating from one to 100.
I could then modify x and say x^2. If I run my for loop again, now I get the squares of all the values from one to 100, going 1, 4, 9, which are the squares of 1, 2, 3, and finally getting down to 10,000, which is the square of 100. So this is a for loop. This is a very simple for loop. And again, what I want to highlight is that for is a function, right? And its argument requires x and some type of vector.
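If you'd like to see that first loop typed out, here is a minimal sketch of roughly what's on the screen. The squares vector is an addition that isn't in the episode; it's only there so the results can be inspected after the loop finishes.

```r
# Template: for (variable in vector) { special sauce to be repeated }

# 'squares' is a hypothetical helper vector, added here for illustration only
squares <- numeric(100)

for (x in 1:100) {
  squares[x] <- x^2   # store the square so we can look at it afterwards
  print(x^2)          # prints 1, 4, 9, and so on up to 10000
}
```

Swapping print(x^2) back to print(x) gives the very first version of the loop, which just prints 1 through 100.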
And so we're taking x as the variable that's used within the body of the loop, and we're iterating it, or looping it, over all of the values of the vector. Another thing that perhaps you could do is for (n in names). Perhaps I have a vector of names, and I could have c("Pat", "Mary", "Jose", "Doug"). So I've got that vector of names. I could similarly do print(n), and this should then print all of the names in the names vector. So again, the vector that we're iterating over could be numerical, it could be character, it could be anything. And what's really cool is that n then takes on each subsequent value of the vector we're giving it, and it takes on these values in the order that they are in the vector.
Back in our project, we have this read LT matrix R script. If you want to get this code and you're following along, be sure to go down below. I have a link in the description to a blog post where you can get the GitHub repository and the code that I'm working with here in this project, so you can work along with me. What we're doing is reading in the distances using the scan function. And again, as we saw before, distances is this long vector that has all of the values that were in our mice simple Bray-Curtis distance matrix, as you see here. So it's clearly not rectangular. And you can kind of see that distances holds all these values. Again, that was read in using the scan function a few episodes back. So now we have distances. The first thing we get out is the number of samples, and we then remove that value from our distances vector. And then in the last episode, we saw how we could make the matrix and a vector for our different samples.
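Here is a hedged sketch of that setup. So that it runs on its own, it scans a tiny made-up three-sample matrix from a text string instead of the real file from the repository. The sample names A, B, C and their distances are invented, and the variable names (distances, n_samples, samples, dist_matrix) are my best reading of what's said in the episode.

```r
# A made-up 3-sample lower triangular (PHYLIP-style) matrix; not the real data
raw_text <- "3
A
B 0.1
C 0.2 0.3"

# scan() splits on whitespace and returns one long character vector
distances <- scan(text = raw_text, what = character(), quiet = TRUE)

# The first value is the number of samples; pull it out, then drop it
n_samples <- as.numeric(distances[1])
distances <- distances[-1]

# Set up empty containers that will be filled in later
samples <- rep("", n_samples)
dist_matrix <- matrix(0, nrow = n_samples, ncol = n_samples)
```

At this point distances holds "A", "B", "0.1", "C", "0.2", "0.3", and both containers are the right size but still empty.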
And then we saw how we could basically step through this distances vector, populating a matrix. In this first step here, as we're going through distances, we read in the sample name, and we removed that sample name from the distances vector. Then we came in and got the second name and the distance, and we took that distance and plugged it into dist_matrix. And we kept iterating through each row, or each sample, as we went through. If we look here, say at samples[5], we're reading in the fifth sample name and removing it from the distances vector. Then for the fifth row of the distance matrix we're looking at columns one through four, because again, it's lower triangular, and things above the diagonal of the matrix are the same as the values below it. So the first four columns of the fifth row come from the next four values in the distances vector. We then remove those four values from the distances vector to create the new distances vector that's then used for the sixth sample.
So we see a bit of a pattern here, right? If I created a variable i, and let's say the value was six, what if I were to write this expression that I have down here in terms of i? What would it look like? Well, let's go ahead and bring this up. I'm going to make it samples[i], so that i is six. These ones and minus ones on distances will stay the same. Here, I'm going to go to the ith row, the sixth row, and I'm going to go from 1 to i minus 1. So not five, but i minus one. And we will then use the same kind of approach to pull those distances out of the distances vector, so we'll do 1 to i minus 1. And then we'll remove those values from the distances vector (that probably wasn't a very well-named vector, oh well), from 1 to i minus 1. So let's go ahead and run everything through this step.
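Written in terms of i, one step of that pattern might look like the sketch below. It uses a made-up three-sample example (names A, B, C are invented) that starts right after the first sample name has been read, so i is 2 here instead of 6. The as.numeric call is my addition so that the matrix stays numeric.

```r
# State after sample 1 has been read (made-up data, for illustration only)
distances <- c("B", "0.1", "C", "0.2", "0.3")
samples <- c("A", "", "")
dist_matrix <- matrix(0, nrow = 3, ncol = 3)

i <- 2

samples[i] <- distances[1]     # next value is the ith sample name
distances <- distances[-1]     # drop the name we just used

# Row i gets its first i - 1 columns from the next i - 1 values.
# The parentheses matter: 1:i-1 is (1:i) - 1, which is not the same as 1:(i-1)
dist_matrix[i, 1:(i - 1)] <- as.numeric(distances[1:(i - 1)])

distances <- distances[-(1:(i - 1))]   # drop the values we just used
```

After this runs, distances starts with "C", so it's ready for i of 3, and nothing in those four lines has to change except i.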
And so what we should expect is that if I look at samples, I should have six sample names in here. And sure enough, there they are. If I look at dist_matrix, I should have the first six rows and five columns populated, which sure enough I do. And if I look at distances, I now see I'm ready for the seventh sample. So a way to think about this might be that I could take the same chunk and just basically copy it down. So that's now seven. And then I can do eight. And then nine. And I'll do 10. What you'll notice is that I'm not changing anything in these four lines of code. All I'm changing is the i. So now if I rerun everything and look at samples, I have all 10 sample names. If I look at dist_matrix, I see the distance matrix that I got before with the previous version of the code. And if I look at distances, I see I have an empty character vector, which we had before.
This is better, but it's still not DRY. What we can see is that we're taking this same chunk of code and repeating it, but for different values of i. So if we take this chunk of code and bring it all the way back up, let's see where this pattern starts. The first thing we pull out after we get the number of samples is the name. So I think the pattern really starts here with two: sample two and row two of the distance matrix. If I set i to two, does everything map here? So samples[2], that maps. These one and minus one are the same. Down here at the distance matrix, this is [2, 1], which is what we have right here. And then here as well, one to two minus one, or one. So this all holds from two down to 10. I'm going to go ahead and delete all that code, and I'm going to delete this code too. So now I have where I'm reading in the data from the distance matrix and I'm setting things up.
And now I'm ready to iterate over all of my samples. So let's go ahead and create our for loop. We'll do for (i in 2:n_samples). Notice that we're using two rather than one, because one is basically the starting condition that's handled up here. We could maybe make it go from one to n_samples, but that would require a little bit of extra coding and be a little more complicated, and we just really don't need to worry about it. Then we'll put an open curly brace here and a closing curly brace here. And to make things look nice, I'll go ahead and bump those lines over a smidge. Let's go ahead and run this and see what we get. Again, we can look at samples, and we get our 10 different sample names. We could look at dist_matrix and get the distance matrix that we've seen now a few times. And we could look at distances, and we see that it's an empty vector.
So we've gone from maybe 50 lines of code down to 21, and this is much more maintainable. The code is DRY. If I had forgotten to make my character values numeric, all I'd have to do is update it here on line 19 to add that as.numeric and we'd be good to go. Something that I might think about doing, actually, is to go ahead and put in the transpose term. So I could say that dist_matrix[1:(i-1), i] is the same as this thing, because it's mirrored; it's transposed over the diagonal. So I added that mirroring here one time. I didn't have to repeat it every time I went through each step of the whole process. And again, we can run all this, look at dist_matrix, and we now see that we've got a symmetric distance matrix. The only reason to point this out is to demonstrate how easy it is to update the code if you're not repeating yourself. If I'm repeating myself multiple times, then I have to add this line of code to every copy of that chunk. But here in a for loop, I do it once, and we're good to go.
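Putting the pieces together, the whole thing might look something like this. Treat it as a sketch and not the exact script from the repository: it scans a made-up three-sample matrix from a text string so it can run by itself, the variable names are my reading of the audio, and the values helper is my own addition to avoid converting the same numbers twice.

```r
# Made-up 3-sample lower triangular matrix standing in for the real file
raw_text <- "3
A
B 0.1
C 0.2 0.3"

distances <- scan(text = raw_text, what = character(), quiet = TRUE)

n_samples <- as.numeric(distances[1])
distances <- distances[-1]

# Pre-define where the results will go
samples <- rep("", n_samples)
dist_matrix <- matrix(0, nrow = n_samples, ncol = n_samples)

# Sample 1 has a name but no distances, so it is handled before the loop
samples[1] <- distances[1]
distances <- distances[-1]

for (i in 2:n_samples) {
  samples[i] <- distances[1]
  distances <- distances[-1]

  values <- as.numeric(distances[1:(i - 1)])   # hypothetical helper variable
  dist_matrix[i, 1:(i - 1)] <- values          # fill below the diagonal
  dist_matrix[1:(i - 1), i] <- values          # mirror above the diagonal

  distances <- distances[-(1:(i - 1))]
}
```

One thing to keep in mind with this sketch: 2:n_samples assumes there are at least two samples, since 2:1 in R counts down instead of being empty.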
The other benefit of making this for loop is that I'm no longer hard coding the 10 samples, the 10 steps through those samples. I'm looping over the number of samples. And so you'll see in here that the number 10 doesn't appear. So what I should be able to do is go to my Bray-Curtis .dist file, which is our full distance matrix, run this whole thing, read it in, and get a 348 by 348 distance matrix. If I do dist_matrix, sure enough, I'm only getting the first two rows, because it truncates the output for such a big matrix. But I see that I've got a symmetric distance matrix that has 348 columns. And it's telling me it's omitting 346 rows, which, with the two that it's printing out, tells me it's a 348 by 348 distance matrix. So again, that's really slick that we've dried our code out and, in the process, also made it more generalizable, so that we can read in any sized lower triangular distance matrix at this point.
Again, don't be scared to use for loops. I think we spend a lot of time in R trying to think about how we can avoid for loops, when the goal is to get the right answer. On data sets this size, we probably don't notice the performance hit. If you write a perhaps poorly written for loop, you notice it more with much bigger data sets. Another trick that you can use to reduce the detrimental effects of using a for loop is to pre-define where the data is going to get inserted. If I kept adding another value to samples without pre-defining it, R would be looking for a new place to store the samples data on every pass through the loop. I realize that seems a little bit esoteric, but with big data sets, where it's moving big pieces of data around in your RAM, things slow down and just kind of bog down. And that's a performance hit that I know R suffers from.
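To make that pre-defining idea a bit more concrete, here is a small sketch that contrasts the two approaches. The names n, grown, and prealloc are made up for this illustration. Both loops give the same answer; the difference is only in how the result vector gets its space.

```r
n <- 10000   # arbitrary size chosen for illustration

# Growing: the vector gets one element longer on every pass,
# so R has to keep finding room for a slightly bigger vector
grown <- c()
for (i in 1:n) {
  grown <- c(grown, i^2)
}

# Pre-defining: make the full-length vector once, then fill it in place
prealloc <- numeric(n)
for (i in 1:n) {
  prealloc[i] <- i^2
}

identical(grown, prealloc)
```

If you're curious, wrapping each loop in system.time() is one way to compare them on your own machine. I'd expect the gap to grow as n gets bigger, but I haven't put numbers on it here.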
Anyway, I think we have written a good for loop here, because we initialized the data when we set things up, and then we're populating an already existing vector for our sample names and an already existing matrix for our distances. We also demonstrated the value of having things DRY by going ahead and getting the transpose of the matrix: as we step through each row, we also populate the corresponding column for each of our samples. So keep practicing with these for loops. Don't be afraid of them. Try to play around with vectors and matrices and all the other base R concepts that we've been working through in recent episodes. And I'll see you next time for another episode of Code Club.