Hey folks, I'm Pat Schloss. In this episode of Code Club, I'm going to show you how we can introduce logic into our R scripts so that we can control the flow of data, that is, control what happens to our data at different steps depending on what the data look like. Does this sound a little abstract? Well, bear with me. We're in the process of stepping through a series of episodes where I'm focusing on tools that we get from the base R software, from the base R programming language, if you will. We're not using anything from the tidyverse or any other packages. Yes, I know there are packages out there to do what we're trying to do, which is to read in a PHYLIP-formatted distance matrix. But if we use those packages, then we don't get this great opportunity to learn all the fun things from base R. And while we do have great packages out there, including the tidyverse, it really does pay to know some of the fundamentals of base R, even if you only use them a few times.

Over here in RStudio, let me remind you of what we're working with. We have this lower triangular distance matrix in what's called the PHYLIP format. The first line, where we see 10, is the number of rows that follow, the number of entities that we're calculating the distances between. The first column has the names of all of those entities. In my case, these are different fecal samples taken from a mouse, and we are comparing the community structure of these 10 different communities. They happen to all be from the same mouse on different days of its life. The values then are the ecological distances, the Bray-Curtis distances, between all pairs of samples. And we say it's lower triangle because we only get the lower triangle. If you imagine this being a square matrix, the distance between F3D0 and F3D0 is 0, but the distance between F3D0 and F3D1 is 0.392.
And the distance between F3D1 and F3D0, which you might expect to appear where my cursor is, would be the same 0.392. So to keep things simple, this format keeps only the lower triangle. What we've done in the previous episodes is build out this readLTMatrix.R script that reads in that lower triangular distance matrix. We see that we get a 10 row by 10 column distance matrix that is symmetrical, right? Here we have that 0.392, and here we have the corresponding 0.392. And then the diagonal is a series of zeros.

So something I was thinking about is, what would happen if we gave this script a square distance matrix? Some software, like mothur or even the original PHYLIP package, will output either a square matrix or a lower triangular matrix. I think mothur's default is actually the lower triangular matrix. But say somebody gave it a square matrix, what would happen? Well, I have gone ahead and created a square version of this distance matrix. You can see it has all those distances. We have funny things happening here with the wrapping of the lines, but you can trust me, I have 10 rows here and 10 columns. This is the square version. If you want to get this square matrix as well as the code that I'm working with, go to the link down below in the description and you can find this file. I recently added this square version of the distance matrix for just this episode, so if you want it, you'll have to go ahead and grab that file.

So let's see what happens if we give the script the _sq file. We'll run each line, and we can then look at file. Here it's reading everything in. Again, it's parsing the file line by line, and we see that again there are 10 samples. We then do the file split. We then fill in. This fill-in step was a tool that we used to add extra spaces because each line didn't have the same number of columns, because the matrix was lower triangular.
In this case, all of the lines have the same number of columns, so it shouldn't have to fill anything in. So if I look at filled, I should see a list with 10 elements, which I do. And each of them is 10 elements long. So that works. Then we can bind those together to make a distance matrix. Again, this is the samples distance matrix, and we can see that we now have a 10 by 11 matrix, 10 rows and 11 columns. It's of type character, because all of the entries in a matrix are of the same data type. If you have one character value in a matrix, then everything in the matrix is a character. In the next step, we pull out the sample names to make the samples variable, and then we remove that column from the distance matrix. So now if we look at samples, we see our 10 sample names, and we see that we have a 10 by 10 matrix, but it's all of type character. So then here on line 22, we went ahead and made it a numeric matrix. If we look at dist_matrix, we see it's of type numeric, and you know it's numeric in this case because it no longer has those quotes around each of the numbers.

The next step, here on line 23, took the lower triangle distance matrix and added to it the transpose. To see why, let's go back to the lower triangle case. We'll remove that _sq and run everything except line 23, and I see that I have a lower triangle distance matrix, right? Line 23 takes this and adds the transpose of this matrix. If I take t(dist_matrix), I get the transpose, I get the upper triangle. And when I add those two together, I get a square distance matrix. Okay, so that's what's happening in the code. Again, let's put back in the _sq, and we'll come all the way back down to line 22. If we look at dist_matrix, we see that we have a square distance matrix.
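As a rough sketch of the transpose-and-add step described above: the 3 by 3 matrix below is made up for illustration and is not the episode's data, and the name dist_matrix is an assumption about how the variable is spelled in the script.

```r
# Hypothetical 3-sample lower triangle matrix (made-up values).
# The upper triangle is all zeros, as it is after reading a lower
# triangular file and padding the missing cells.
dist_matrix <- matrix(c(0.0, 0.0, 0.0,
                        0.4, 0.0, 0.0,
                        0.6, 0.2, 0.0),
                      nrow = 3, byrow = TRUE)

# t() flips the matrix across its diagonal, so the distances move
# into the upper triangle
upper <- t(dist_matrix)

# Adding the two fills in both halves; the diagonal stays 0 + 0 = 0
square <- dist_matrix + upper
print(square)
```

Because every cell is only ever added to a zero, the distances themselves are unchanged by this step when the input really is lower triangular.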
But if we run this next line, dist_matrix + t(dist_matrix), we're going to be adding the distance matrix to itself, since it's already symmetrical. And what you can see is that this distance now is 0.282. If you come back up to the previous version, we see that it was 0.141. So we doubled the distances, right? So this line 23, I only want it to run if my matrix is lower triangular. If it's a square symmetric matrix, I don't want to run this line, because we're going to be doubling our distances, which is no good.

So to tell R to only add the transpose when we have a lower triangle distance matrix, we need to build logic into our code. And we can do this using a series of if-else statements. Now, you may be familiar with the function ifelse. You might also be familiar with if_else. So if_else is from dplyr, and ifelse is from base R. I'm not going to worry about the dplyr version right now, because it turns out we don't want either of these. Let me give you a brief demo of how ifelse works, because it'll help motivate what we're doing and why it's different. Say we have a vector x of 1 to 20, and we do x %% 2. Again, modulus returns the remainder. So if we take x %% 2, we get a series of ones and zeros. If it's one, then the number is odd, and if it's zero, it's even. So I could do ifelse with x %% 2 == 1 as the test, then put in "odd" as a character string, and then "even" if it's not one. If I run this, I now get odd or even, depending on whether that spot in the vector was odd or even. That's pretty convenient.

So let's see if we can use the ifelse function to tell R when to add the transpose of the distance matrix. Let's go back up and work with the lower triangle version. I'll go ahead and remove that _sq, and I will then rerun everything down through line 22.
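The ifelse demo described above would look something like this. The variable name parity is only for illustration; the episode just runs the expression at the console.

```r
# A vector of the integers 1 through 20
x <- 1:20

# %% is the modulus operator: the remainder after dividing by 2,
# so odd numbers give 1 and even numbers give 0
remainders <- x %% 2

# ifelse() is vectorized: it checks every element of the test and
# returns a value for each one, so the result is as long as x
parity <- ifelse(x %% 2 == 1, "odd", "even")
print(parity)
```

The point to hold on to is that the output has one entry per element of the test, which is exactly what makes ifelse the wrong tool for a single yes-or-no decision about a whole matrix.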
And again, I can see that I've got my lower triangular distance matrix, because I've got all those zeros in the upper right corner. So think about dist_matrix. I want to know, is that equal to the transpose of dist_matrix? If dist_matrix equals the transpose of dist_matrix, it's symmetrical. If I run dist_matrix == t(dist_matrix), this gives me a bunch of FALSEs, except for the values on the diagonal, which were zero anyway. So this tells me that it is not symmetrical. Now let's say I had used _sq instead. Again, we'll run everything down to line 22. Then if we run dist_matrix == t(dist_matrix), I get a whole bunch of TRUE values. So you can hopefully see the difference: with the lower triangle matrix, nearly everything is FALSE, and with a symmetric matrix, everything is TRUE.

Now, I can ask R whether all the values in a vector or a matrix are TRUE by using the all function. If I ask all whether every value in this entity is TRUE, it returns TRUE. Alternatively, if I give it the lower triangle matrix, again running everything down to line 22 and then this all statement, I get back FALSE. So if it's FALSE, the matrix is lower triangular. If it's TRUE, it is symmetric. That's the logic that we want to be able to build into this R script.

So again, we could take ifelse, and this could be our question. This would be analogous to x %% 2 == 1, because this is a statement that returns TRUE and FALSE values. If that is TRUE, then I'm going to return dist_matrix, because I don't need to change anything. If it's FALSE, then it's lower triangle, and I need to add the transpose. So again, if everything is the same when I transpose the matrix, then I'm going to return dist_matrix.
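Here is a minimal sketch of that symmetry check, using two made-up 3 by 3 matrices (named lower and square here for illustration) in place of the episode's two files.

```r
# Hypothetical lower triangle matrix (made-up values) and its
# square, symmetric counterpart
lower <- matrix(c(0.0, 0.0, 0.0,
                  0.4, 0.0, 0.0,
                  0.6, 0.2, 0.0),
                nrow = 3, byrow = TRUE)
square <- lower + t(lower)

# Cell-by-cell comparison with the transpose: only the diagonal
# matches for the lower triangle version
print(lower == t(lower))

# For the symmetric version every cell matches
print(square == t(square))

# all() collapses each logical matrix down to a single TRUE or FALSE
lower_is_symmetric <- all(lower == t(lower))     # FALSE
square_is_symmetric <- all(square == t(square))  # TRUE
```

Note that the test's result is a single value, which matters for what comes next.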
If it's FALSE, then it's lower triangle, and I need to add the transpose. At least that's my theory. So I run that and actually get back the number zero, which is not at all what I was expecting. So why are we getting this zero value? Well, if you look at the documentation for ifelse, it tells you that ifelse returns a value with the same shape as test, which is filled with elements selected from either yes or no, depending on whether the elements of test are TRUE or FALSE. So x %% 2 == 1 is the test, that's the test argument. Similarly, this call to all is also the test argument. And as we saw before, it has a value that is a single element long: it's FALSE. So ifelse is replacing that single FALSE with data from the no argument, and we're just getting the very first cell of that matrix. So ifelse doesn't work for this situation. ifelse is great when you're giving it a vector and you want to get out values that are the same length as that vector. So this isn't going to work. But this logic will help us to think about how we can control the flow through our program using a similar if-else statement.

So we can use if on its own, and to if we can give our test statement. We can say, if all values of dist_matrix equal the transpose of dist_matrix, which we said means that it's symmetrical. Then in curly braces I'll put print "data are symmetric". If we go ahead and run this, it doesn't say anything. And so we can then say else, print "data are not symmetric". If we run this block, it's going to return "data are not symmetric", because we've given it the lower triangle version of the data. If the data aren't symmetric, then we want to return dist_matrix plus the transpose of dist_matrix. Now the output is our symmetrical distance matrix. And it only did that in the case where the data were not symmetrical.
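To make the contrast concrete, here is a sketch of both attempts on a made-up lower triangle matrix. The name dist_matrix is assumed, and collapsed is just an illustrative name for the surprising ifelse result.

```r
# Hypothetical lower triangle matrix (made-up values)
dist_matrix <- matrix(c(0.0, 0.0, 0.0,
                        0.4, 0.0, 0.0,
                        0.6, 0.2, 0.0),
                      nrow = 3, byrow = TRUE)

# ifelse() returns something shaped like its test. The test is one
# FALSE, so only one element of the "no" matrix comes back: its
# first cell, which sits on the diagonal.
collapsed <- ifelse(all(dist_matrix == t(dist_matrix)),
                    dist_matrix,
                    dist_matrix + t(dist_matrix))
print(collapsed)

# if / else evaluates one condition and then runs a whole block of
# code, so it can replace the entire matrix
if (all(dist_matrix == t(dist_matrix))) {
  print("data are symmetric")
} else {
  print("data are not symmetric")
  dist_matrix <- dist_matrix + t(dist_matrix)
}
print(dist_matrix)
```

The if / else form is control flow rather than a function that builds a vector, which is why it can leave the matrix alone in one branch and rebuild it in the other.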
So we'll go ahead and assign that back to dist_matrix. We don't need to change anything here in the if branch, because we're leaving dist_matrix untouched. We could add the _sq and run all this code again. Then we can come down to our if-else block and run that, and it says data are symmetric. If we then look at dist_matrix, we now see that we've got that symmetric distance matrix without doubling our distances. So again, we're able to use that if-else statement to control the logic, to control what is happening to the data as we go through the script, which is really convenient.

This if-else block is a little bit silly, because we don't really need the else statement if we write our if statement a little bit differently. If instead of all I did exclamation point all, that would take the FALSE value and make it TRUE. That would basically be flipping my if and my else, as I've got it written here. So I could then take "data are symmetric" and "data are not symmetric" and switch their locations. Now again, my data are symmetric, so when I run this, it will say data are symmetric. And if I remove this _sq, I see data are not symmetric. And when I look at dist_matrix, I now get the symmetric distance matrix.

As I look at this code, I start wondering about all sorts of different cases where perhaps I screw up the data. Say I went into my distance matrix file, and I was looking at the distances, and as I was looking around I accidentally hit a key on my keyboard, and that screwed up the data in the distance matrix. So I could have a distance matrix that was supposed to be symmetric, it's square, it's not lower triangle, but it's not symmetric: it doesn't have the same values on either side of the diagonal. This code would effectively double all the distances, because it would say that the values on either side of the diagonal are not identical.
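A sketch of the negated form with no else branch follows. Wrapping it in a function called fill_if_needed is purely an illustrative choice so the same logic can be run on two inputs; the episode's script does this inline, and the matrices are made up.

```r
# Hypothetical helper: add the transpose only when the matrix is
# NOT already equal to its own transpose
fill_if_needed <- function(m) {
  # ! flips the single TRUE/FALSE returned by all()
  if (!all(m == t(m))) {
    m <- m + t(m)
  }
  m
}

# Made-up lower triangle matrix and its symmetric counterpart
lower <- matrix(c(0.0, 0.0, 0.0,
                  0.4, 0.0, 0.0,
                  0.6, 0.2, 0.0),
                nrow = 3, byrow = TRUE)
square <- lower + t(lower)

from_lower <- fill_if_needed(lower)    # upper triangle gets filled in
from_square <- fill_if_needed(square)  # left untouched, nothing doubled
```

This version treats anything that is not perfectly symmetric as lower triangular, which is the weakness raised above for a square matrix with a corrupted value.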
Admittedly, this is a bit of a concocted situation, but go with me so I can show you a few more of the tricks that we can use with our if-else statements. To extract the upper triangle of the distance matrix, we could use upper.tri as a function on dist_matrix. This returns TRUEs and FALSEs for each value of the matrix to tell you which values are in the upper triangle. By the same token, we could also use lower.tri to find the values that are in the lower triangle of the matrix. Well, we could take upper.tri and use that as the indexing value into dist_matrix. This then returns a vector of all the values in the upper triangle. I could use this as the argument to sum, to get the sum of all those values. And if the sum of all those values was zero, then I would know that it was a lower triangle matrix. So I can say, if the sum of all that, I've got to make sure I get my parentheses right, if that is zero, then it's a lower triangle matrix. So I move this dist_matrix line up, and I will say print "data are lower triangle". And then I will go ahead and add this dist_matrix line. You can see that as I'm adding in line breaks, it doesn't quite know where to put this line, it's putting it at the left side. That's because I forgot the open curly brace on my if statement here. I can then go ahead and close this block.

If I run this series of if statements as it's currently written, it will run this first if statement, and if it's a lower triangle matrix, it will make it square. It will then come down and run this next statement, where it will ask, are the data symmetric? Well, this could probably work. What I'd rather have is one complete block where the data come in and it asks a question. If that's false, then I want it to ask another question. And if that's false, then I want it to do the else statement.
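The upper triangle test described above might be sketched like this, again on made-up matrices rather than the episode's files.

```r
# Made-up lower triangle matrix and its symmetric counterpart
lower <- matrix(c(0.0, 0.0, 0.0,
                  0.4, 0.0, 0.0,
                  0.6, 0.2, 0.0),
                nrow = 3, byrow = TRUE)
square <- lower + t(lower)

# upper.tri() returns a logical matrix that is TRUE only for cells
# above the diagonal
print(upper.tri(lower))

# Indexing with that logical matrix pulls those cells out as a vector
upper_values <- lower[upper.tri(lower)]

# If the upper triangle sums to zero, nothing was stored there, so
# the input must have been lower triangular
lower_is_lt <- sum(lower[upper.tri(lower)]) == 0    # TRUE
square_is_lt <- sum(square[upper.tri(square)]) == 0  # FALSE
```

This relies on distances never being negative, so a zero sum can only come from every upper triangle cell being zero.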
Well, I can do this with an else if statement. By adding else if, it says: if this first test is false, then otherwise ask, is this true? If this is true, then it'll say data are not symmetric. If it's false, it'll go ahead and do this else statement, where it says data are symmetric. And know that you can use this if, else if, else structure to make very complicated sets of statements. You can have if statements inside if statements, and you can have as many else if statements as you want. To some degree, this becomes a lot like case_when or the switch function that you might have seen with the tidyverse or other elements of base R. But know that case_when, like ifelse, works on vector data, whereas switch and if, else if, else blocks like this work on single values.

So let's go ahead and clean up our code a little bit. Again, what we have is our lower triangle matrix. If we run this, it should output data are lower triangle. And it does say data are lower triangle. If we look at dist_matrix, we now get our symmetric distance matrix. Let's go ahead and put in _sq and rerun everything: data are symmetric. Looking at dist_matrix, we again see our symmetric data. Now I'm going to come back up into my square version of the Bray-Curtis file, and I'm going to accidentally delete some values. I'll go ahead and save that. Now if I rerun that, it tells me my data are not symmetric. Again, this situation of corrupting your data so that the distance matrix is not symmetrical is a bit artificial. But it does allow us to highlight how we can use if, else if, and else statements all together to control the flow of data through our script.

One other thing to notice about the syntax is that the else has to be on the same line as the closing curly brace. If I put the else down on its own line and run this, then it's going to complain, right: unexpected 'else' in "else".
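Putting the pieces together, the three-way block described above could be sketched as follows. The function name square_up and the test matrices are hypothetical and only there so the three cases can be run side by side; the episode writes the block inline in the script.

```r
# Hypothetical wrapper around the if / else if / else chain
square_up <- function(dist_matrix) {
  if (sum(dist_matrix[upper.tri(dist_matrix)]) == 0) {
    # Nothing stored above the diagonal: fill it in from the transpose
    print("data are lower triangle")
    dist_matrix <- dist_matrix + t(dist_matrix)
  } else if (!all(dist_matrix == t(dist_matrix))) {
    # Square but the two halves disagree: report it, change nothing
    print("data are not symmetric")
  } else {
    # Already square and symmetric: nothing to do
    print("data are symmetric")
  }
  # Each "else" above sits on the same line as the closing brace
  # before it, as the episode recommends
  dist_matrix
}

# Made-up inputs for the three cases
lower <- matrix(c(0.0, 0.0, 0.0,
                  0.4, 0.0, 0.0,
                  0.6, 0.2, 0.0),
                nrow = 3, byrow = TRUE)
square <- lower + t(lower)
corrupted <- square
corrupted[1, 2] <- 0.9  # simulate a stray keystroke in the file

from_lower <- square_up(lower)
from_square <- square_up(square)
from_corrupted <- square_up(corrupted)
```

Only the first branch that tests TRUE runs, so a lower triangle input is never also checked for symmetry, and a corrupted square input is reported without its distances being doubled.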
And that's again because that else needs to be on the same line as the closing curly brace. Okay, I'm going to go ahead and clean up the code a little bit and remove those else if and else statements, as well as that print statement. I'm fairly confident that my data are going to be well structured and not corrupted. But certainly if I were creating a package, I would want to put in all sorts of tests to make sure that the data were well structured. Because if I'm putting this out into the wild, who knows what kind of data people are going to end up giving it. So I feel pretty good about this. I'm going to go ahead and save and commit it. You can find the final version up on the GitHub repository. Again, the link for that is down below in the description. Come back next time, make sure that you subscribe so you know when the next episode is released. And we'll see you for another episode of Code Club.