Hey folks, I'm Pat Schloss and this is Code Club. In recent episodes, we've been doing a deep dive into a variety of concepts from base R. As you know, in most of my other episodes I spend a lot of time on tidyverse packages like dplyr and ggplot2. But once you've mastered those, it's really important to become more familiar with functions from base R, because even if you do a ton of work in the tidyverse, you'll still need to know a fair amount about what's going on in base R.

In the last episode, we talked about how to create vectors using c, the colon operator, seq, and rep. Well, I never actually told you how to get values out of those vectors, so that's exactly what we're going to do in today's episode.

The application I'm going for is reading in non-rectangular data. Specifically, I'm reading a lower triangular distance matrix, also called a PHYLIP-formatted distance matrix. We can't use a tidyverse function like read_csv or read_tsv to read in this distance matrix because it's not rectangular, it's triangular. What we saw a few episodes back was that we could use the scan function, which reads the file in as a vector. Now we're getting ready to extract values from the vector produced by scan so we can recreate a matrix that will work well in R. That's what we're going to start working on today, and to do that, I need to show you a variety of ways to extract information from a vector. Let's head over to RStudio and get going on today's episode.

As a reminder from the last episode, we can make a vector of the numbers from 1 to 20 using the colon operator, and that gives us our vector x. Now say I wanted the values between 7 and 12 from my x vector. There's a variety of ways to do that.
One approach that I'll show you is similar to what we've previously done in the tidyverse with dplyr's filter function: we can use a logical operator. I could say I want the values out of x where x is greater than or equal to 7 and where x is less than or equal to 12, which is x[x >= 7 & x <= 12]. This gives me the numbers 7 through 12.

What's going on here is that inside the square brackets we have a logical question. I have x >= 7, which is true for 7 through 20, and x <= 12, which is true for 1 through 12. When we combine them with the and operator, we want the cases where both are true. That creates another vector where the values for 1 through 6 are false, 7 through 12 are true, and 13 through 20 are false. That vector gets plugged into x with the square bracket notation, and the positions that are true are output as the new vector, which as we saw is 7 through 12.

Again, we can take any vector of true and false values and plug it into the square brackets. We could do something like x[c(TRUE, FALSE)]. This creates a vector of TRUE and FALSE that, of course, only has two elements. But R will repeat it along x as true, false, true, false, and so on, and wherever there's a true value, it will output that element. So here we get all the odd numbers: 1, 3, 5, 7, up to 19. Again, it's taking that logical vector and applying it within the square brackets of x to decide which values come back.

In the last episode, I showed how we could make a vector using the c function. So c(4, 2, 4, 0) is a vector, believe it or not, of the number of legs of four different types of animals that I might have living on my farm. This could be called n_legs.
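The logical subsetting described above can be sketched in a few lines. The name in_range is only an illustrative choice, and the final line builds the unnamed leg-count vector that was just mentioned.

```r
# Vector of 1 through 20, built with the colon operator
x <- 1:20

# Logical vector: TRUE only where both conditions hold
in_range <- x >= 7 & x <= 12   # in_range is a hypothetical name

# Keep the elements of x where in_range is TRUE
x[in_range]

# A two-element logical vector is recycled along x,
# so every other element (the odd numbers) is kept
x[c(TRUE, FALSE)]

# The unnamed vector of leg counts: dog, chicken, cat, fish
n_legs <- c(4, 2, 4, 0)
```

Printing in_range on its own is a handy way to see the intermediate vector of TRUE and FALSE values before it is used inside the square brackets.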
And so this is, again, the number of legs on a dog, the number of legs on a chicken, the number of legs on a cat, and the number of legs on a fish, so zero. It sure would be nice to have a name connected to each number. Now, I could create a separate vector called something like animals, right? But then things might not be as well connected as I'd like them to be.

What we can do with the c function is actually name these seats. So I could say "dog" equals 4, "chicken" equals 2, and I'm putting these names in quotes, then "cat" equals 4, and "fish" equals 0. Now if I look at n_legs, I see a slightly different output than what I had before: I've got names on top of each value of my vector. What's cool about this is that I could then do n_legs["cat"]. What do you think is going to come back? Cat with 4, right? So I can name the actual seats in the vector and then use that name to get the value back. Again, I could do n_legs["dog"] and get back 4. This way, I don't need to worry about the order of the values in my vector; I can call each one up by name.

This is a feature that is found in many programming languages. I believe that Python calls them dicts or dictionaries. I think Perl, if I recall correctly, calls those lists. It gets a little bit confusing because R also has lists, but they mean something different. In R, we call this a named vector.

Now, this is really convenient, but you can run into a little bit of trouble if you're naming things numerically. So let me give you a quick demonstration. Let's do ranking. Because college football is on our mind here in Michigan, let me try to recreate the rankings as I remember them. Number one right now is Alabama. Two is Michigan. I don't like their chances in the playoffs. Three, we'll say, is Georgia. I think that's right. It might be Cincinnati, but who cares?
And then I think fifth is Notre Dame. So we again have our named vector with four seats, these four universities, and each seat has a name. What you'll notice is that when I name each of the seats, both here for ranking and for n_legs, I put the name in quotes. It's important to keep in mind the distinction between a name and the seat number in the vector. If I do ranking[1], that gets me Alabama, right? That's fine. But if I do ranking[4], that gets me Notre Dame. That seems weird, doesn't it? I would think that if I did ranking[5] I would get Notre Dame, but instead I get an NA value. Again, that's because of the confusion between, if you will, the seat number or spot number in the vector versus the name.

So if I do ranking["5"], with the five in quotes, I will get Notre Dame. And if I did ranking["3"], I'll get Georgia. In this case, one, two, and three work perfectly fine whether I call that value in ranking as a character or as a number. More precisely, it should be as a character, because I've named those slots. We definitely see the problem with Notre Dame, where the name is "5" but it's in the fourth slot.

This is a silly example, but trust me, this comes up a lot. It's actually one of the reasons why, if you've noticed, when you look at a table in dplyr there are no row names: row names end up causing a lot of problems. So in many cases, naming your vectors like this really isn't ideal. The better way would be to have a data frame, like we use with dplyr, where you have a column of, say, the rankings and a column of the university names. That way you can do those dplyr operations on those values rather than relying on a named vector.
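The named vectors and the name-versus-position pitfall described above can be sketched as follows. The data frame at the end is a hedged base R illustration of the alternative that was mentioned; the names rankings_df, rank, and school are assumptions made for this example.

```r
# Named vector: each seat gets a name.
# Quotes are optional for names like dog, but needed for names like "1".
n_legs <- c("dog" = 4, "chicken" = 2, "cat" = 4, "fish" = 0)
n_legs["cat"]    # look up by name
n_legs["dog"]

# Names that look like numbers are still character names
ranking <- c("1" = "Alabama", "2" = "Michigan",
             "3" = "Georgia", "5" = "Notre Dame")

ranking[1]       # position 1: Alabama
ranking[4]       # position 4: Notre Dame, even though its name is "5"
ranking[5]       # there is no fifth position, so this is NA
ranking["5"]     # name "5": Notre Dame
ranking["3"]     # name "3": Georgia

# Hypothetical data frame alternative: rank and name live in columns
rankings_df <- data.frame(rank = c(1, 2, 3, 5),
                          school = c("Alabama", "Michigan",
                                     "Georgia", "Notre Dame"))
rankings_df$school[rankings_df$rank == 5]
```

With the data frame, the lookup is an explicit comparison against the rank column, so there is no ambiguity between a position and a name.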
But again, this is helpful for getting us to think about how we can access the values in these vectors using numbers, as we've already done in the case of ranking[4]. At the end of the last episode, I showed you that we have this letters vector. This is a built-in vector that has all 26 letters in American English. If we did letters with square brackets, I could say letters[4] to get the fourth letter.

What's really cool with vectors is that you can put the position number in those square brackets. Something that's really important to know about R is that if you do letters[1], you get a. In some languages, you would use zero to get a, but letters[0] doesn't exist, as we see in this output, which is an empty character vector. So if you're using Python, know that its indexing starts at zero, as does C and C++. R and some other languages start at one. It's always an important thing to keep in mind if you're pivoting between different languages: what are these vectors indexed on? And again, in R they're indexed on one.

I can also give letters a vector of positions. So I could do letters[c(1, 3, 5)] and I'll get back the first, third, and fifth letters: a, c, and e. I could do letters[10:20] to get letters 10 through 20. It's really nice to be able to put numeric values into the square brackets of a vector and get back those values from the vector. This is what I call a positive way of accessing values in a vector.

There are also negative ways, where you specify what you don't want from the vector. If I do letters[-1], I remove the first seat, and so now a is gone. If I were to assign no_a as letters[-1], no_a would be that vector. Then if I do no_a[-1], I remove the b, right? So let's go back to letters. If I don't want a bunch of values, I could do letters[-c(1, 2, 3)] to get rid of a, b, and c.
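The positive and negative indexing described above can be sketched with the built-in letters vector. The name no_a is the one used in the narration; nothing else here is assumed beyond base R.

```r
# Positive indexing: ask for the positions you want
letters[4]            # fourth letter
letters[1]            # first letter, because R indexes from one
letters[0]            # position zero returns an empty character vector
letters[c(1, 3, 5)]   # first, third, and fifth letters
letters[10:20]        # letters 10 through 20

# Negative indexing: name the positions you do not want
no_a <- letters[-1]   # everything except the first seat
no_a[-1]              # now the b is gone as well
letters[-c(1, 2, 3)]  # drop a, b, and c in one step
```

Note that negative indexing returns a new vector; letters itself is unchanged until the result is assigned back to a name.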
So this positive and negative approach to getting values from a vector can be really useful. Let's take this information and think about how we can use it with the output from scan. Here again, I have my read LT matrix dot R script. I'll go ahead and run this chunk of code, and let's look at distances. What we see is, again, the entire contents of this distance matrix as a vector. In fact, if I look at my distance matrix file, I can see the mapping, right? This 10 is there, F3D0 is right there, F3D1 is there, and so forth. The first value in this vector is the number of samples in the data set, whereas F3D0 is the name of the first animal in the study.

So how can we use this? Let's come back to distances and think about how we can get the number of samples. We could create n_samples, and to get that first seat, we could assign it distances[1]. Now if we look at n_samples, we see we've got "10". That's a character type. We know that this is a character because the 10 is wrapped in quotes. Scan reads in all the data from this file as character because we told it that the data are to be read in as character, or as strings. If we wanted this to be numeric, as we saw in the last episode, we could wrap distances[1] in as.numeric. Now if we look at n_samples, we no longer see those quotes, and we see that 10 is a numeric value.

Now, though, if we look at distances, we still have the full vector, and that 10 is still there in the first slot. So how would we go about getting rid of that 10? Well, as I've shown you, we could do distances[-1], which removes the 10 from the first seat in this vector. I can now assign distances[-1] back to distances, and we have our updated distances vector where we no longer have the number of samples in that first slot.
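The two steps just described can be sketched as follows. The real vector comes from scan on the distance matrix file, which isn't reproduced here, so this example uses a short made-up stand-in: the sample names echo the ones mentioned above, but the distance value is purely hypothetical.

```r
# Toy stand-in for the character vector that scan() would return.
# The leading "10" and the names follow the narration; "0.25" is made up.
distances <- c("10", "F3D0", "F3D1", "0.25")

# Take the first seat and convert it from character to numeric
n_samples <- as.numeric(distances[1])
n_samples

# Drop the first seat and assign the result back to distances
distances <- distances[-1]
distances
```

After these two lines, n_samples holds the sample count as a number and distances starts with the first sample name.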
And so you can imagine that we might repeat steps like this to pull out the sample names as well as the distances. We're going to save that for the next episode, because we need to learn about another type of data structure called a matrix to pull that off. So be sure that you've subscribed and that you've liked this video so that YouTube will show you that video when I release it in a couple of days.

What I want to point out before we close for the day is that the series of steps I've shown you on lines six and seven is called a shift in other programming languages: you take the first value and remove it from the vector, you assign that value to another variable, and the resulting vector is one element shorter. Alternatively, there's also a function in other languages called pop. Pop removes the last value in kind of the same way that shift removes the first value. R doesn't have either shift or pop, so we kind of have to roll our own like this.

Again, that's one of those little things about R that you might find annoying. That R's vectors are indexed on one is another thing that people find annoying. But really, who cares? It's not that big of a deal, right? I'm not too bothered by it, and you figure out these little hacks along the way to do exactly what you might be able to do in those other languages.

Anyway, like I said, we're going to come back to this in the next episode and see how we can combine this knowledge of how to get values from a vector and how to remove values from a vector with the concept of a matrix to build out our distance matrix and our vector of sample names. So keep practicing with us. See how you can use this knowledge of how to access and remove values from a vector in your own work. And we will see you next time for another episode of Code Club.
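As a recap, the roll-your-own shift and pop described above can be sketched in base R like this. The vector v and the names first and last are hypothetical and only serve the illustration.

```r
# Hypothetical example vector
v <- c("a", "b", "c", "d")

# "Shift": take the first value, then drop it from the vector
first <- v[1]
v <- v[-1]

# "Pop": take the last value, then drop it from the vector
last <- v[length(v)]
v <- v[-length(v)]

first
last
v
```

Each operation is two steps because indexing returns a new vector rather than modifying v in place, so the shortened vector has to be assigned back.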