Hey, folks, welcome back for another episode of Code Club. In recent episodes, I have been working to build a package in R that would allow us to do a bioinformatic analysis. The details of that aren't super important, but you can always go back and start at the beginning of the series if you're just jumping in right now. What's relevant for today's episode is that at the end of the last episode, I tried building an object in R that I think had something like 25,000 rows and about 65,000 columns. And it just exhausted all of the memory that R would allocate to that object. And so that wasn't going to work. What I've been thinking about, though, is that that object has a bunch of zeros in it, so the data is very sparse. And so I don't need to store it in that large of a configuration. And so there's a variety of different options that we have at our disposal.
And so instead of jumping into solving the problem, what I want to do is take a step back and think about the different options that we have and how those different approaches perform in terms of speed and perhaps memory utilization and things like that. And so in the next three episodes, I'm going to be looking at some of these more fundamental data structures, things like vectors, lists, and data frames. So these are all effectively one form or another of a list, but we're going to start with a vector. And so a vector is a container, if you will, where every value in that container is of the same type. So you wouldn't have in that vector, say, a person's name, their age, whether they were over six feet tall, and their birth date, right? Those are four very different types of data that we'd represent in four very different ways. But instead a vector would be, say, everybody's name. So if you had a class of 20 people, you'd have a vector with 20 values in it with everyone's name. Another vector might be their ages, right?
Another vector might be the grade on the most recent quiz they took, things like that. The way we typically access a value would be by giving the name of the object and then, in square braces, a number, and that gets us the seat, right? So a built-in vector would be something like letters, as we see here in my RStudio session, and this gives us the 26 letters. And I could always do something like letters[2] to get b, or I could do something sneaky like letters[2:5] to get b through e, right? And so vectors are very versatile. The challenge with vectors comes, however, when we're using them with really large data sets, and creating a vector, it turns out, can actually be quite slow. And so again, in today's episode, what we're gonna do is focus on vectors and think about their performance and how we can use them to perhaps make our code a little bit more performant.
So in today's episode, we're gonna use a library or package called microbenchmark. Make sure you've got that installed. I'll also go ahead and load up my tidyverse package as well so that I have all those great data manipulation tools at my disposal. So in microbenchmark, there is a function called microbenchmark that takes a list of function calls, right? And then you run that and it will run each of your functions 100 times and return those timings, giving you some summary statistics of how long each function took over those 100 iterations. You can always increase it to a thousand. I'm not really interested in that much precision. I'm more interested in getting a relative sense of how a variety of different ways of creating a vector, as well as getting values out of a vector, perform. So the first approach that I'll use will be what I'm gonna call growing. So a common function that we use to build a vector is called the c function, which is short for combine.
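As a rough sketch of the indexing and timing ideas just described: the block below indexes the built-in letters vector and, only if the microbenchmark package happens to be installed, times two expressions. The labels one_value and four_values are made up for this illustration and are not from the episode.

```r
# `letters` is a built-in character vector holding "a" through "z".
second <- letters[2]      # a single position
b_to_e <- letters[2:5]    # a range of positions
print(second)
print(b_to_e)

# Optional timing sketch; skipped when microbenchmark is not installed.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  timings <- microbenchmark::microbenchmark(
    one_value = letters[2],     # hypothetical label for this sketch
    four_values = letters[2:5], # hypothetical label for this sketch
    times = 100                 # runs per expression; 100 is the default
  )
  print(timings)
}
```

The printed summary has one row per expression, with columns such as the minimum, median, and maximum time.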
At least, I believe that's what it's short for. So you can do something like c(3, 4) to get a vector of two elements that's got the values three and four, right? And so what we can do is use this to grow a vector. So we could assign this to the object x, and then we could say that x is going to be x with, say, 10 combined onto the end of it, right? So now what we get for x is three, four, and 10. Of course, we could have always done c(3, 4, 10) and gotten the same value out, right? But sometimes we don't know all of the values up front, right? And so what we might need to do is what we'll call growing the vector.
And so I'm going to create a function that takes x to be an upper value, and what I wanna output is a vector of that length where each value in the vector is the square of the position it's in. So the value in position five should be 25, okay? And so I'm gonna take x, and for now let's pretend that it's five. I'll go ahead and remove these lines to clean up the space a little bit. And what we'll do will be a for loop. So I'll do for (i in 1:5) and define the body of that for loop with curly braces. And I need an object that we'll call output, and it's gonna be c(output, i^2). Now we need to define output, because if I run this, it's gonna complain that object output is not found. So I can define output to be an empty vector of type numeric. So now when we run this, it works. And if we then run output, we get our first five values.
Cool, so let's wrap this in a function. So I'm gonna call this vector_grow_c, and again we use the function keyword and we'll give it x to be the upper bound, or the length of the vector we wanna grow. And so we'll go ahead and tab this over to make it look nice. And then we'll ship the output object out of the function, and then we'll close with the curly brace there, right? And so we can do vector_grow_c on five and we get those values, right? Cool, I'm gonna go ahead and delete this for now.
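The function described above might look roughly like this. It is a sketch reconstructed from the narration, so the snake_case spelling of the name is an assumption about what was typed on screen, and the loop already runs to x rather than to the hard-coded five used while prototyping.

```r
# Grow a vector one element at a time with c().
vector_grow_c <- function(x) {
  output <- numeric()            # empty numeric vector, length 0
  for (i in 1:x) {
    output <- c(output, i^2)     # builds a new, longer vector on every pass
  }
  output
}

print(vector_grow_c(5))
```

One caveat with this sketch: 1:x counts down when x is 0, so seq_len(x) would be the safer loop range if a length of zero were possible.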
So that's one method of creating a vector, right? And again, it's a special kind of vector where the values are the square of the position they're in. I'm gonna take this and make another function that's very similar to it. And instead of c, I'm gonna call this one b, or maybe I'll call it br for bracket, so vector_grow_br. And so what I'm gonna do, instead of growing it with the c function, is use output with the square braces and assign i^2 to output[i], okay? And so this numeric vector starts with no length, right? And so it's gonna strike many of you that have programming experience as weird that R will do this, that it will allow you to assign to position five when that position doesn't even exist, right? So when we run this now and again look at output, we again get those values. I'm realizing that I have one through five here. This should be one through x on the upper end, right? And so I need to go ahead and reload those functions. All right, and so again, we can test that these functions do what we think they should be doing by, say, taking vector_grow_c and giving it 10, and we get those values. And if we do vector_grow_br, we should get the same values, cool.
So now we have two functions. We're going to go ahead now and feed them into microbenchmark. And I'm going to give it vector_grow_c on 1e4, a big number so that it takes a bit of time, and then vector_grow_br on 1e4. I could also define n to be 1e4. I'm going to make a number of functions, so that way I don't have to keep typing this, and if I decide to change n later, I can change it up here on line 26 instead of with each function. So I'll go ahead and do that. All right, so again, we've got that and we'll go ahead and run the microbenchmark here. All right, so this is outputting in milliseconds, we see, and we have both of our functions here on the left side. My window is kind of squinched in a bit.
Not everything fits on one line, right? So we've got the last two columns of the data frame getting put down here. And so I typically focus on the median column, and what we see is that growing the vector with the c function actually takes a hundred times longer than growing it with the bracket notation that we had here, right? So that's, I think, lesson number one: do not grow a vector using the c function, okay? So hopefully I can convince you of that. I'll go ahead and remove some white space here just to make our code a bit more compact. And also, if you wanna get the code that I'm developing here, look down below in the show notes and you'll see a link to the GitHub repository for this project. I'll go ahead and save it now into the phylotypr home directory. I'm gonna call this benchmarking. And so that will be there at the end of the episode.
So cool, all right. So that's two ways to grow a vector and to create a vector, again, with the square of the values. So the next approach I'm gonna take is gonna be similar to growing with the bracket, but I'm gonna call it pre-allocate, and instead of calling numeric with no argument, I'm gonna give it the value x. And so if x is five and I do numeric on five, that gives me a vector five units long with all zeros in it, okay? And so the difference between vector_grow_br and vector_grow_prealloc is on line 13 versus line 21: on line 21, I'm specifying the length of the vector. So we'll go ahead and add this to our tests and see what we get. And it's not happy because I have to be sure to load the function, very good. So that ran, and we see that allocating the full length up front is even faster than growing with the bracket notation. So we see that we're getting a bit faster as we go through, but we're not done yet, okay? So there's another way to create a vector besides the c function, and that is to use a colon.
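To make the last two approaches concrete, here is a sketch of the bracket-growing and pre-allocating functions as they are described above. The names are assumed spellings (the pre-allocating one is written as vector_prealloc here), and the benchmark at the end only runs if microbenchmark is installed.

```r
# Grow by assigning one position past the current end on each pass.
vector_grow_br <- function(x) {
  output <- numeric()        # starts with length 0
  for (i in 1:x) {
    output[i] <- i^2         # R extends the vector when i is past the end
  }
  output
}

# Allocate the full length once, then fill it in.
vector_prealloc <- function(x) {
  output <- numeric(x)       # x zeros, allocated up front
  for (i in 1:x) {
    output[i] <- i^2
  }
  output
}

n <- 1e4
# Both approaches should produce exactly the same vector.
stopifnot(identical(vector_grow_br(n), vector_prealloc(n)))

# Optional timing comparison; skipped when microbenchmark is not installed.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    vector_grow_br(n),
    vector_prealloc(n)
  ))
}
```

The only difference between the two functions is the first line of the body, which is why the comparison isolates the cost of extending the vector inside the loop.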
So if I did something like 1:10, I get those 10 values, and if we then raise that to the second power with 1:10^2, this is not exactly what I want. This is going from one to 10 squared, so one to 100, right? What I want actually is each value in the vector raised to the second power. So I need to put that vector in parentheses, (1:10)^2, and now I get 10 values where each is the square, okay? I'm gonna call this approach vector_colon. So we do function with x, all right? And then we will do 1:x in parentheses and raise that to the second power, and we'll go ahead and load this. I'm gonna strip out some of this white space so it's not such a long script here. And then I will go ahead and add this to my testing regimen. Wow, so that's even faster than what we had before. That's pretty nifty. And this is happening because the colon notation is an R construct, right? And R is basically executing this under the hood in C, which is optimized to be really fast anyway.
This is a bit of a non sequitur, if you will, because we're not going to have a function that's as simple as raising a consistent series of numbers to a power, right? The functions that I'm using are gonna be a little bit more complicated. I'm using this operation as a surrogate for something more complicated. But anyway, it's useful to give us a sense of different ways to make vectors and how those approaches perform when we embed them in other functions. There's another way to make a vector of numbers, and that's the seq function, right? So you could do seq from one to 20 and give it something like by = 2. And this will give you all the numbers from, I guess, one to 19 by twos, right? You could also do seq from one to 20 by ones to get all those numbers. So again, let's go ahead and repeat this. And instead of vector_colon, I'm gonna call it vector_seq. And we'll replace the colon with seq(1, x, by = 1).
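The precedence point and the two vectorized functions can be sketched as follows. The function names are assumed spellings of the ones spoken in the episode.

```r
# ^ binds more tightly than :, so the parentheses change the result.
length(1:10^2)    # 1:100, so 100 values
(1:10)^2          # the ten squares 1, 4, 9, ..., 100

# Build the whole vector at once with the colon operator.
vector_colon <- function(x) {
  (1:x)^2
}

# Same idea with seq(); by = 1 spells out the step size explicitly.
vector_seq <- function(x) {
  seq(1, x, by = 1)^2
}

seq(1, 20, by = 2)   # the odd numbers 1 through 19
print(vector_colon(5))
print(vector_seq(5))
```

Neither function needs a loop or a pre-allocated output, because the squaring is applied to the whole vector in one step.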
I think by = 1 is the default. So if you did seq(1, 20), yeah, one is the default. So it's just there to convince us that it's going by ones. So I'm gonna go ahead and add vector_seq now to my pipeline here. So we see that the seq approach is slower than the colon approach, but it's still faster than pre-allocating.
This output is getting a little bit cumbersome because I'm generating a number of different functions here and I'm having to eyeball how they compare to each other. So what I'd like to do, instead of looking at this table, is to output it using something like dplyr, where I can then sort the different expressions. And so if I take .Last.value and pipe that to str, I see that this is a data frame that's got two columns, the expr column and the time column. So what we could do is take this and pipe it to group_by on expr and then summarize, and we'll summarize to get a median_time as the median of time. So here we go. We have, I think, a little bit cleaner output. So I think that the time here has been changed to nanoseconds. So something like vector_colon is at 19396, whereas back here it was 21, right? So if you multiply that by 1000, you get 21,000. It's within a couple thousand. It's really close. And again, I'm mainly interested in the relative difference between these different functions.
One of the things that we've seen already is that growing our vector using the c function is slow, but pre-allocating the memory or the space needed for the vector is much faster, right? So what's happening under the hood in R is that R is doing something called copy on modify. And what that means is that whenever something gets modified, so if we had, say, x equals five and then I change x to be 10, it's looking for a new spot in memory to store x. In other languages, x stays put in memory and its value might change, right?
And so because R is always looking for a new place to store an object, that can get slow, especially as things get quite large, right? And so you can imagine that as we're growing a vector and it keeps getting longer and longer and longer, R has to keep looking and looking in my RAM for a new place to store this object. So I'm not totally sure why the bracket version of the function performs so much better than growing it with the c function. I don't really know, because obviously in both cases output is a vector, and the bracket version is, again, growing it one element per loop. And so I don't totally understand that, but what we do find is that when we allocate a complete chunk of memory so the output is the correct size, that performs even better. Of course, when we're doing this with 1:x or with the seq function, R knows upfront that it needs this big chunk of memory, right? And it's perhaps even doing it in C. So it's not doing this copy on modify. So it basically creates the vector in C and then brings it back into R with both of these functions, and seq is probably a bit slower than the colon because it's doing a little bit extra under the hood, since it has this capacity to jump seats even though we're not using it in this case.
So another bit of R lore out there is that for loops are evil and that you should avoid for loops at all costs. So I wanna repeat this, but instead of using a for loop I wanna use the sapply function. And so we'll do vector_sapply and we'll do function with x. And then here what we'll do is sapply, and we'll go over 1:x. So this is a little bit silly again, because I just created the vector much like I did up here, right? Anyway, I'm then gonna use what's called an anonymous function. So an anonymous function is a function without a name. It's anonymous, right? And so I can write that by using a backslash, an open parenthesis, a variable name, and then a closed parenthesis.
Maybe instead of x, I'll call that i to keep things simple. And then what I'll do is i to the second power, okay? And so this is a function that's going to take the values one through x, raise each to the second power, and then create a vector as output. If we say x is five and then run this line, I get the expected output, right? So I'll go ahead and load that and maybe clean up my white space here. And then we're gonna take vector_sapply and add that to this microbenchmarking series of functions. So we see actually that sapply is quite slow, which is really surprising to me. I'm not really sure why that is. I think it's perhaps because sapply is still running a for loop, although, yeah, it's just really surprising that it's actually even slower than pre-allocating. Let me double check. Yeah, it's the same type of syntax. So maybe there's something going on with sapply that makes it just so much slower.
Maybe something else I'll try will be lapply instead of sapply. So sapply will take list output and force it into a vector if it can. And so it's doing a little bit of extra work, right? And so let's try it with lapply, and I'll go ahead and add that as well down here, changing that s to an l, and nope, it didn't make any difference. So that's a bit of a red herring. So it's really surprising to me that sapply and lapply are so much slower than an actual for loop when you've pre-allocated the space. So that's really significant. And so I think ultimately what we would do with a vector would be to pre-allocate it. It's not really growing a vector when it's pre-allocated. I should change that name from grow to pre-allocate because we're not growing it, right? We're pre-allocating the space and then filling it. So I'll go ahead and fix that name because it's a bit confusing. And so test that again.
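Here is a sketch of the sapply and lapply variants next to the pre-allocated for loop, followed by the median-per-expression summary described earlier. Function names are assumed spellings, the lapply version is left returning a list because the narration only mentions swapping the s for an l, and the timing part runs only if both microbenchmark and dplyr are installed. The \(i) shorthand and the |> pipe need R 4.1 or later.

```r
# Apply an anonymous function to each position; sapply simplifies to a vector.
vector_sapply <- function(x) {
  sapply(1:x, \(i) i^2)
}

# Same call with lapply, which returns a list rather than a vector.
vector_lapply <- function(x) {
  lapply(1:x, \(i) i^2)
}

# Pre-allocated for loop, repeated here so the block stands on its own.
vector_prealloc <- function(x) {
  output <- numeric(x)
  for (i in 1:x) {
    output[i] <- i^2
  }
  output
}

print(vector_sapply(5))

# Optional: median time per expression, sorted from fastest to slowest.
if (requireNamespace("microbenchmark", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  n <- 1e4
  timings <- microbenchmark::microbenchmark(
    vector_prealloc(n),
    vector_sapply(n),
    vector_lapply(n)
  )
  medians <- timings |>
    as.data.frame() |>                              # columns: expr, time
    dplyr::group_by(expr) |>
    dplyr::summarize(median_time = median(time)) |> # time is in nanoseconds
    dplyr::arrange(median_time)
  print(medians)
}
```

The actual timings will depend on the machine and the R version, so the ordering seen in the episode is not guaranteed to reproduce exactly.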
But anyway, I think when we think about converting our data into a more usable format that is more sparse, it's gonna be more likely that we know, say, how many k-mer values there are. And so we can pre-define the amount of space we need and then we can use a for loop to go back and fill that in. And we're probably not gonna have such a charmed life where we can use something like seq or the colon to plug that in.
One other thing that you will often hear when people talk about performance in R is that if you go through all these steps and your code is still too slow, then you should use C++. So what we can do is use the Rcpp package. There's also a cpp11 package, both of which allow you to write C++ code for use in R. I'm a little bit more familiar with Rcpp. And for this benchmarking, I'm gonna try this out to see if we can make a C++ based function that we can then use inside of our pipeline here. And so I'm gonna go ahead and create a new C++ file. And it has all of the great stuff that I need here already plugged in, which is fantastic. So I'm gonna clean some of this up. And so we've got this header space with using namespace Rcpp. So I'm gonna call this function vector_rcpp. And it's going to take an int, an integer value, x. And let's see, let's go ahead in here. It's been a while since I've done C++. So bear with me here. We're going to create an output, which is going to be a NumericVector. And output is going to be of length x, right? So we're gonna pre-define that. C++ code uses semicolons at the end of each statement, which you can see here, right? And then we'll do for (int i = 0; i < x; i++). And so one thing I'm noticing right off the bat, actually, and it's coming back to me, is that C++ is zero based. So the first slot in a vector in C++ is slot zero, whereas in R it's slot one. So that's a little bit different.
So I'm gonna change what I'm incrementing over here to go from one up to and including the value of x, right? And then for my output, I will set output[i - 1] equal to i times i. So the circumflex, the thing over the six, which we use for power in R, does not work for power in C++. So that's a bit odd, right? And so then we will return output. And I'll go ahead and save this as benchmarking.cpp. Back in the benchmarking R script, I need to do library(Rcpp). And let's put in a space here. And then I need to compile it. And I can compile it by calling sourceCpp on that .cpp file. And if I run that line, it then compiles my C++ code. Hopefully it goes through without any issues. That worked. And then again, my function name was vector_rcpp. So I can add that to my testing rig here, give that a shot, and see how the C++ code fares compared to everything else.
So what I find, as perhaps expected, is that writing this function in C++ actually outperforms everything else, right? And so I don't know that it'll come to us writing the code for our package in C++. I think the downside of writing things in C++ is that it's perhaps harder to maintain, because not as many people out there know C++ well enough to contribute to the package and help us maintain it as we go forward. But at the same time, it's a lot more performant, right? So if we take pre-allocating here at 632,000 and divide by what we have here at 17,000, you know, it's a 37, say 40-fold difference in how long it takes to calculate that vector of numbers. So that's pretty interesting.
One other thing that often comes up is how the amount of effort it takes to calculate a vector scales, right? And so we've been doing this for 10,000 values in our vector, and we've looked at different ways of generating a vector. Well, we could imagine doing the same type of thing, but doing it with fewer or more values in our vector, right? And so, you know, the question is, how does growing a vector or filling a vector change based on the size of the vector?
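The C++ function built up above can be sketched from the R side like this. To keep the example in one file it uses Rcpp::cppFunction on a string instead of the separate .cpp file and sourceCpp call used in the episode, so treat that as a substitution made for this sketch. It also writes the square as 1.0 * i * i rather than i * i, an adjustment that keeps the product from overflowing an int for very large x. The block does nothing if Rcpp or a C++ compiler is unavailable.

```r
# C++ source for the function; cppFunction adds the Rcpp header and
# the `using namespace Rcpp;` line for us.
cpp_code <- '
NumericVector vector_rcpp(int x) {
  NumericVector output(x);            // length x, filled with zeros
  for (int i = 1; i <= x; i++) {
    output[i - 1] = 1.0 * i * i;      // C++ slots start at 0; ^ is not a power operator
  }
  return output;
}'

vector_rcpp <- NULL
if (requireNamespace("Rcpp", quietly = TRUE)) {
  # Compilation needs a working C++ toolchain; fall back to NULL if it fails.
  vector_rcpp <- tryCatch(
    Rcpp::cppFunction(cpp_code),
    error = function(e) NULL
  )
}

if (is.function(vector_rcpp)) {
  print(vector_rcpp(5))
}
```

Once compiled, vector_rcpp can be called like any other R function, including inside a microbenchmark call alongside the pure R versions.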
I think we would like to think that if we double the size of the vector, it might take twice as long on average to fill that vector. So I'm gonna go ahead and copy this down, starting with grow c, and I'm gonna divide all these values by 100 to give us, say, 100 values, right? And so we'll see how doing this analysis on 100 compares to 10,000 and how the different functions change in their performance. So what we see is that, again, our vector_rcpp and vector_colon are the fastest of the bunch. And vector_colon with the smaller dataset performs better than our Rcpp code. But again, as the dataset gets bigger, we see that our Rcpp code runs faster than the colon based code. Again, we're not gonna have data as nicely organized as we found with, say, 1:n, something like that. And so it's really reassuring and edifying to see that the Rcpp code runs so much faster.
I think the other thing that we might see is the difference when growing it with that c function, right? Up here, let's see what the difference is between these two values. So with a hundred-fold difference in size, we would expect it to take 100 times longer to run, right? And what we actually find is that it's more than that. It's 10 times more than that, it's about 2000, right? So if we talk about big O notation, it's not quite N squared. So it's not quite that you double the length of the vector and you quadruple how long it takes to fill it, but it's approaching that, and that's not good, right? And so we can also look at some of the others. And so the other one that did pretty well was this vector pre-allocation and, where's, yeah, here's the other one here. So let's compare those values. And so that's, again, a hundred-fold difference in size that results in a 52-fold difference in how long it takes to perform. That's pretty good, I would say. We also see that sapply is a bit slower, but it's perhaps not quite a hundred-fold.
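One way to put numbers on "not quite N squared" is to turn the ratios quoted above into a rough scaling exponent. The sketch below assumes run time grows like n^k, which is a simplifying assumption made for this illustration and not something measured in the episode. Then k = log(time ratio) / log(size ratio), with the size ratio being 10,000 / 100.

```r
# The vector got 100 times longer between the two benchmark runs.
size_ratio <- 1e4 / 100

# Exponent k in time ~ n^k implied by an observed time ratio
# (assumes a simple power law, which is only an approximation).
scaling_exponent <- function(time_ratio) {
  log(time_ratio) / log(size_ratio)
}

k_grow_c   <- scaling_exponent(2000)  # "about 2000" times longer with c()
k_prealloc <- scaling_exponent(52)    # 52 times longer when pre-allocating

print(round(c(grow_c = k_grow_c, prealloc = k_prealloc), 2))
# grow_c comes out near 1.65 and prealloc near 0.86
```

An exponent of 1 would be perfectly linear and 2 would be quadratic, so growing with c() sits well above linear while pre-allocating stays at or slightly below it.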
Let's see what we get here. Yeah, it's about 60, right? So I think what we're gonna find ultimately is that this Rcpp code is just a lot more performant than things that are baked into R. We'll talk about that later, but there's gonna be trade-offs in whether or not we go down that road of using C++ code with our R package. So hopefully this was useful and interesting for thinking about how to grow a vector. I know there's a lot of urban legends, if you will, out there about vectors in R and whether or not you should use for loops. I learned something here. I always thought that sapply was at least as good as a for loop. Maybe that would change under different situations, but I think the key is pre-allocating your vector.
So this has been generating a vector, right? Setting values in a vector. I wanna look at it from a different angle, which is getting values out of a vector. There's a variety of ways that we have to get values out of a vector. I wanna see if those vary in how long they take to run. So if I have a vector, which I'll call x, that goes from one to, say, 100, right? And let's make this a little smaller. So we have a variety of ways that we can get values out of this vector. We could do something like x[5] to get the fifth value out of it. We could keep the theme going with the squared values that we had before. Yeah, so let's do that. So we get 25 for five, right? We could also do x[c(5, 10)], so we should get 25 and 100, right? Sure enough, we get that, right? So there's a variety of different ways that we have to get these values out. So I'm gonna do vector_get_one, and this will be a function that's gonna take in a vector of values and then return x at position five, which I'm picking for lack of anything better, right? Cool. So I'm gonna repeat this where we're gonna get 10 values, okay? And so I'm gonna get individual values. So that's five, six, seven, eight, nine, 10 lines. And let's go by fives here.
So we'll do 10, 15, 20, 25, 30, 35, 40, 45, 50, okay. So we're not really gonna return all of them, but this is gonna go through the steps of getting 10 values from this vector. So the question then is, how does this compare to getting one value back, okay? And so I'm again gonna do the microbenchmark on this, and we'll give it these two functions, which are vector_get_one, and we'll give it a vector I'll initially call long, and then vector_get_10 with long. So I need to define long. So long is going to be one to 1e4, squared. I'm gonna go ahead and grab my code from up here where I was making my own summary statistics. Oh, and I need to load long. And I need to load these functions. When I'm creating these test functions, I always forget to load them. We'll get that going. Okay, so that went pretty quickly. And so we see that getting one was faster than getting 10. It's not quite 10x, but there's probably other stuff going on in the functions that causes it to not quite scale linearly, right?
Cool, all right. So this is kind of a silly way to do it. We would only output from this function the value at position 50. So if I do vector_get_10 on long, I'm only gonna get 50 squared, which is gonna be 2,500, right. So I would rather get a vector out, right? So let's go ahead and copy vector_get_10. And I'm gonna call this one vector_get_c because I'm gonna combine these together in a vector. All right, so I got that. And so I'm gonna add vector_get_c with long. So as we've already seen, creating a vector takes time, right? And so maybe what we'd rather do is have a vector of numbers that we can then use to index into x, right? And so I'm gonna go ahead and do vector_get_index, function with x. And then I'm gonna say index is gonna be, I guess I need the c function, five, 10, 15, 20, 25, 30, 35, 40, 45, 50, okay? And then I'm gonna return x at index. All right, and I'll add vector_get_index to this.
So that actually takes quite a bit of time. And one thing that occurs to me is that I don't know that this is an apples to apples comparison any longer, because I'm creating this index vector inside the function. And theoretically, if I was using this in my own function or in my own package to get a value out, I'd have other code that's generating index. So I think I need to copy this line into all of the functions, even though it's not used in the other functions, so that we have the same type of overhead in each of our functions. So I'll go ahead and reload these and rerun the benchmarking. And so what we find then is that the index approach actually turns out to be quite a bit faster, and that the creation of the vector takes most of the time. And returning one value is gonna be faster than returning 10 values, which we saw earlier.
So I'm gonna then try, again, what we did earlier, where we changed the length of the vector. And so I will create another vector that I'll call short. You probably knew I was gonna do that when I named one long, right? So I'll do that, and let's go ahead and define this as one to 100, and again, all those values squared. And then I'm gonna come in here and add short as the argument to these. And we'll go ahead and run that. And that's quite fast. And again, what we find is that for the most part, there is no effect of the length of the vector on the performance, which is pretty cool. And again, what is happening when we get access to a vector is that R knows where a vector starts. And then it says, I want position, say, 55 or 50, right? And so it basically can jump directly to position 50 to get that value. Those of you that have been watching know that I went through this process of converting a string into a number. Some people call this hashing. We took it from base four notation for a DNA sequence into base 10, right? And so that is the advantage, right? We can basically jump straight to it.
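The four look-up functions, in their final form with the index line copied into each one so they carry the same overhead, might look like this. It is a sketch reconstructed from the narration: names are assumed spellings, and the benchmark call only runs if microbenchmark is installed.

```r
# Return one value; index is created but unused, to match the others' overhead.
vector_get_one <- function(x) {
  index <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
  x[5]
}

# Ten separate look-ups; only the last one is actually returned.
vector_get_10 <- function(x) {
  index <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
  x[5]; x[10]; x[15]; x[20]; x[25]
  x[30]; x[35]; x[40]; x[45]
  x[50]
}

# Ten look-ups combined into a new vector with c().
vector_get_c <- function(x) {
  index <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
  c(x[5], x[10], x[15], x[20], x[25], x[30], x[35], x[40], x[45], x[50])
}

# One vectorized look-up using the index vector.
vector_get_index <- function(x) {
  index <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
  x[index]
}

long  <- (1:1e4)^2   # 10,000 squares
short <- (1:100)^2   # 100 squares

print(vector_get_index(long))

# Optional: compare look-ups on the long and short vectors.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    vector_get_one(long),   vector_get_one(short),
    vector_get_10(long),    vector_get_10(short),
    vector_get_c(long),     vector_get_c(short),
    vector_get_index(long), vector_get_index(short)
  ))
}
```

Because every requested position is at or below 50, the long and short vectors return identical values, which makes the comparison purely about the effect of vector length on look-up time.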
And so I think that's reassuring, right? And so growing a vector scales at least with the length of the vector, whereas getting access to the vector doesn't vary. And so we might say filling a vector after pre-allocating the memory is on the time order of N, the number of elements in the vector, whereas getting access to a vector is on the time order of one, that it's constant. It doesn't depend on the length of the vector. And we see that we should more than likely use the index-based approach, right? And again, this is using R's built-in vectorized approach. This is very much like why you should use 1:10 rather than c(1, 2, blah, blah, blah) out to 10, right? The built-in vectorization of R is very powerful and very fast because, again, it's using C under the hood to make it so much more efficient, right?
So again, I hope this is edifying and teaching you a little bit about the different tools that we have to build vectors and get access to values in vectors. I know people who come from other programming languages often get really frustrated with R because there are so many ways to do the same thing, whereas in other languages, there's one way to do it. Well, that gives us some flexibility. It can also give us some headaches. And so this will be really useful to be thinking about when we go forward with our own project. Vectors, again, hold a consistent type of data across all values within that data structure. The seats, if you will, the slots in that vector, are all consecutive. You can think of them as being next to each other, if you will, in memory. An alternative that we'll talk about in the next episode is what's called a list. And a list is a much more relaxed vector. And so we'll talk about those and go through the same type of benchmarking and then compare them back to what we got here with the vectors.
So that you don't miss that episode, please make sure that you've subscribed to the channel and we'll see you next time for another episode of Code Club.