Hey folks, welcome back for another episode of Code Club. In a recent episode, I showed you a variety of ways of building vectors. Some are better than others, right? I also showed you a variety of ways of getting values out of those vectors. Well, I got a lot of really positive feedback from that episode, and people asked questions down in the comments. As I was doing some additional reading on related things, a number of other questions came to mind. So I thought it'd be fun to add those to our testing system and see how the different methods that people suggested compare. Also, I think I can make our testing approach a bit more robust so that we can see the true impact of changing the size of our vectors, both on generating the vector and on getting values back out of it. So let's head back over to our benchmarking.R script. If you want to get a copy of this script, down below in the description is a link to the GitHub repository where we're storing everything. The link that you'll find down there is to the current state of the repository as I'm speaking right now, before I write this code. There's also a link there for the end of the episode. I would strongly encourage you to code along with me because I think you'll get the most out of it. Along the way, you might have your own questions, and you might want to fold those questions into the analysis I'm doing to see how you'd be able to answer them. Here again, we have benchmarking.R. We load a variety of different libraries. microbenchmark will allow us to do the benchmarking. tidyverse will allow us to do some data manipulation that'll make things easier. Rcpp will allow us to write code in C++. We saw that that version actually was blazingly fast, so much faster than all the other code. And then I have a variety of functions here.
The first was growing a vector using the c function. This proved to be the worst. Then we had vector_grow_br. Basically what this did was define output to be a numeric vector, and then using square brackets, we filled it in. We also pre-allocated a vector, so not just declaring it as a numeric vector, but saying it had a certain number of values in it, and then filling it in. With vector_colon, we're giving a value, the size of the vector, and then using the colon notation. So 1:5 would give you 1, 2, 3, 4, 5, right? And then we're squaring all those values. Another approach to generating a vector of numbers is the seq function here. We also have a couple of functions, vector_sapply and vector_lapply. What sapply does is take a vector of values, like we had up here with the colon notation, and then we can use this anonymous function to iterate over all values of 1 through x. An astute observer noticed that my vector_lapply and vector_sapply functions did the exact same thing, and that this one should really be lapply. Ah, so let me go ahead and change that to lapply before I forget. Maybe I should take two steps back. vector_lapply will take a vector of numbers from 1 to x, so again, say 1 to 5. And then for each of those values individually, it'll square the value, right? That returns a list, so lapply returns a list. sapply is the same thing, except it converts the list into a vector if it can do so. Okay, so we got that fixed. That's problem number one resolved. Finally, we're also calling sourceCpp. So we're compiling the C++ code in benchmarking.cpp, which is a C++ version of how I would go about generating this vector of squared values. So I will go ahead and load all this good stuff into R and then run microbenchmark.
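For anyone following along without the repository open, here is a minimal sketch of the pure-R functions just described. The function names follow the ones spoken in the episode, but the bodies are reconstructed from the description and are an assumption, not a copy of benchmarking.R. The Rcpp version is left out so that the sketch runs in base R (the backslash shorthand for anonymous functions assumes R 4.1 or later).

```r
# Assumed reconstructions of the functions described in the episode.

# Grow the vector by concatenating one value at a time with c()
vector_grow_c <- function(x) {
  output <- c()
  for (i in 1:x) {
    output <- c(output, i^2)
  }
  output
}

# Start with an empty numeric vector and fill it with square brackets
vector_grow_br <- function(x) {
  output <- numeric()
  for (i in 1:x) {
    output[i] <- i^2
  }
  output
}

# Pre-allocate a numeric vector of length x, then fill it in
vector_prealloc <- function(x) {
  output <- numeric(x)
  for (i in 1:x) {
    output[i] <- i^2
  }
  output
}

# Vectorized versions: build the whole sequence, then square it at once
vector_colon <- function(x) (1:x)^2
vector_seq <- function(x) seq(1, x, by = 1)^2

# Apply versions: square one value at a time
vector_sapply <- function(x) sapply(1:x, \(i) i^2)  # simplifies to a vector
vector_lapply <- function(x) lapply(1:x, \(i) i^2)  # always returns a list

vector_colon(5)
```

Every function except vector_lapply should return the same numeric vector of squares; vector_lapply returns the same values wrapped in a list.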
To remind ourselves what this looks like, I sorted the output of microbenchmark according to the median time. microbenchmark basically ran each of those functions 100 times and then outputs the timings as a table. We then wrote some dplyr code to sort these in descending order of the median time to run the function. We also have n here. Again, n was 10 to the 4th, so n divided by 100 would be 100, right? That means the vector in this case was 100 units long, whereas up here it was 10,000 units long. What we can see again is that growing it with the c function is horrible. It takes so long. Then sapply and lapply. Before, I had them running in the same time, of course, because I gave them the same function. What we see now is that lapply is a smidge faster than sapply. Again, sapply is converting that list into a vector. Regardless, it's slow, right? We can also see things like growing by brackets, where we defined output to be a vector and then filled it in using the square bracket notation. Pre-allocating the space is where we start to see some really significant improvements in speed. The seq function is, again, a bit of an artificial construct, because rarely are we going to have such evenly spaced data that we can iterate over or vectorize over. The seq and the colon are really doing the same type of thing. But of course, we see that our C++ version of the function is much faster than everything else. I also have the versions dividing by 100 in here. For now I'm going to remove all that code from my microbenchmark call to keep things a little clearer, since I'm going to be adding stuff and that might make things more confusing. Okay, that's a bit simpler. Again, we have eight different ways of building a vector, and we're going to add a few more to this.
I want to keep things a little simpler, and we'll come back to this issue of how these times vary by the size of a vector before we're all done. There are two approaches that I want to build that make use of a pipe. What we could do would be something fairly similar to what we had up here with vector_colon. Again, I could do, say, 1:5, and that will give me a vector of values 1 to 5. What you might be used to doing is using a pipe like this one from the tidyverse, which actually comes from a package called magrittr. We can use an anonymous function inline in a pipeline like this. Again, we had an anonymous function up here, right? So what I could do is have two sets of parentheses. In the first set of parentheses, we put the anonymous function definition. And then we have a second set of parentheses. My understanding is that the second set of parentheses tells R to basically run the function defined in the first set. It's a little bit weird, but anonymous functions are kind of weird too. Anyway, let's replace this with our anonymous function. That will be backslash, parentheses, x, like we did up here, and then i to the second power. That doesn't work because I wrote i here instead of x, which is what I declared. I'm going to replace those x's with i's, and that then gets us the squares. So that does what we want, right? This is, again, very similar to what we did up here. The function here is taking 1 to 5, inserting it as the argument i, and raising it to the second power. It's different from what's happening in sapply and lapply. What's different is that sapply and lapply square each value sequentially, whereas here the function takes the whole vector and squares all of its values at once. It's a subtle but important difference.
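To make that difference concrete, here is a small base-R sketch that counts how many times the squaring function is called in each style. The counter environment and the name square are hypothetical helpers added only for illustration; they are not part of the episode's script.

```r
# Hypothetical helper: an environment that counts calls to square()
counter <- new.env()
counter$n <- 0

square <- function(i) {
  counter$n <- counter$n + 1  # record that the function was called
  i^2
}

# Vectorized style: the whole vector is passed in a single call
whole <- square(1:5)
calls_vectorized <- counter$n

# sapply style: the function is called once per element
counter$n <- 0
each <- sapply(1:5, square)
calls_sapply <- counter$n

calls_vectorized  # 1
calls_sapply      # 5
```

Both styles return the same squares; the difference is one call on five values versus five calls on one value each.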
And if you recall how we talked about pre-allocating the memory, that loop is a bit more like what's happening in sapply and lapply than what we're seeing here, right? This also raises an interesting question. This is the magrittr pipe, and we might also want to try the base pipe. So I'm going to call this vector_magrittr, a function of x, and then we'll use 1 to x. Then I'm going to also do the base pipe. We'll copy this, rename it to vector_base, and that'll take x also. Oh, I need to change the pipe, so let's go ahead and do that. Let me just double check that it works after replacing the pipe. Good, that works too. This might show a difference in performance between using magrittr and using the base pipe. The base pipe is seen more as syntax. What happens is that the base pipe will basically take this function and put the left-hand side in as its argument before it executes it, right? So it's rewriting your code at a syntactic level and then running the code. Whereas, as I understand it, the magrittr pipe is much more of a function: it's a function that is going to evaluate this, and then take that value and put it into that. Again, these are subtle differences. Maybe it will, maybe it won't make a difference in performance. We'll do the experiment and see. So let's save that and insert it into our microbenchmark setup. We'll do vector_base on n, and then vector_magrittr on n. We'll give that a run. And I forgot to load my functions, which is something I recall doing last time too. So let's try that again. Now it's running, and we'll find the result. All right, so that completed. Let's find our pipes. We now see our magrittr pipe and our base pipe down here. They actually ran really fast, better than seq, but not as good as the colon.
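The "base pipe is syntax" idea can be checked directly: quoting a piped expression shows that the parser has already rewritten it into an ordinary function call. This is a minimal base-R sketch assuming R 4.1 or later; the body of vector_base is a reconstruction from the description above, not a copy of the script.

```r
# The parser rewrites lhs |> f() into f(lhs) before anything is evaluated
piped <- quote(1:5 |> sum())
piped  # prints sum(1:5)

# Assumed reconstruction of the base-pipe function from the episode
vector_base <- function(x) {
  1:x |> (\(i) i^2)()
}

vector_base(5)
```

By contrast, magrittr's %>% is an ordinary R function that is called at run time, which is one plausible reason for the small overhead seen in the benchmark.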
What we do see is that the magrittr pipe is about 50% slower than the base pipe. Again, you're not usually going to notice this difference when you're doing your day-to-day data analysis. But it's important to remember that the pipes have differences, right? There are certainly things that the magrittr pipe will do that the base pipe currently won't, so there are reasons that you might want to keep using the magrittr pipe. I have a previous episode where I talk about the differences between the pipes, and I think I may have found an odd situation where the magrittr pipe did better than the base pipe. In this case, at least for this size of vector, the base pipe does better speed-wise than the magrittr pipe. So that's cool. All right. That was an interesting suggestion that I'm glad someone asked about. It allows us to compare these two different pipes, as well as try something new for making anonymous functions. Related to the lapply and sapply functions is the map set of functions from the purrr package. I'm curious to see how those perform relative to sapply and lapply. I think purrr comes with tidyverse, but we'll load it anyway just to be explicit. So we'll do purrr, with the three r's there. Go ahead and load that and then come back down here. The syntax is going to be very similar to what we saw with lapply and sapply. I do map, then 1 colon, let's do 5 for now, and then my anonymous function. Wrong key: backslash, not backtick. And then I do i and then i squared. This actually returns a list. We know it's a list because we can see the double bracket notation, right? So it's a list with five elements, and each element holds a vector with a single value. This is a lot like the output from lapply, right? So if we had run lapply instead of map, we'd see... lapply, what are you doing? I don't want all that stuff.
Sometimes RStudio throws stuff up because it's trying to be helpful. And again, we get the same output, right? But we want a vector of type double. There's a variety of map functions, and to get a vector of type double, we'll use map_dbl. This generates output that is very much like what we saw with sapply, right? So if I do sapply instead, we get that. All right, so let's turn this into a function. We'll call it vector_map, a function of x, with the body inside our curly braces. And again, I want to be sure to change this 5 to an x. Now we have our vector_map function loaded, and we can add it in here as vector_map with n. We see that vector_map is about the same speed as sapply, actually a little bit slower. But by no means is it as bad as vector_grow_c. I think map has some nice features that make it easier to use, but we can see that it clearly takes a performance hit. There are a variety of other map functions related to map_dbl: map_chr, which returns a character vector, and map_lgl for logical. There's map_dfr, which will allow you to concatenate together data frame rows. The final thing I want to test out is something that I learned recently while reading Hadley Wickham's Advanced R book. There's a second edition out. I guess it's not super new at this point. There's a version available online, but I like having the physical copy to look through. One of the things that he talks about is that his convention is to use double brackets whenever you're expecting to get one value out, and single brackets when you expect to get a collection of values out. What does that mean? Well, take my favorite vector of letters, the lowercase letters. If we do letters with a single bracket and, say, 2, I'll get a b. If I do 20, I'll get a t. I could also do letters with two brackets.
I actually didn't know this was possible until I read it in the book. And you get a b. Typically we use that double bracket notation with lists, or at least I do, and that will give you back the vector that is in that seat of the list. The single brackets, as I mentioned, allow you to return a collection of values, right? So if I did 20:22, I'd get t, u, v. If I tried that with double bracket notation, so again 20:22, it complains, right? So double bracket notation returns only a single value, while single brackets can return one or many values. Hadley Wickham's point is that, to keep the mental model straight, he likes to use double brackets whenever you're expecting one value and single brackets when you're expecting multiple values. I was wondering whether there's any kind of performance difference between using a single bracket and a double bracket. I kind of suspect not, but hey, we've got a testing rig here, so let's check it out. I'm going to come back up here to where we had vector_prealloc, which uses the single bracket. I'll rename this to vector_prealloc_sng. Then I'll copy it and make vector_prealloc_dbl, and I will put double brackets around my i there. Make sure you've got that loaded. Then I'll come back down here, change this to vector_prealloc_sng, and add the dbl version. What we find is that we get pretty similar performance. In this case, we got a median runtime that's a little bit faster with double brackets than with single brackets. If we run it again, I bet we'll get a very similar result, or perhaps even the opposite one. Let's give that a shot. And so again, there's a little bit of play. There is some stochasticity in how these benchmarks run, so I try not to take them as absolute values. But thinking about them in terms of relative performance, these two certainly belong in the same class.
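The bracket behavior described here can be reproduced in a few lines of base R. The list named scores is a hypothetical example added to show the list case; everything else uses the built-in letters vector from the episode.

```r
# Single and double brackets both return one element of an atomic vector
letters[2]    # "b"
letters[[2]]  # "b"

# Single brackets can return several values
letters[20:22]  # "t" "u" "v"

# Double brackets refuse more than one index; capture the error instead of stopping
multi <- tryCatch(letters[[20:22]], error = function(e) "error")
multi  # "error"

# With a list (hypothetical example), the two notations differ in what comes back
scores <- list(a = c(1, 2, 3), b = "x")
class(scores["a"])    # "list": a shorter list that still wraps the element
class(scores[["a"]])  # "numeric": the vector stored in that position
```

This is the mental model behind the convention: double brackets always pull out exactly one thing, single brackets return a subset that may hold one or many.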
There's not a huge difference between using a single and a double bracket. I suspect that if we did this a few times, we'd see that they were really close, certainly within a percent of each other. Anyway, I like that idea of using the double bracket for a single value and a single bracket for multiple values. In some ways, that's counterintuitive, right? Because if you have more brackets, you'd expect more values, at least I would. So it might take some time for me to retrain my brain. All right, well, those are the additional tests that I wanted to add. When we started messing around with these microbenchmarking studies today, there was a whole collection of functions that I was running with n divided by 100. Those function calls allowed us to see how the performance compared based on the size of the vector. We did that with two points, n divided by 100 and n, so 100 and 10,000. It's possible that the performance doesn't depend on size, so if you double the number of values in the vector, the time wouldn't change at all. It might be that you double the values and you double the time. That'd be linear. It might be that you double the values and you quadruple the time. That'd be quadratic. And then there are cases in between. With two points, it's really hard to get a sense of that. The other problem with two points is that you don't get a sense of the overhead of basically setting things up. There is a cost to running the function that has nothing to do with the size of the vector. If you were to plot the data with time on the y-axis and n on the x-axis, you might think of that as the y-intercept. So if we gave it no values, or just one value, how long would it take? Let's modify our code so that we can generate that plot. That's what we're going to do with the rest of our time here today.
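The scaling cases above can be written as a toy cost model, time = overhead + rate * n^k, where k is 0 for constant, 1 for linear, and 2 for quadratic. The function cost and all of the numbers below are made up purely for illustration; they are not measurements from the benchmark.

```r
# Toy cost model (an assumption for illustration, not measured data)
cost <- function(n, k, overhead = 0, rate = 1) {
  overhead + rate * n^k
}

# Effect of doubling n from 1000 to 2000 when there is no overhead
ratio_constant  <- cost(2000, k = 0) / cost(1000, k = 0)  # 1: no change
ratio_linear    <- cost(2000, k = 1) / cost(1000, k = 1)  # 2: time doubles
ratio_quadratic <- cost(2000, k = 2) / cost(1000, k = 2)  # 4: time quadruples

# With a fixed setup cost (the y-intercept), doubling n less than doubles the time
ratio_with_overhead <- cost(2000, k = 1, overhead = 1000) /
  cost(1000, k = 1, overhead = 1000)                      # 1.5
```

With only two values of n, the exponent and the overhead cannot be separated, which is why more points along the n axis are needed.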
I will create a function that I'll call mb_by_n. We'll do function and give it the n parameter, right? Then our open brace, and I'll tab the benchmarking code over as the body of my function, so that it uses n as the input. Now I have my function. I could do mb_by_n and give it the value of 10, and then we see how long each approach takes to run for a vector of size 10, right? It's a very similar type of relationship to what we saw before. Although, I guess in this case, running by the colon is actually faster than by C++. I think we saw in the last episode that with a bigger n, you certainly see the benefit of running it in C++. So we want to try this with multiple n values, okay? We'll do that with a map function. We could also use sapply or lapply. But again, the benefit of map is that we can take the output as data frame rows and concatenate those rows together, right? So we'll use map_dfr. We'll go over our ns, and we'll pass each one into mb_by... my fingers aren't working... mb_by_n. I need to define an ns vector, and we'll do that with the c function. Again, the c function itself is fine. Just don't use the c function to grow a vector, don't iterate with it, right? So let's try it with 1 and see what happens, then 10, 100, 1,000. Let's go up to 2,500, 5,000, 7,500, 10,000, 12,500, and 15,000. Okay, so that's loaded. Then we'll run our map_dfr function and generate our data frame for our 12 different functions at however many different ns I have here. It's going to take a little bit of time, because it took some time to run just 10,000, much less all these other values. So that ran through. We could do length on ns and we get 10, right? So we had 10 ns that we used, and we had 12 tests, so we get 120 rows and two columns. One thing I'm noticing is that we don't have a column for the n, right?
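Here is a scaled-down sketch of this step, assuming the microbenchmark, dplyr, and purrr packages are installed. Only two of the twelve functions are included, the ns vector is shortened, and times is lowered so it runs quickly. The summarizing steps and the column name median_time are assumptions about how the script produces its two columns, and the mutate adds the n column whose absence was just noticed.

```r
library(microbenchmark)
library(dplyr)
library(purrr)

# Two stand-in functions (assumed bodies) so the sketch is self-contained
vector_colon <- function(x) (1:x)^2
vector_sapply <- function(x) sapply(1:x, function(i) i^2)

# Benchmark every function at one vector size and summarize by median time
mb_by_n <- function(n) {
  microbenchmark(vector_colon(n), vector_sapply(n), times = 10) %>%
    as_tibble() %>%                          # columns: expr, time (nanoseconds)
    group_by(expr) %>%
    summarize(median_time = median(time)) %>%
    arrange(desc(median_time)) %>%
    mutate(n = n)                            # record the vector size for plotting
}

# Run at several sizes and stack the resulting data frame rows
ns <- c(1, 10, 100)
mb_data <- map_dfr(ns, mb_by_n)
mb_data
```

With two functions and three sizes, mb_data should have six rows; the full version in the episode has 12 functions at 10 sizes, hence 120 rows.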
I think the easiest way to fix that would be to add a mutate to our function, where we do n = n. That way, if I run mb_by_n on, say, 1, then I get a column for n of 1, right? Cool. I'm also going to save this output, because that did take a minute or two to run. I will call it mb_data. We'll run that and then we'll have it stored, and we'll be ready to take mb_data and generate a figure from it. Let's make sure everything looks good with mb_data. We again see that we've got 120 rows. We could look at the tail of this and see that, yeah, it goes out to 15,000 with our different median times. There's probably stuff in our function that we don't need, like this arrange. We don't need that if we're going into a plot, but whatever, we'll leave it. So let's make a plot. It's been a while since I made a plot here on Code Club. I'm excited. We'll take mb_data and pipe that to ggplot with aes. On the x-axis I'm going to put my n, and on the y-axis I'm going to put my median time, right? Then we'll do color = expr. That makes the plotting window. Let's make the line, so we'll add geom_line. Very good. We see that one is off the charts. This looks kind of quadratic, the way it's increasing more than linearly. I suspect this is vector_grow_c(n). I'd like to clean up my legend a bit. There's a lot of extraneous information in here that we really don't need. So I'll do a mutate on my expr variable, and we're going to use str_replace to modify expr. I'm going to use a regular expression to pull out the stuff between the first underscore and the first open parenthesis, right? So I'm going to do vector underscore.
Then I'm going to use a set of parentheses with a period star inside of it. The period star means match any character zero or more times, which is basically match everything. Those parentheses don't mean match a literal parenthesis; they mean save the information that's in here for the replacement value. Now, I don't want the whole thing. I want to remove the parenthesis n. So I will then do backslash backslash open parenthesis to match a literal open parenthesis, then the n, and backslash backslash close parenthesis to match the closing one. Then for my replacement value, I'll do backslash backslash 1. Running that, we now see that we've gotten rid of the vector underscore and the n in parentheses. I will fold that into the rest of my pipeline here. So our legend has been cleaned up, right? Maybe we could also add labs with x = size of vector and y = median time in, I believe, nanoseconds. This is going to be really hard to read because we've got 12 different colors on a rainbow gradient. I know that grow_c is this green line, because in our initial benchmarking that was the one that was off the charts. But the others are going to be really hard to tell apart. Maybe what we'll do first is zoom in. I'll do a coord_cartesian with ylim. Let's go from 0 at the bottom to, let's say, 1e6, a million. That allows us to zoom in a bit. Maybe let's come back out to 5 million. Yeah, we can play with it, but that works well for now. Now let's turn our attention to the colors, right? I'll do scale_color_manual, and then values are going to be the color values. I don't care about setting specific colors for specific values. I think what I want to have is four colors and three shapes, right?
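The regular expression from the legend cleanup can be tried on its own. The episode uses str_replace from stringr; this sketch swaps in base R's sub, which accepts the same pattern and replacement here, so that it runs without extra packages. The example strings are assumed to match how microbenchmark labels the expressions.

```r
# Example labels as they would appear in the expr column (assumed)
exprs <- c("vector_grow_c(n)", "vector_prealloc_sng(n)", "vector_magrittr(n)")

# vector_  : literal prefix to drop
# (.*)     : capture everything that follows, saved as group 1
# \\(n\\)  : literal "(n)" to drop; the double backslash escapes the parenthesis
# \\1      : replace the whole match with the captured group
cleaned <- sub("vector_(.*)\\(n\\)", "\\1", exprs)
cleaned  # "grow_c" "prealloc_sng" "magrittr"
```

Inside a dplyr pipeline the equivalent stringr call would be str_replace(expr, "vector_(.*)\\(n\\)", "\\1") within mutate.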
Then I can use each shape with each color. Four colors times three shapes gives 12 combinations, so each condition will have its own unique combination of color and shape. What we'll do is a rep on four colors. I will do tomato, which is kind of a reddish color, dodgerblue, gray, and then let's also do orange. Nope, I've got to spell orange right. And maybe I'll make that dark gray. Dark, not darn, dark gray. Then I'm going to repeat that three times, right? Because I've got four colors, I repeat them three times. Let's add a plus to that. I now see that I've got those four colors repeated three times. I'd like to add a shape to that. If I do shape = expr here in my aes, it's going to complain because I have 12 values, and by default you can't have more than six shapes because things get hard to read. So what we want to do is add scale_shape_manual, and we'll do values = rep. In here we'll use c, and I'm going to do a square, a triangle, and a circle. Those values I think are 15, 17, and 19. Up here I did rep three times, right? There's another argument for rep that you can use, which is each, so here I can do each = 4. To show you the difference, if I run this, it gets me my 12 colors, and this then gets me my 12 shapes. You'll see that the first four values, the 15s, which I think is a square, will correspond to these four colors. Then 17, which I think is a triangle, will go with these four, and the circle will go with those four. So we'll have each shape used once with each of the colors, right? So this should work if I put a plus at the end here. I'm not getting any symbols showing up, and I'm remembering that I did geom_line but didn't add geom_point. So let's add geom_point. This is good. There might be some things I would change, like getting rid of that background.
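The two uses of rep are easy to compare side by side in base R. The order of the four color names is an assumption, since only the names themselves are spoken in the episode.

```r
# Repeat the whole color vector three times: a b c d a b c d a b c d
plot_colors <- rep(c("tomato", "dodgerblue", "darkgray", "orange"), 3)

# Repeat each shape four times in a row: 15 15 15 15 17 17 17 17 19 19 19 19
# (15 = filled square, 17 = filled triangle, 19 = filled circle)
plot_shapes <- rep(c(15, 17, 19), each = 4)

# Pairing them position by position gives 12 distinct color/shape combinations
combos <- paste(plot_colors, plot_shapes)
combos
length(unique(combos))  # 12
```

These two vectors are what would be passed as values to scale_color_manual and scale_shape_manual, respectively.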
Why don't I go ahead and do that, because it always drives me nuts. I'll add theme_classic, and that cleans up the background a bit. Maybe it doesn't matter, but it matters to me. We see, first of all, that the line going off the charts is the orange one with a square. Sure enough, that's growing the vector with the c function. That's followed by this gray triangle and gray circle. The gray circle is sapply, and the gray triangle is map. I'm going to guess that the tomato-red triangle is lapply. Sure enough, it is. Good. Then we've got growing with the bracket, which was defining output as a numeric vector without defining its length. Down here, in these next two lines, you can see this orange triangle on top of the circle. You can kind of see it behind the circle there. Those are the pre-allocated versions using either the double bracket or the single bracket. So again, no difference between those. This yellow, or orange I should say, circle is seq. Further down, we have other things that are kind of hard to see, but they're going to be things like the colon and Rcpp. The magrittr pipe and the base pipe are probably down there too, right? So let me zoom in even further. I'm going to leave that value there, because I kind of like that resolution for making, say, a thumbnail, right? So let's do 1e6. Yeah, you can see there's a bunch of stuff down in there. Let's go even further, to 1e5. Yeah, this allows us to zoom in pretty well. We see that the Rcpp points are those blue circles. As I mentioned, the colon, this blue square, does better with smaller vectors, but with larger vectors the Rcpp version starts to do a fair amount better. This base pipe is a little bit slower than using the colon, so maybe there's a small performance hit for using the base pipe, but really not much at all. And then the magrittr pipe is a little bit slower still.
So let's scale back out and see how these things are changing generally. Let's go back to 5 million, like we had before. Let's go back even further, to 1e7. We can see that the ones that are on the screen, not including grow_c, all appear to be varying linearly, right? Some of these, like Rcpp, appear to be constant, but clearly when we drilled in, it was going up linearly. What's important to remember is that although they're all scaling linearly, they don't all scale at the same rate. Again, if you double the number of values, they're all going to take twice as long to run. But the time it takes to run the initial n values, before you go from n to 2n, is going to vary between methods. So let's go back even further. Maybe I'll remove that coord_cartesian line altogether. We can then see, as I think I mentioned earlier, that this grow_c line is quadratic, right? I don't know that it's exactly n squared, but it's getting there. So really the absolute worst approach that we could possibly use is iterating and growing a vector with the c function. Anyway, let's put this back to the scale that I want to use for the thumbnail. I think it's easy to look at a list of options like this and say, oh, we should be doing Rcpp, or we should be using the colon, or we should be using seq, or we should be using base R, right? There are reasons why you may or may not do that. Depending on your problem, the colon may work great, or it might not work at all. And there might be situations, like if you're concatenating together data frame rows, where map, even though it's going to be slower than lapply or some of these other approaches, might be the preferred way to go.
Certainly, if you're concatenating together data frames, then map is probably going to be easier than looping or binding things together by rows or all sorts of other things, right? So you can't think of this as one size fits all. One thing that I think is pretty clear is that writing the code in C++ is a lot faster and perhaps more generalizable to a bunch of different problems. The thing to remember is that writing things in C++ is not as easy as writing them in R. Not everybody knows C++. You may not know C++. So should you go learn C++ so that you can improve your performance by a couple of milliseconds, or even seconds? No, that's crazy. If you're in a situation like I think we are, developing an R package, then if it takes me longer to write things in C++ and it saves you, a potential user, seconds or minutes, I think it is worth it, right? Because if you take the time that you save as a user and scale that across however many users we end up getting, it could certainly pay back the time that I invested many fold, and it will make the package that much nicer for people to use than it would have been if I'd done everything in pure R. Another downside to doing it in C++ is that maintaining the code is a real burden, right? If I want people to contribute to this project down the road, and everything's written in pure C++ with a thin R wrapper, I'm not going to get people to help me, because again, not many people know C++. So there are all these tradeoffs. I'd really encourage you not to look at this list of options for generating a vector and say, oh, I've got to use the fastest possible one. For most situations, the run time doesn't matter. What's going to matter to you as an R programmer is how long it takes you to write that R code.
All right, I think we've said enough about vectors for now, probably for a long time, actually. In the next episode, we're going to do a similar type of analysis looking at building lists and getting values out of lists. Lists are a lot like vectors, but they're different, right? So we'll see some similar and some different approaches that we can use for generating a list and getting access to its values. What do you think of this idea of using double square bracket notation for getting out a single value and single square bracket notation for getting multiple values? Let me know down below in the comments where you fall on this. I think it's something that I might try to do going forward in some of my R development, and we'll see if it helps me with my mental model of what the different bracket notations are actually doing under the hood. All right, so that you don't miss that next episode of Code Club, please make sure you subscribe down below, and we'll see you next time.