Hey folks, a couple of episodes back I talked about pipes, and that episode was one of my most popular yet. Thank you for watching it. A lot of great feedback came in, as well as a few questions that I want to take on in today's episode. Three general types of questions came in. The first asked about keyboard shortcuts for inserting a pipe character, so you don't have to type out `%>%` or `|>`. Is there a shortcut for that? Yes there is, and I'll show you how. The second asked, "Pat, aren't you aware that there's a new version of R out that allows you to use a placeholder with the base R pipe?" I wasn't aware of that, and I'll show you how you can do it now. The third comment I received was that there are people who do need to worry about performance and are wondering about the difference in performance between the base R pipe and the magrittr pipe. So we're going to look at all three of those over in RStudio. You'll see right off the bat that, yes, I am using R version 4.2.1; the added functionality to the pipe came with R 4.2. I'd suggest we keep watching the release notes for anything having to do with pipes, because this definitely seems like an active area of development for base R. In my source code I'm using `pipedemo.R`, which I developed in that previous episode. If you want to get a hold of this script, down below in the description there's a link to a blog post that will get you everything you need to follow along. Again, it's not really important what the actual code is or what the data are, just that you see my logic as I work through everything here. So I'm going to go ahead and load my data. This code, `local_weather.R`, gets me a data set of local weather from a NOAA weather station over in Ann Arbor, just a few miles from here. It also loads magrittr.
I don't really need that for this episode, because the typical pipe we're all used to, `%>%`, comes to us with the tidyverse, which is loaded in `local_weather.R`. So the first thing I want to show you is the keystrokes that generate the pipe. The shortcut to insert the magrittr-style pipe is Ctrl+Shift+M, and that should also work on a PC. Alternatively, on a Mac you can also do Cmd+Shift+M, and that will also get you the pipe. If you're having trouble following those keystrokes, you can always go up to Tools and then Keyboard Shortcuts Help. That brings up a monster page of all the different shortcuts you can use to make different things happen. What we're interested in is the Source Editor column: if you come down, you'll see "Insert Pipe Operator", and again that's Ctrl+Shift+M. Of course, this inserts the magrittr-style pipe. What if you want the base R pipe? For that you can go up to RStudio, Preferences; that's where it lives on a Mac, but you can also go to Tools, Global Options to get the same dialog window, and that will work on a PC. If you then go to Code, you'll see there is now a checkbox for "Use native pipe operator"; this requires R 4.1 or greater, which we have. So I check that, scroll down, and hit OK, and now if I do Ctrl+Shift+M I get the `|>` pipe character, and if I do Cmd+Shift+M I also get that pipe. I'm going to leave it at the default of the magrittr-style pipe, so I'll go back to Tools, Global Options, Code, deselect "Use native pipe operator", and click OK. Just to double check: I now get `%>%` again.
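For simple calls, the two operators behave the same even though they look different on the page. Here's a minimal sketch with a tiny made-up data frame (not the NOAA data from the episode); the magrittr line is left as a comment so the chunk runs with base R alone:

```r
# Tiny made-up data frame standing in for the weather data
d <- data.frame(x = c(1, 2, NA, 4))

# magrittr-style pipe (needs magrittr or the tidyverse loaded):
# d %>% na.omit() %>% nrow()

# base R native pipe (R >= 4.1), no packages needed:
d |> na.omit() |> nrow()
#> [1] 3
```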
Those of you who have watched many of these videos know that I'm just used to typing out the three characters, so maybe something I'll work on is developing the muscle memory to use those shortcuts instead. Issue two, then, is the placeholder. If we look down at our code, you'll recall that we created this data frame, `no_na_no_zero`, where we removed all the NA values from our local weather data frame. `local_weather`, as you may recall, has four columns: `date`, `tmax`, `prcp`, and `snow`. So `no_na_no_zero` removes any rows where `prcp` or `snow` are NA and keeps any rows where `snow` is greater than zero. We load that, and we can then feed `no_na_no_zero` into `cor.test()` to calculate the correlation between the amount of precipitation and the amount of snow for a given day, and we find that the correlation is 0.64. I then developed a way, without using a pipe, that uses tools from dplyr like `drop_na()` and `filter()` to run that and get the same basic result: again, where we previously got a correlation of 0.64, here we get 0.64 as well. We then came to using the base R pipe, where we again took `local_weather` and then used `drop_na()` and then `filter()`; these are dplyr functions stitched together with the new base R pipe. When we ran that, I had to kind of drop outside of the pipeline, because I wasn't aware there was a placeholder. If I come down here to the more pure dplyr approach with magrittr pipes, you can see that I can use the period as a placeholder; that period says, take the output of these three steps and plop it right here. So what we did before was break the pipe, assign the value to `no_na_no_zero`, and then plop that in there. Well, I'm here to tell you, along with the people in my comments, that there is a placeholder,
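To preview where this lands: in R 4.2 and later, the base pipe accepts an underscore as a placeholder, as long as it is passed to a named argument. A minimal sketch with a made-up data frame standing in for `local_weather` (the column names are borrowed from the episode; the numbers are invented):

```r
weather <- data.frame(prcp = c(0.1, 0.4, NA, 0.5, 0.3, 0.8),
                      snow = c(1.0, 0.9, 2.0, NA, 0.6, 1.4))

# magrittr pipe: the period placeholder
# weather %>% na.omit() %>% cor.test(~ prcp + snow, data = .)

# base R pipe (R >= 4.2): the underscore placeholder;
# it must be given to a *named* argument
result <- weather |>
  na.omit() |>
  cor.test(~ prcp + snow, data = _)

result$estimate
```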
and that is the underscore. So again, I can add to the pipe, and instead of putting `data = no_na_no_zero` I can put `data = _`. Now when I run that (and I guess I can go ahead and remove this `no_na_no_zero` assignment) I get 0.64 as my correlation value, which is basically what I had before, but subtly different: here it's 0.640411, and here the digits past the third decimal place differ slightly. That little difference is going to matter for something I'll do soon, and the difference is that up here I dropped NAs for `prcp` and `snow`, whereas here it also drops NAs for `tmax`. So let's put `prcp` and `snow` into `drop_na()`, and now we get the same correlation value we had up above, ever so precise. We'll want to do the same thing down here, putting `prcp` and `snow` into the arguments for `drop_na()`; again we get the same value. So what I wanted to show you in this part of the episode is that, similar to how we can use a period with the magrittr pipe to indicate where the data should go if it's not in the first slot, we can now use an underscore with the base R pipe in R 4.2 and later. Something I didn't really mention in the last episode is that a lot of what comes to us from dplyr and the tidyverse is a little slower than what we find in base R. The premium, the reason we use dplyr, is that things work more intuitively: it's easier to read what's going on, and the verbs make a lot more sense than some of what comes to us from base R. I don't have to worry about brackets and dollar signs and commas and all sorts of things; I can use verbs like `filter()` or `drop_na()`. So what I'd like to do is some benchmarking of these different approaches, and to do that we're going to use a function from a package called bench, so we'll run `install.packages("bench")`.
That installs bench into my renv, and I'll go ahead and run `renv::snapshot()`. Great, so now bench is installed as part of my R environment. What I can now do is create a benchmark comparing these three different approaches: I have pure base R, I have dplyr with the base R pipe, and I have dplyr with the magrittr pipe. I'm going to use the `mark()` function from the bench package (see what they did there) to benchmark these three approaches to calculating the correlation. I'll start with these two here: we'll write `bench::mark()`, with an open and a close parenthesis, and separate the two pipelines with a comma. I'll call the first one `dplyr_base`, because I'm using dplyr verbs like `drop_na()` and `filter()` with the base pipe, and here I'll set `dplyr_magrittr` equal to the other one. If I run these two approaches, I get output from `bench::mark()`. What you'll see is that `bench::mark()` takes my expressions and runs each of them a bunch of times, and it also makes sure that they give the same output, which is why it was important that the correlation output from these different pipelines was the same; and it is. What we see, then, is that dplyr with magrittr is actually faster than dplyr with base, for whatever reason, under these conditions: we're able to do 224 iterations per second using dplyr with the magrittr pipe versus 114 iterations per second using dplyr with the base R pipe. So let's now compare that to pure base R. We had something like that back up here, so I'm going to grab this code, and I'll call it `base_base`. We're going to need to massage it a little: this was the operation that got assigned to `no_na_no_zero`, so I'll take that and plop
that in there in place of `no_na_no_zero`. Now I've got my `base_base`, and I need a comma at the end of it. Run that benchmark again, and now we see that `base_base` is a lot faster than even dplyr with magrittr: it'll do 655 iterations of this correlation analysis in the time it takes dplyr with magrittr to do perhaps 254. So pure base is faster than dplyr with magrittr; of course, one of the advantages of having the pipe is that it's a lot more readable than all this gobbledygook. We have dplyr with the base R pipe; what happens if we use base R with the base R pipe? We can do that: let's grab this code, and we'll call this one `base_base`. We'll take `local_weather` and pipe it into a `subset()` call (`subset()` is a lot like `filter()`). I'll put that here, and we'll say not `is.na()` on `prcp` and not `is.na()` on `snow`, and I'll run this to make sure I get the same correlation value we had before. Ah, this `base_base` is the same name as the one up above, I recall, so I'm going to call that one `base_no_pipe`, and then we have `base_base`. So that's base R with the base R pipe, dplyr functions with the base R pipe, and dplyr with the magrittr pipe. Again I can run the benchmark, and what I find, looking at the median times, is that base with no pipe takes 1.46 milliseconds; `base_base`, base with the base pipe, takes basically twice as long, about 3 milliseconds; the dplyr verbs with the base pipe are actually the slowest of the bunch; and dplyr with magrittr is just under 4 milliseconds. So let's try a fifth approach, base R with the magrittr pipe. Hopefully you don't feel like I'm beating this over the head too hard. I'm going to take this `base_base`, call it `base_magrittr`, replace the base R pipes with the magrittr pipe, and then this needs to be a period.
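The shape of that comparison looks roughly like this. It's a hedged sketch with a made-up data frame and only base R expressions, and the benchmark itself is guarded so the chunk still runs if bench isn't installed:

```r
weather <- data.frame(prcp = c(0.1, NA, 0.5, 0.3, 0.8),
                      snow = c(1.0, 2.0, NA, 0.6, 1.4))

# rows with complete prcp and snow values
keep <- !is.na(weather$prcp) & !is.na(weather$snow)

if (requireNamespace("bench", quietly = TRUE)) {
  results <- bench::mark(
    base_no_pipe = cor(weather$prcp[keep], weather$snow[keep]),
    base_base = weather |>
      subset(!is.na(prcp) & !is.na(snow)) |>
      (\(d) cor(d$prcp, d$snow))()
  )
  # bench::mark() verifies both expressions return the same value,
  # then reports timings and iterations per second for each
  print(results[, c("expression", "min", "median")])
}
```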
So, this result really surprises me; I'm glad I did this comparison. What I find is that the base R functions with the magrittr pipe are actually faster than dplyr with the magrittr pipe, and basically the same speed as base with no pipe; it's even faster than base with the base pipe. That's a little surprising to me, and I'm sure people out there will have thoughts on it. In the long run I'm not sure these differences in execution speed matter a whole lot, but it's surprising that the bigger performance hit seems to come when you start using the dplyr functions. One thing we can do is save all this benchmarking to what I'll call `benchmark_pipe`, and then run `autoplot()` on `benchmark_pipe` with `type = "jitter"`. The bench package provides a method for `autoplot()`, and it will automatically generate a plot of the benchmarking data. The values keyed by color show the amount of garbage collection, basically how R cleans up after each iteration; we really just need to focus on "none", since those are the fastest speeds. Again, we see what we saw in the table view: base R with the base pipe is a little slower than base with no pipe and slower than base with magrittr, followed by dplyr with magrittr, followed by dplyr with the base R pipe. The final thing I want to do is take the fastest one we had, which again was base with magrittr, copy it down, and put an `_s` at the end of the name, for "streamlined", because I have two `subset()` calls here: I have one, two, three pipe characters, but I probably only need two. So let's bring this `snow > 0` condition up into the other `subset()` call, and I can remove this line 39, run all that, and then look at `benchmark_pipe` again with the streamlined version. We can see that by removing one of those pipes it
actually speeds up by about 30% or so. So if you remove the pipes and you remove those dplyr functions, things get a lot faster. Looking back at the result for base with no pipe makes me worry that I'm perhaps not doing a true apples-to-apples comparison: up here I have all this stuff going on where I'm basically building different vectors, pulling out three vectors and combining them with each other, and I wonder whether the `subset()` function isn't perhaps a little more efficient than what I'm doing in my other code. So let me refactor this. I'll copy this `base_no_pipe`, put an `_s` at the end of it, and do a `subset()` of `local_weather`: not NA on `prcp`, not NA on `snow`, and we want `snow` to be greater than zero. Let's run this... and I'm missing something; I've got something extra, this extra square bracket here. Rerun that... I have an extra parenthesis here, and I don't need that comma there. Let's try this again... and I need a closing parenthesis here, and that gives me the result I'd expect. So I now have, I think, a bit more of an apples-to-apples comparison: instead of indexing directly into the rows and columns of `local_weather`, I'm using the `subset()` function, and I wonder if `subset()` isn't actually faster than what I was doing by creating a bunch of logical vectors to index out the rows I want. Let's run this and find out. And we get the surprising result that the base R code with no pipes that uses `subset()` still runs slower than base with no pipe: where we did it as a single one-liner, indexing with those dollar signs, it actually ran a bit faster than using the `subset()` function. I'm intrigued by this result where magrittr runs faster with base R code.
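The two base R filtering styles being compared can be sketched side by side; again, the data frame here is invented, not the NOAA data:

```r
weather <- data.frame(prcp = c(0.1, NA, 0.5, 0.3, 0.8),
                      snow = c(1.0, 2.0, NA, 0.6, 1.4))

# style 1: build a logical vector and index with [ and $
keep <- !is.na(weather$prcp) & !is.na(weather$snow) & weather$snow > 0
by_index <- weather[keep, ]

# style 2: let subset() evaluate the same conditions with bare column names
by_subset <- subset(weather, !is.na(prcp) & !is.na(snow) & snow > 0)

# both keep the same three rows
identical(by_index$prcp, by_subset$prcp)
#> [1] TRUE
```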
These differences may not really matter a whole lot: this x-axis is on a log scale, so they aren't huge, and I suspect that if we created a different set of computations to test, we'd perhaps get a different result. So I think, in general, the results are that dplyr functions are slow, and pipes tend to add complexity; they tend to slow things down a bit. Where does that all leave us? Well, I think the reason we use pipes isn't so much for efficiency in running the code, but for efficiency in reading the code. While I might have something like this that, to me, is very readable, it's a lot more readable than what I have up here: with that I get a little lost, and it's frankly a relatively simple line of code. These pipelines, whether we're using dplyr or base R, the base R pipe or the magrittr pipe, are just a lot more readable; it's more understandable what's going on. I also find the dplyr functions more intuitive to use and to read. So yes, perhaps they don't perform as well, but I don't need all that performance; we're talking milliseconds here on this type of calculation. I don't need that level of performance when I can very easily do something like `drop_na()` on a single column. There is an `na.omit()`, but that's the same as running `drop_na()` without any arguments: it removes any row that has an NA value in any of the columns, whereas `drop_na()` gives me more precision. And `drop_na()` harkens back to the kinds of functions people are used to using with databases like MySQL. So again, I think the advantage of the dplyr functions isn't their performance (obviously, because we see that they're slower) but their readability and how well they work together, and to me that's a benefit.
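That `na.omit()` versus `drop_na()` distinction can be sketched quickly. The base R half runs as-is; the tidyr call is shown only in a comment since it needs that package:

```r
weather <- data.frame(tmax = c(50, NA, 60),
                      prcp = c(0.1, 0.2, NA))

# na.omit(): drops any row with an NA in ANY column
nrow(na.omit(weather))
#> [1] 1

# base R equivalent of dropping NAs in just one column
nrow(weather[!is.na(weather$prcp), ])
#> [1] 2

# tidyr equivalent of that single-column precision (needs tidyr loaded):
# weather %>% drop_na(prcp)
```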
There are people out there who are dplyr and tidyverse haters; I think they're in the minority, but realize that there is perhaps this trade-off between performance on the one hand and readability, expressiveness, and the ability for these tools to fit together on the other. Again, if you are interested in optimizing for speed, then use tools like we did here, with the bench package and its `mark()` function, and try different experiments. I think that's one of the cool things about R programming: we have tools that enable us to do these experiments. I suspect most of you probably don't care, and the difference you're going to find in performance is super minuscule. I don't think I'm going to revisit this in another episode, but if you run experiments, report your results down below in the comments; I'd love to hear what you come up with. All right, spread the word about all the cool things we can do with pipes, and I'll see you next time for another episode of Code Club.