Programming languages like R give us powerful tools to do things with data. Those ways of manipulating data are called functions. Seemingly simple things like addition, subtraction, and assigning a value to a variable are functions. Reading in a file with read_csv? It's a function. In fact, anytime you see a set of parentheses in R, you know that you're using a function. One of the things that I love about programming is the ability to write my own functions to do something new or to expand the utility of an existing function. In an earlier episode, we saw how to write our own functions. Today, we'll get more practice writing functions and learn about the ever-mysterious anonymous function. Hmm, what are those?
Hey folks, I'm Pat Schloss and this is Code Club. In each episode of Code Club, I present various concepts that I use in my own research to improve the reproducibility of the analysis. Over the past few weeks, I've been using an example from my own research interests to motivate the concepts we cover in each episode. We've been investigating the properties of exact and amplicon sequence variants, also called ESVs or ASVs, when analyzing 16S rRNA gene sequences. Please be sure to subscribe to the channel and click on that bell icon so you know when the next episode is released.
In previous episodes, I've talked about the DRY principle: don't repeat yourself. We saw it in the last episode when we used make to generate a lengthy list of file names that we wanted to build. We also saw it the last time we talked about functions. Functions are great because they allow us to package actions together so that instead of running those actions each time we need them, we can call a function to do it for us without worrying so much about how it all works. These functions can be quite complex or quite simple. Today we'll create a function that's actually pretty typical for my analyses and that will help us combine the 50-some data frames we created in the last episode.
We'll also create a unique kind of function called an anonymous function. Why is it called an anonymous function, do you think? Well, don't overthink it. It's called anonymous because it doesn't have a name. These types of functions are useful as arguments to functions that take functions as arguments when that function may only have a single line of code. Yes, I think I did just use the word function about four or five times in one sentence. Don't worry, it isn't that complicated and it can be pretty useful. Stay tuned for the rest of the episode and I'll show you how. Hopefully you'll believe me that even if you don't know what an ASV or a 16S rRNA gene or even a gene is, you'll still get a lot out of today's episode.
So we'll go ahead and go to our project root directory. I'm going to fire up Atom to look at my Makefile, scrolling down to where we were last time. You'll recall that we made these variables for the EASV tibbles, which cover both the exact sequence variant tibbles and the amplicon sequence variant tibbles. I created this phony variable. I'm going to go ahead and remove that because I don't really need it. What we're going to be working on is here, in data/processed: we want to make the rrnDB.easv.count_tibble file, and all the dependencies are going to come from that variable of EASV tibbles. What we need to do, though, is modify this code to get it to work.
Before I do that, I first want to look in my data directories and make sure I don't have any count_tibble files that are missing an indication of whether they hold ESVs or ASVs. So I'm going to first look in data/processed for tibble files. I see I've got this esv.count_tibble file, and we're going to be replacing that with the easv version. I think I have this under version control; let me double check. I'll do git rm on that. So I don't. I'll go ahead and just do rm on that. And if we then ls the tibble files in the data/v* directories, we see lots of stuff. But I'm looking for these types of things where I've got rrnDB count_tibble files.
Maybe I'll make it a little bit easier and ls data/v*/rrnDB.count_tibble. There are four things there. So I'll go ahead and replace that ls with an rm, and everything is in good shape. That way we don't have to worry about having these older legacy files lying around.
All right. So as I said, we're going to be working on the code in combine_count_tibble_files.R. I'm going to go ahead and launch RStudio and open it up with our R project file, so I'm in my project root directory. If I go to Files, then code, we want combine_count_tibble_files.R. This is here and it all looks good. You'll recall that we did this, I think, to learn about map_dfr, although I don't really remember in what context we talked about this script. map_dfr allows you to take a list or a vector of file names, or of any values, and apply a function to each of them. So we used read_tsv to read each of the tibble file names, and map_dfr then combined the results into one data frame. We generated the names of those tibble files from the arguments coming in, the prerequisites, and we used str_replace on them.
You'll recall that we had esv in the pattern, so I'm going to replace this. We want to also get anything, whether that's esv or one of our numbers. I'm going to do a period star, but I'm going to put two backslashes before the periods after rrnDB and before count_tibble. That tells R we really mean to match a period, because normally a period means match any character. Okay. So this should work for the most part. I suspect we'll still have some problems with this.
Something that we can do as a temporary workaround for commandArgs is assign to tibble_files the output of list.files. This allows me to look into my directory structure. I can then do path equals data. I guess I need to put that in quotes: "data/". Let me try "data/v*". That probably won't work. Let's see: pattern equals, and then what I want it to match.
And so that's going to be count_tibble. And then I want full.names equals TRUE. Maybe it worked. Let's see. So if we look at tibble_files... yeah, it didn't work. So to test, let's use v4 and see if that worked. That worked, right? So now we have a listing of all the count_tibble files in the v4 directory. We can use this to test things out directly in here rather than having to run things through the command line.
All right. So if I now run the names(tibble_files) line, it's not happy because I need to run library(tidyverse). Okay, that's loaded, and I can then run this. Now if I look at tibble_files, I see that I've got the regions. Something else I'll do is wrap that second .* in parentheses, which saves the match so we can refer back to it. Like we did here, we matched this first one, which was the region, so that is stored as \\1. The second one will be stored as \\2, right? And so now if I look at tibble_files, I've got the region, an underscore, and then the ESV or ASV definition. So that's good.
All right. So let's see what happens when we run this map_dfr line. It runs through and no errors are given. But I see something very interesting: we've got region, genome, asv, count, and esv. And you know what, I bet if we looked at the tail of this, that's where you have the ESVs. What's happening is that when we read in the ESV file, it has a column called esv that has the names of the ESVs. And if we read in something like the 001 file, that's got a column called asv that has the names of those ASVs. So when map_dfr tries to put these data frames together as rows, it has problems, because it can only do it well if the same columns are found in all the files it's reading in. So we could go back and give all of our count_tibble files the same column names.
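The two ideas above, labelling files with capture groups and the extra column that appears when data frames with different column names are stacked, can be sketched without touching the file system. The file names and the tiny data frames below are made up for illustration, and bind_rows is used directly here on the assumption that it reflects how map_dfr combines its results.

```r
library(stringr)
library(dplyr)

# Hypothetical file names following the layout described in the episode
tibble_files <- c("data/v4/rrnDB.esv.count_tibble",
                  "data/v4/rrnDB.005.count_tibble")

# \\. matches a literal period; each (.*) is saved as \\1 and \\2
labels <- str_replace(string = tibble_files,
                      pattern = "data/(.*)/rrnDB\\.(.*)\\.count_tibble",
                      replacement = "\\1_\\2")
labels
# "v4_esv" "v4_005"

# Made-up data frames whose middle column has different names
esv_counts <- tibble(genome = "g1", esv = "esv_1", count = 2)
asv_counts <- tibble(genome = "g1", asv = "asv_1", count = 2)

# Stacking them keeps both columns and fills the gaps with NA
combined <- bind_rows(esv_counts, asv_counts)
colnames(combined)
# "genome" "esv" "count" "asv"
```

The NA values in the esv and asv columns are the symptom seen in the tail of the combined data frame.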
But going back and regenerating those files would be tedious because, if you recall, the last time I did that it took about five or six hours to run all the data. I don't really want to do that right now. So what we'll do instead is write our own function to put here in place of read_tsv that does everything we want, including fixing the column name.
Okay, so how do we do that? Good question. We talked about how to create a function in a previous episode. I'm going to create a function called read_count_tibble. Then we say function and then the argument, the data that's coming into the function, which I'll say is file_name. Maybe we could say count_tibble_file to make it a little bit more descriptive. Okay, that will work. We can then say read_tsv on count_tibble_file. To test things, I'm going to assign count_tibble_file the name of a test file. We had those up here, so I'll grab this one for now. And don't forget to remind me to remove this line. Sometimes I put this tester line in, and then I run with map_dfr and append the same data frame over and over again, because every time through count_tibble_file has the same value. I do that a lot, so I'm familiar with it. Hopefully I'll remember. But I want to use that file to help me develop and test this function as we go along.
So I'll run those lines, and you'll see that I've got genome, asv, and count. Good. There are a couple of pieces of information that I'd like to get. I would like to get the region, so that's going to be v4. I'd like to get the threshold. And I'd like to rename asv or esv to be easv for the column name. Okay. So how are we going to do all this? The first thing I'm going to do is borrow code from down here, because we're not going to need to worry about the names anymore.
I'm going to use this up top here to get the region. So I can say region equals str_replace, and the string is going to be count_tibble_file. Let me tab this back over a notch. I'm going to remove this parenthesis and replace the whole replacement with \\1, which we saw before was the region. Now if I look at region, I get v4. So good, we're in great shape.
The next thing I'm going to do is the same kind of thing to get the threshold. Instead of saving the first slot, for the region, I'm going to use the second slot, and this will give me my threshold. What I'd like it to be is "0." followed by whatever is there, because we didn't save the "0." in the file name. We only stored the numbers to the right of the decimal point.
What I'll do first is create a variable that I'll call type. We'll use if_else, which is a function we've seen in previous episodes, on count_tibble_file. I think what I'll use is str_detect, which is a function that allows us to detect some value in a string. So if str_detect finds esv, then I'm going to set the type to esv. Otherwise, I'm going to say it's an asv. This one should come back as an asv. Well, let's see what it says. Type: asv. Good. And if we change the 005 to esv, then the type should be esv. Yes, we're good. Okay. So I'll put this back to 005, and we now have our type, which is good.
For my threshold, I'm going to say if_else: if type == "esv", then I want to return esv. Otherwise, when that's false, I'm going to return all of this. It's complaining because I think I'm missing a parenthesis. Good. So I run that, and now my threshold should be 0.005. Awesome, we're in good shape. You'll see that this has quotes around it, which tells us it's a character. I don't really care if it's storing this as a character.
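The region, type, and threshold steps described above can be sketched on a single test value. The file name and regular expression below are assumptions based on the naming described in the episode, not the exact code from the script.

```r
library(stringr)
library(dplyr)

# Hypothetical test file name; swap 005 for esv to exercise the other branch
count_tibble_file <- "data/v4/rrnDB.005.count_tibble"

pattern <- "data/(.*)/rrnDB\\.(.*)\\.count_tibble"

# First capture group: the region
region <- str_replace(count_tibble_file, pattern, "\\1")

# ESV files have esv in their name; everything else is treated as an ASV file
type <- if_else(str_detect(count_tibble_file, "esv"), "esv", "asv")

# For ASV files, put "0." back in front of the digits from the second group
threshold <- if_else(type == "esv",
                     "esv",
                     str_replace(count_tibble_file, pattern, "0.\\2"))

region     # "v4"
type       # "asv"
threshold  # "0.005"
```

Both branches of if_else return character values, which is why the threshold comes back in quotes.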
The threshold value is going to be in the same column as the value esv. So even though it's really a numerical value, I don't care if it's stored as a character, because I'm not going to do any numerical manipulations with it, right? I'm not going to multiply this by two. I just need to know what threshold was used to define the ASV. Good. So we've gotten the region and we've gotten the threshold. Now we'd like to rename asv or esv to be easv for the column name.
We're also going to build out this read_tsv line. If we look at the output here, we have genome, we have asv, we have count. I'd like to add a mutate to add region = region and threshold = threshold. And we now see that we've got region and threshold. So that's great. Maybe I'll put these two things on separate lines. That can stay up there. We can then pipe this to the next step, which is to take care of that asv or esv column.
What I'd normally do would be to say rename, with the new name first, so easv equals the old name, asv. We run that and it gets it right. But of course, if we run ESV data through, then we want that column to also be easv. So we could also do easv = esv. Now, if I run this, I get an error because I don't have an esv column in this data frame. So that's a problem. If we look at the documentation for rename, we'll see that there is a special version called rename_with, which renames the columns that we specify using a given function.
Okay. So this is where we're going to start learning about what's called an anonymous function. We will do rename_with, and it wants some function. Maybe I'll call it some_function. Or maybe I'll call it rename_function to be a little bit more descriptive. Then we select the columns that we want it to rename. We're going to select those columns that end with sv, using ends_with("sv"). Right.
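The rename problem described above can be reproduced on a small made-up data frame. This is only a sketch: try() is used here so that the failing call does not stop the script, and the data are hypothetical.

```r
library(dplyr)

# Made-up ASV counts; there is deliberately no esv column
asv_counts <- tibble(genome = "g1", asv = "asv_1", count = 2)

# Renaming a column that exists works: new name on the left, old on the right
renamed <- rename(asv_counts, easv = asv)
colnames(renamed)
# "genome" "easv" "count"

# Renaming a column that is absent is an error, captured here with try()
result <- try(rename(asv_counts, easv = esv), silent = TRUE)
inherits(result, "try-error")
# TRUE
```

Because each input file has either an asv or an esv column but never both, a fixed rename call cannot handle every file.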
I'm going to show this two ways. I'll do it with a named function, and then I'll show you what I mean by an anonymous function. So we need to create a function that we will call rename_function. It needs to take an argument x, but I'm not going to use the value of x. Instead, I'm going to return, in quotes, easv. Right. So I create that. And now if I run this, I get easv as the column heading. And if I use the esv file up here, then we'll see that I also have easv there as well. So great. Let me rerun this so that it's using the original example.
But this function is kind of silly, right? There's a lot here to simply say easv. What we can do instead is use what's called an anonymous function. Here, instead of having rename_function and defining it up ahead, I can create a function that doesn't have a name. It's anonymous. So I can write function with x as a parameter, and then for the body of the function I can say easv. This right here is an anonymous function. rename_with wants a function in this first slot, and I'm giving it one. I don't need to name the function, because no one cares. R doesn't care what the name is. It cares what the value of the function is. So instead of naming it and then plopping in the name here, I'm giving it the body of the function directly. And this is, again, what's called an anonymous function. We run this and everything looks good. If you don't believe me that this is what's happening, I'm going to go ahead and comment out that named function.
Let's go ahead and load our big function, read_count_tibble. Then we need to do some cleaning up down below here. We'll use those tibble files, and the function we'll use is read_count_tibble. I don't really care about any of these things here. One thing that I forgot to do is use col_types up above here in read_tsv.
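The named and anonymous versions described above can be compared side by side. The data frames below are made up, and rename_function is the illustrative name used in the episode.

```r
library(dplyr)

# Made-up inputs: one with an asv column, one with an esv column
asv_counts <- tibble(genome = "g1", asv = "asv_1", count = 2)
esv_counts <- tibble(genome = "g1", esv = "esv_1", count = 2)

# Named version: ignores its argument and always returns "easv"
rename_function <- function(x) {
  "easv"
}
named <- rename_with(asv_counts, rename_function, ends_with("sv"))

# Anonymous version: the same function written in place, without a name
anonymous <- rename_with(esv_counts, function(x) "easv", ends_with("sv"))

colnames(named)      # "genome" "easv" "count"
colnames(anonymous)  # "genome" "easv" "count"
```

One caveat of this design: the function returns a single name, so it only suits data frames in which exactly one column ends with sv.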
And so that looks good. I almost forgot to remove my test code. So again, I'll reload that function, and now let's give this a shot. Let's run map_dfr. It's looking good. I don't see any extra columns. We have the genome that the ESV or ASV came from, we have the name, the count, the region, and the threshold. And if I look at the tail, where we have the ESV data, that also looks good. Great. To finish this, I want to output this as data/processed/rrnDB.easv.count_tibble. We'll run that. Very good. I can go ahead and save my R script, and the file name is back to black text.
I can come back now to my Makefile, and I think everything's in good shape. I'll go ahead and do make -n on that to see what all it's going to run. You can see that we're going to run this R script as well as feed it all of the count_tibble files. We run it, and it'll take a couple of seconds. Boom, it's done. I look at ls -lth data/processed, and I see I've got my rrnDB.easv.count_tibble file. We're in good shape.
Let's take a breath. If you haven't already, be sure you like this video and subscribe. Subscribing will let you know when future episodes are released. I have big plans for this project as well as plans for subsequent projects, and I want you to know what's happening with those so you can come and watch them and participate as much as you'd like.
Great. So the one problem we now have is that all of these exploratory files down here used the esv count_tibble file as a dependency. I need to change that to the easv count_tibble file. I then need to look in my exploratory directory for all of those Rmd files. I'm going to open these up and quickly modify them to make sure that I've got everything in good shape. Let me start at the very beginning, back on the 09-09 file. I'm going to go ahead and close combine_count_tibble_files.R.
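Putting the pieces together, a minimal sketch of the kind of function built in this episode might look like the following. It writes two tiny made-up files to a temporary directory so that it can run anywhere. The file layout, the regular expression (with a leading .* to absorb the temporary directory path), and the column types are assumptions for illustration, not the script's exact code.

```r
library(readr)
library(dplyr)
library(purrr)
library(stringr)

# Two small hypothetical count files in a temporary data/v4 directory
v4_dir <- file.path(tempdir(), "data", "v4")
dir.create(v4_dir, recursive = TRUE, showWarnings = FALSE)
write_tsv(tibble(genome = "g1", esv = "esv_1", count = 2),
          file.path(v4_dir, "rrnDB.esv.count_tibble"))
write_tsv(tibble(genome = "g1", asv = "asv_1", count = 3),
          file.path(v4_dir, "rrnDB.005.count_tibble"))

read_count_tibble <- function(count_tibble_file) {
  # Leading .* skips whatever directories come before data/
  pattern <- ".*data/(.*)/rrnDB\\.(.*)\\.count_tibble"

  region <- str_replace(count_tibble_file, pattern, "\\1")
  # basename() keeps the check away from the directory part of the path
  type <- if_else(str_detect(basename(count_tibble_file), "esv"), "esv", "asv")
  threshold <- if_else(type == "esv",
                       "esv",
                       str_replace(count_tibble_file, pattern, "0.\\2"))

  read_tsv(count_tibble_file,
           col_types = cols(.default = col_character(),
                            count = col_double())) %>%
    mutate(region = region, threshold = threshold) %>%
    # Anonymous function: whichever column ends in sv becomes easv
    rename_with(function(x) "easv", ends_with("sv"))
}

tibble_files <- list.files(path = v4_dir, pattern = "count_tibble",
                           full.names = TRUE)

# Read every file with our function and stack the results as rows
easv_counts <- map_dfr(tibble_files, read_count_tibble)
colnames(easv_counts)
# "genome" "easv" "count" "region" "threshold"
```

Note that there is no hard-coded test file name left inside the function, which is the mistake the episode warns about.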
And you'll recall that we're now giving it the EASV count tibble file, I think. What we're going to want to do is filter for threshold == "esv", and then we'll also do a select(-threshold). If I run this, it's going to read in the data frame. And it's complaining that col_types should have the same length as col_names. That's because up here, we're reading everything in as character, character, character, double, and now we have an extra column for the threshold, which I think is a character. So if we run this, it's not happy. I think we fixed this in a subsequent episode, where we set a default of character and gave count its own type. Yeah. So I'm going to copy these two lines, since I think this was a better way of doing it, and pipe that. Okay. And so then we see the count tibble is exactly what we'd like: genome, easv, count, region. That's in good shape. I also want to look to see if we used asv anywhere in here for the column, because we'd like to replace that with easv. I don't see that.
Okay. So let's go ahead and look at the next file. I'm probably just going to copy this to these subsequent scripts. Again, we're going to join esv, which is that data frame. And if I again look for esv, that's metadata and esv, so I think we're good there. I'll go ahead and close these two to save some space. Back up here, we'll read this all in and again look for esv. This one needs to be easv, so that's good. And yeah, I think we're good there. So I'll close that window and fix the next one. Again, I'm looking for anything that refers to the column. Here's one: this is looking for the esv column rather than easv. And so that all looks good. I'll go ahead and save that. And the next one again: minus easv, and then easv here, for each ESV. So I think we're in good shape. One thing I really want to point out here is that we're not changing our inner_join call, because this is joining the data frames, and our data frame here is called esv, not easv.
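The filtering step used in the exploratory files can be sketched on a small made-up version of the combined data frame. The column names follow the episode; the values are hypothetical.

```r
library(dplyr)

# Made-up stand-in for the combined EASV count tibble
easv_counts <- tibble(
  genome = c("g1", "g1"),
  easv = c("esv_1", "asv_1"),
  count = c(2, 3),
  region = c("v4", "v4"),
  threshold = c("esv", "0.005")
)

# Keep only the ESV rows, then drop the threshold column
esv <- easv_counts %>%
  filter(threshold == "esv") %>%
  select(-threshold)

colnames(esv)
# "genome" "easv" "count" "region"
```

The data frame is named esv while the column inside it is named easv, which matches the naming discussed in the episode.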
It's a little confusing because the data frame is called esv and there's also a column in there that's called easv. So I'll save that and close it. And then one last file. We'll save that and look for esv. This should be easv. Here's an esv. There's one. And I think that's good. OK, so I'll close that.
I'll come back to my Makefile, and we'll build the exploratory files. So let's do make -n exploratory. We see that it's going to render these six exploratory files. I'm going to go ahead and run this. It's going to take a couple of minutes, and I'll probably edit the video to speed it up. We might run into a glitch where I forgot something. Sure enough, here: column esv is not found in the 2020-09-09 file. So let's grab that. Which lines? 78 to 84. Here we go: easv. Try it again. And it looks like in the 09-29 file I forgot to update the name of the input file. I wonder if I forgot to do that for some of the others. Let's look at 09-29. Right there. Let me open up the others to double check, looking at the top. Yep, easv. And easv. I think I forgot all of them, basically. You were supposed to remind me. easv. Wow. So these are all good now, and let's try again.
For this one, I'm reminded that I had another one of those testing lines that I hard coded in, so all the data was v4. That exploratory file actually needed v19, and if I'm only giving it v4, then it's going to complain. So I'll go ahead and delete that line and save it. This is kind of tedious. Run it again, and hopefully everything works now. Wonderful.
So it took a few minutes to run, with a few fits and starts from editing things along the way. That's what I find happens when I'm changing something about an upstream file name. But again, using make makes it a lot easier to keep track of all those dependencies and get things running as quickly as we can. So we're done, finally, with issue 34. I'll do a git status to see where we're at.
And I'm going to do git add on the Makefile, and then the exploratory files, all that stuff. git status. Everything looks good. I'll do git commit -m, and I will say "Combine count tibble files", closes #34. Close that, and then git checkout master and git merge for the issue 34 branch. Great. git push. And we've closed that issue. Finally.
So again, what we covered today was reviewing how to build functions in R. We saw that read_csv or read_tsv got us really far. But when it came to today's situation, where we're combining data frames that have different column names, we needed to add some extra stuff to make it easier to combine those data frames together as rows with map_dfr. So we wrote our own function, read_count_tibble. The second thing we talked about was anonymous functions. Again, these are functions that we are only going to use once, so we don't need to name them, but we still need the functionality of a function. In this case, it was a function that did nothing but return the value easv, and rename_with applied it to any column whose name ends with sv, using ends_with. And so we got that to work. Then finally we went through all the tedium of updating our exploratory data files, getting those all run and all situated, closing out our issue, and pushing it up.
So again, please keep practicing with us and tell your friends about Code Club. I'd love to see what kind of functions you've built. Everything in R is a function. Even the assignment operator is a function. Even addition is a function. And the great thing about R is that I can create functions and you can create functions. If there are functions that I think other people might find useful, I can package those together and make my own package, so then you can use them as well. Lots of people do that. That's one of the really empowering things about R: we can all share these functions with each other.
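The point that even addition and assignment are functions can be checked directly in base R, since operators can be called by name when wrapped in backticks. This is a small illustrative sketch with made-up variable names.

```r
# Operators are functions and can be called like any other function
sum_value <- `+`(2, 3)   # same as 2 + 3
`<-`(answer, 42)         # same as answer <- 42

sum_value          # 5
answer             # 42
is.function(`+`)   # TRUE
is.function(`<-`)  # TRUE
```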
So again, till next time, keep practicing. And we'll see you again for another episode of Code Club.