 Part of the reason that I like working through a project like we've been over the past few months on YouTube is that it really forces me to be on my best behavior. I'd like to think, or maybe I just like you to think, that I have my stuff together and that I always practice what I preach. The reality is that we all have room to improve. A few episodes back I talked about the value of having pseudocode to help separate the what from the how of programming. I mentioned that the nice thing about pseudocode is that it gives you some ready-made comments to make your code more readable later on when you come back to it after say a couple days, a weeks, months, maybe a year. Well, folks, I have to come clean. But before I go all confessional on you and I have to beg you for your mercy, let me tell you I'm Pat Schloss and this is Code Club, where we learn about reproducible practices to improve our data analysis skills. Please be sure to subscribe to the channel and hit that bell icon so you know when the next episode is released. But don't hit it too hard. You might need that mouse a little later on. Alrighty, here we go confessional time. When I write my code, it rarely has any comments. There's a lot of debate online about how many lines of comments you should have per line of code. Some people think you should code so that you don't need comments. Another way of saying what they're trying to get at is that your code should be self-documenting. Self-documenting means that variable and function names are descriptive enough so that when you read your code, you understand what's going on. Others think that you should have lots of comments. Others are just jerks and they think the number of lines of code should be inversely proportional to your experience writing the code. Jerks. It's probably somewhere in between having no comments and having more comments than lines of code. But one principle that I've read that I really like is that your comments should indicate why you are doing something rather than what you are doing. After all, the code you are writing should show what you are doing. Something I want to emphasize, however, is that we all think that our code is crystal clear when we write it. The variable and the function names, they all make perfect sense and the logic is just flawless. The reality is that a day, a week, a month, again, maybe a year later, a lot of that clarity and logic is just gone. Don't get me wrong, the code might do exactly what was desired. The problem is, we don't remember what was desired. This happened to me recently. I was thinking about what I should do next in our analysis and saw that I had this file called doodle.r, really, it's called doodle.r. I'm sitting on my desktop that had our code analyzing something about the frequency that different Amplicon sequence variants or ASVs appear in genomes. Basically, we've shown that E. coli has more than a thousand ASVs, but are those all equally likely or is there one that's dominant over the others across the E. coli genomes? Looking at the code, I realized I wasn't really sure what was going on that I had written in the code that I had written maybe only a week ago. I suspect I'm not the only one that doesn't comment enough or that has come across some old musty code that's poorly commented. For today's episode, I'm going to show you what I do when faced with this problem. It's also a common problem where we get code from somebody else that perhaps hasn't commented and we're trying to figure out what's going on. So we're going to improve the documentation of this code. In the process, hopefully we'll learn a bit more about our coding. I don't think there's anything novel here. There's no new syntax, but it'll give us another chance to see how mixing and matching different functions can allow us to answer a diversity of questions. Even if you're only watching this episode to learn more about our and don't know what a 16S RNA gene is, I'm sure you'll get a lot out of today's discussion about commenting. Please be sure to check out the blog post that accompanies this video where you'll find instructions on catching up, reference notes and links to supplemental material. The link to the blog post for today's episode is below in the notes. Let's go ahead into our project route directory and we'll open up our studio. And you'll see that my master branch there is red indicating there's something in here that hasn't been committed. And as I mentioned in the introduction, I went ahead and I copied my doodle.r file that I found on my desktop into my project and looking at it here. You'll again see there's how many lines of code, maybe 50 lines of code or so. And I've got some like conclusions listed. I've got some notes about what's going on, but it's it's not really clear what what I'm doing. Right. So I talk about like genome basis, operon basis of the kind of the motivating question. Maybe there may be there are many ASVs, but only a handful of them are dominant genomes numbers. Right. So I think I meant to have this as being self documenting, but at the same time, I'm not really sure what these things mean. Also metadata ASV is here, but where did I define that? No doubt, as we've seen in previous episodes in the exploratory data analysis, we redefined metadata ASV many times. So I'm going to go ahead and create a new R Markdown document that'll copy all the stuff into. Let me double check how what my convention has been for naming these files. So it looks like I'm using the date and then a brief title. So I'll go ahead and do new R Markdown and I'll do 2020-1102-02 and we'll say dominance. Commonness of ASVs. I'm sure there's like three M's in there. Yes, that's definitely misspelled. Let's go ahead and put in two N's just in case. I think it's just one. I don't know how to spell. All right. And we'll go ahead and click OK for that. And we've got this basic R Markdown document. I'm going to come back to a previous R Markdown document that we used from the past couple of episodes and I'm going to copy a lot of this material from where we defined metadata ASV and all my other and all my other pre-settings. I'm not sure what happened there. So let's see. I'm just going to remove all this stuff, put that in there and yeah. So let me fix the date. So it'll be 11-02 I'm patched loss. And we'll say quantifying the dominance and commonness of ASVs. All right. So that should be good. And all this other stuff from output down. I'm going to merge my two sets of headers. And I still prefer the chunk output to go to the console. It just works better for me for some reason. And again, here in this first code chunk. We have the libraries that we're going to load, as well as our metadata, the information about the taxonomy for each genome in our database, as well as a count of the number of times we see each ASV in each genome, metadata ASV then merges that information together. And now what I'm going to do is I'm going to go back to my doodle.r and copy and paste. Okay. And before I forget, I'm going to go ahead. Before I forget, I'm going to go ahead and save this. I don't know why it didn't save it as the file I wanted it to be saved as, but we're going to do it into exploratory. 2020 11-02 dominance, commonness of ASVs. And so that's saved as our markdown document. And I'm going to convert the comments I have to being text. And I'll go ahead and put in my headers for my code chunks, as well as closing those out. Right. And I think there's another one here where I do the analysis in a different approach. Close that out. And then these are my conclusions. And I think before my conclusions were what level? Third level. Okay. So I'll put three pound signs. And then this is all the conclusions. And I'll go ahead and why does it do that? Remove those comment signs. And there we go. So I had been using comments to say what we saw or what I saw previously when I was doing this. And in our markdown, then that becomes the main text. And of course, as we've seen before, our code goes into these code chunks. And as you can see, I don't really have much by way of comment in here. So what percentage of genomes have an ASV? I think all the genomes have some ASVs, right? That this focuses on the most common ASV for each species. So I kind of get a sense of what's going on. But the overall question as I put into the comment was I want to know if we look at, say, all of the E. coli genomes, what is the most frequent ASV? So what ASV shows up in the most genomes and what fraction of the genomes is that? So I'm more interested in how many genomes or what percentage of the genomes does the most frequent ASV appear in, okay? So is there one ASV that's in everything? Another way of thinking about it is, you know, so if we say E. coli has a thousand ASVs and there's like 900 genomes, so there's, you know, if there's seven copies and there's like a total of like 6,300 operons, E. coli operons in the database of that 16S gene. So how are those thousand distributed across the 6,000, right? So is there again one that shows up in nearly, that's like 90% of the ASVs or the operons in the dataset? And I think those two distinctions of what shows up in the most genomes is probably what this genome basis was about. And this of what is the most common operon or ASV across operons is this operon basis. So let's go through the code and try to understand what's going on. All right. So this first one, genome numbers, again, I'm not totally sure what this is going to output and what I prefer to do is not assign things to a variable until I'm ready. Right. And so what we're going to do as we go through here is we're going to think about how can we add commentaries about why we're doing things as well as can we perhaps improve the naming of our variables and variables, we don't have any functions we've defined, can we improve the naming of variables to give a better sense of what's going on here? Okay. All right. So now if I run this code chunk, sure enough, it outputs it to the console. I got asked it to. And so again, trying to figure out what this genomes numbers is, we've got the regions, got the species, the number of genomes and the mean RNs. So this appears to be a summary table for each species, the region, as well as the number of genomes for that species and the number of copies. So something else that I've talked about in the past is I could perhaps do filter species, a species I perhaps know something about like Escherichia Coli run that and we should then get, yeah, so we have 958 genomes for E. Coli seven ish copies per genome. Okay. So that makes sense of what's going on. So I'm not sure why I need this at this point. So what we can then say is we're going to want the, so let's say we want the number of genomes and average number of 16s RNA gene copies per genome for each species. So that's what we want. And so genome numbers, perhaps what I'll call this instead of genomes will be something like N genomes, copies, RN copies per species. That's really long, but it's really descriptive. So let's run with that and we'll have to update our code later when we see that we use it. And so this is our input. And so what we're going to do is that we want to select the region, the genome ID, the species ASV in the count, because we want to know, need the data, need the counts by region and by genome for each species. Okay. And so that's why we're pulling those out. We're going to then, we want to group our ASVs by region and species. Yeah. And we then summarize. And so when I copied and pasted, my tabbing got kind of wonky here. So I'll go ahead and fix that a little bit, make it look a little bit cleaner. And of course that filter is something I added on to make things look a little bit clearer for me as I'm testing. So we have n genomes and distinct. So this gets the number of distinct genomes per region species. You'll put that here. And then this is the average. I'm kind of saying the what instead of the, the why, but here we're, we're adding summing up the number of operons, basically, 16S RNA copies across all genomes for a species and region and divide by number of genomes. So it looks pretty decent. Can worry a little bit too much about the formatting of your comments. Okay. And then we're going to drop out all the comments. I'm sorry, drop out all the grouping. And let me go ahead and we can assign this to that variable. And so now we see that our code, at least for this initial chunk looks pretty good, it's commented and is much more explicit about what's going on. I think this n genomes, RN copies per species is more descriptive. So an RN copy is another verbiage for the 16S. The operon includes the 16S, 23S and 5S. Maybe I'll say RN copies per genome. Okay. So that looks good. This threshold, I'm not sure what that's about so far. And let's see what percentage of genomes have an ASV. This focuses on the most common ASV for each species. And again, we've called it this thing genome basis, which I'm not really sure what that's about. Maybe what I'll do is I'll go ahead and run this and we can then look at genome basis. Yeah. And it's unhappy genomes numbers not found. And that's because we changed it. So n genomes are in copies per species. And so we have this inner join here on my line 61. So if I save that and I forgot to run, I guess we needed this threshold equal five. So if I run this again, we get some information about the summarized function. Let's look at genome basis as the plot. And we see in the lower right, we get these histograms for each of the four regions showing the maximum relative abundance. And I think these are counting the ASVs or perhaps the species, the species, perhaps, yeah, we can see X is Maxwell bond, but I'm not really sure what it's counting. So we'll have to figure that out. And then down here, we have the region and the, basically the percent of species, perhaps we're the most abundant ASV or most common ASV is more than 80%. Okay, so let's see how we got there. Genome basis. And I'm going to comment that out so we don't, our studio is not tempted to run it. And so again, we want to do our analysis for species level at each region of the 16S RNA gene. And we're going to want to count the number of the number of genomes where each ASV appears. Okay. And so, and then we'll report, I think, the dominant ASV for each species. So I want to do our analysis for the species level at each, right. For each 16S. So we want to do our analysis for the species level for each region of the 16S gene. We will want to group our ASVs, or want to group our genomes by ASVs and species to count the number of genomes that each ASV appears in. Right. And so I think that gets these next two rows pretty well. And so here we are then counting the number of genomes that each ASV appears in for each region and species. Okay. Okay. So that's good. And you know, you can always add white space. White space also helps with readability of our code. Again, you kind of develop your own style. Just think of you six months from now or a week from now trying to reread your code. The other thing that we had gotten was a warning about groups. So I'll say drop. And so let's look at what this all looks like through this point. And we see that we've got our region, our species, the ASV, and the number of genomes that it was found in. Okay. So I think that's a nice variable. We can also, like I said, filter species equals extra Rikia coli. I'm looking at this. We then see there's a lot. Right. So this is the V one nine, the species, the ASV and the number of genomes that it was found in. And so we'll want to know which what, what is the percent, what which ASV shows up in the most E coli genomes. I don't care about the name of the ASV. I want to know the percentage, right? And so this tells us the number of genomes that it's found in, which is useful for E coli. And then I'm doing an inner join to bring in by my region, the number of genomes, as well as the number of copies per species. Okay. So as I'm saying this, I'm thinking, why didn't I just do this with like a group by or summarize? And so maybe we'll think about that. We could think about refactoring the code. But for right now, I really just want to figure out how does the code work, right? So bring in the number of genomes and RN copies per species. So we can scale our number of genomes that each ASV appeared in to get a percent dominance. Okay. Great. So that reads that in. And here, then we do that mutate to get the relative abundance. And maybe I'll, I'll change this to be dominance because that's more than relative abundance. It's a measure of dominance. So each ASV is going to tell us the dominance. So for example, this one with 127 is more dominant than this one with one. Okay. Well, then we want to return or we want the most dominant, the dominant. We want the dominant ASV for each region and species, which is what we're doing here, right? So we're summarizing to get the max number of genomes. Not sure why I did that. Oh, I know why. So I did end genomes, I want the dominant ASV for each region and species and the number of genomes, or let me make this like a separate bullet point, right? Want the number of genomes for each species and region so that we can filter our data. To focus on those species with more genomes than the value of threshold, right? And so if we only have one genome in our species, then every ASV is going to be dominant, right? It's going to have a dominance of one. That's not really meaningful. So we want to set a threshold, as I do here, to get some critical number of genomes, right? And so you'll recall up above, I had threshold five. And so this is set the minimum number of genomes per species. And so maybe I'll say min and genomes per species. So these names are really long, even for me. But I think that's definitely more descriptive, right? And so here, then I can say filter and genomes greater than or equal to my threshold. And again, I'll do my groups equals drop. And that should be good. Okay. So let's see. Does it run? Uh, relevant not found. And, um, right. So I changed the variable here to be dominance, right? So max, um, relevant max dominance. I'm going to say, um, dominant, um, dominance, max, I'll say max dominance. Okay. I think that works. And if we run all that, ah, I'm still getting an error. So max and genomes, min and genomes per species not found. Oh, and I probably forgot to run that up here. Now let's read, run that now it's happy. Um, and I think what I'll do is I'll go ahead and save this. So genome basis was what I had previously called it, um, not super descriptive. So let's say, um, let's rename it, uh, dominance, ASV dominance per species. And so what percentage of genomes have an ASV? So maybe I'll, what I'll say instead is, uh, dominance is the, um, percentage of genomes that an ASV is found in. Okay. So if a ASV shows up in 20 of 25 genomes for a species, uh, it's dominance is 0.80. Okay. So again, this gives flavor, it gives context, gives more. More of the why of what we are doing. Okay. And so then, um, the other problem with these long names is that I forget them. And I seem to have the short-term memory of a flea. Uh, so we want to change where we had genome basis before to ASV dominance per species. Um, we also, if we again look at the output here, have, um, max dominance instead of max relabund, we want to fill by region or bin width is going to be 0.05 running that. Um, we then see that the count is the number of species. And so perhaps we can improve the way this looks by doing something like a Y, uh, labs, Y equals, um, number of genomes. And then X is going to be, um, dominance of ASV per, uh, species. Sorry, this isn't number of genomes, number of species. And that, that looks pretty good. And so then we want to know, um, what percentage of species have a dominance, have a max dominance over 80%. Okay. So that's good. And this is going to be then max dominance. And if we sum, so if, if we do a logical and if it's true, then the numerical value of true is one, the, um, numerical value of false is zero. So if we sum up all those truths, we get ones, we divide by the total number, we get the fraction. Um, and I don't think this really helps. Um, so I'll delete that. And this then gives us a fraction. Um, but again, because, um, I forgot to group by, um, or because I dropped the groups up here, the group by went away. So I want to group by, um, I think region, right? Right. And so here we see that the V one nine, um, has very few species that are dominant, right, that have more than 80% show up in more than 80% of the genomes, whereas like the V four, V three, four and four, five, um, have more species where they're, uh, most dominant ASV is more than 80%. And then we have our conclusions here, right? Um, and great. So let's make sure we still agree among the sub regions, a majority of the species have an ASV that is found in more than 80% of the genomes for the full length sequences, the V one nine, uh, which was, um, yeah, which was this plot, right? Um, only 25% of the ASVs are found across more than 80% of the genomes. Uh, the most abundant ASVs, those that account for more than 80% of the RN copies, um, maybe I should say most common, right? Um, ASVs account for 46 to 74% of the genomes in the sub regions and only 12% for the full length ASVs. And that's from this number here, right? So 46 to 74, 12%. Uh, this underscores the problem that a single ASV is unlikely to be represented of the diversity of the ASVs within a species, right? So if an ASV was pegged, you know, so like you might imagine that these ASVs for V four are indicative of the diversity of a species. Um, but if we do it on an abundance basis, then we see that there's kind of a more skewed distribution. And certainly as you look at like the full length, uh, that, that story definitely goes away. So I'm happy with these conclusions. I'll go ahead and save this. And, um, I need to make a rule, uh, in my make file to generate this. And, um, I'm going to get status so I can copy the name of the file here. And we will open up our make file and let's make that rule. Um, so this is the, um, the dependency, uh, and we're going to make the MD file, uh, my file name is a little long and we've got the same stuff, uh, the same other dependencies and recipe. Uh, for generating the dependent or generating the target. Um, uh, and this is the file that we want to render. So that's all good. I'm going to go ahead and copy, uh, the target and then we'll make it. Excellent. And I can look at the files. Um, I could do say open exploratory, 20, 21 dominance, um, files. And then the figures, see if preview opens it for me and I can see what they look like here. Uh, their PNG files, the resolution is pretty crappy. Um, there are of course ways within, uh, using GG save to improve the resolution, but we see the plots that we had previously. And again, we could make this look prettier, but for now we're just doing this exploratory data analysis. Ah, and I noticed a chip problem that I have dominance of ASVs per species rather than commonness. So let me fix that. All right, I'll save that. I think we'll be good. I'll quit our studio, remake it. All right, let me like a double check that that took. So dominance, commonness, we're good. Um, get status. I'm going to remove my doodle file because that's not necessary anymore. And then I will get add my make file. Oh, I never actually moved my issue branch. So let me go ahead and do get branch issue 32 double check. That was the issue number 32. It's probably silly to do this at this point. Um, but whatever for keep, keep the practice of having separate issue branches for each issue. Um, and I'll do get add make file exploratory, 20, 20, 11, uh, put a star to get all those when you use the star, it's really important that you run get status so you know what files you're committing. Sometimes people will use star or they'll do get add all and they're adding a lot of stuff that they don't want to add. Um, there's lots of headaches and heartaches over people being overzealous in what they commit. So that all looks good. So I will then say get commit dash M, calculate dominance and commonness of, um, ASVs for each species and region closes number 32. That's good. Get check out master, get merge issue 32. We're good. Get push. And as we've seen, uh, we'll come up and we will close out the issue. Again, what I hope you take away from this episode is that it's really important to comment your code so that you send little notes to yourself of why you're doing things that you are in your code. Also think about the readability of your code and whether or not your code is truly self-documenting. Sometimes people use variables like X or I or J and those are not descriptive, right? They don't tell you anything, um, other than that you're maybe lazy and you like variables with one character. Um, try to have descriptive names. And as we saw, yeah, maybe your names can be too long and get too descriptive, but if you come back to the code in a couple of weeks, which you most certainly will, then, um, it helps to have that comments, those comments, it helps to have those descriptive names for your variables. So anyway, I would encourage you to go back, look at your code, see how well you've commented some things, you know, perhaps go back to some dusty old piece of code, see if you remember what it did, and then do what I did here and kind of go line by line through your code, trying to figure out what it did. Um, you know, I kind of identified maybe one or two places in here that now that I've thought about it for two weeks, maybe I do things a little bit differently and I could go back into the code, refactor it and, and perhaps look at things a little bit differently. Anyway, I encourage you to adopt this practice. I will do my best as we go forward to do a better job of commenting my code, um, but I fully expect to slip, um, and I expect that, you know, you're not perfect either. None of us are. We're all trying to get better as we do this. And so that's why I think that these videos have been really important for people is to help show them other ways of doing their coding practices. And as we learn more, hopefully we'll get better. Hopefully our analysis will get more reproducible and more robust. So keep practicing. Please tell your friends about these code club episodes, uh, like and subscribe to the channel. We'll see you next time for another episode of code club.