 Welcome back for another episode of Code Club. I'm your host, Pat Schloss, and I'm leading you through a series of episodes where we do a reproducible data analysis. In the last episode, we started doing an exploratory data analysis using R. We did that in a file called an R script, and we know it's an R script because the extension is dot R. As we discussed at the end of that episode, we're going to be developing that analysis further to look at different taxonomic levels and to loosen our definition of what we mean by an Amplicon sequence variant. By the way, if this is all gibberish to you, don't worry, there's a lot of value if you stick around. But what we saw with that R script is that it wasn't an easy way to record the results in the script. We could copy and paste the output from the R console to the R script, and then insert a series of comments. But that really gets kluji as the code develops and as it changes, and it really doesn't lend itself very well to incorporating figures. Alternatively, we could also create a word or Google Doc file that integrates our code and figures, as well as a narrative to describe what's going on. I've done that as well. But if I want to change an upstream parameter or an upstream data file, I'll have to edit the document and make sure I've updated everything that has changed upstream. Again, this just gets really kluji and cumbersome. So we need a better solution to these. And so there's a tool that's been created in the last few years called R Markdown. It's gotten a lot of support from R Studio, and it's been really well-developed. In earlier episodes, we've talked about Markdown, which allows us to indicate things like bold, italics, headings, and so forth, using light formatting within a text file. Obviously, we've also talked about writing R code to do our analysis. R Markdown allows us to take the text and the code and marry them together, along obviously with the output. We can use R to do the coding, or we can really use any other language. I'm going to focus on using R, because I know R. But you could also use Python. You could use Bash. You can use other languages. So using the R Markdown package, we can use R to render an R Markdown file, denoted by the extension RMD, to generate a Markdown file, an HTML, or web file, a Word file, or even a PDF. And there's a lot we can do with these R Markdown files. In the past, I've used it to create better Readme files, as we'll do in today's episode, to document our data analysis, kind of like a series of blog posts about our ongoing data analysis, to create websites. If you've looked at my Riffamonus website, it's one of the tutorials on there, all of those tutorials are written in R Markdown. So the output that you see is generated by the code in the tutorial. I've also created an entire slide index for semester-long courses. And over the last few years, my lab has used it to generate our manuscripts. So we've already talked about using Markdown. In this and future episodes, we'll see how we can use R Markdown to make increasingly sophisticated documents like a manuscript. But we need to start slow. So today, we'll make a better Readme file, and we'll also rewrite our R script from the last episode to be an R Markdown document that describes what's happening in the code, as well as reveals the output of the code. By the end of today's episode, I'll hope that you understand the difference between what a code junk is and what inline code is and why we might want to use one over the other. We'll also see how to generate Markdown in HTML files as output from our R Markdown files. Finally, we'll learn how to tell R where to put the output file. Specifically, we will want it in our exploratory directory. So there are actually two issues that we're going to file and solve today. The first issue is I want to update my Readme.md file to reflect what I'm using on my computer. So I'd like to replace the version numbers of my R packages with R code so that the version numbers are dynamic. What we'll see is that if I update, say, the tidyverse package, it will automatically update the version number for the tidyverse in my Readme file. So that's one thing. You can put in these little checkboxes. And then also what I want to do is to output session info. I believe I showed this in the last episode or perhaps the previous episode to show what packages and versions I have running on my version of R. Preview that to make sure it looks good. We're great. So we'll submit that new issue. The other new issue that I want to do is I want to convert my R script from 2020, 0909 to a R markdown. What I'd like to do is add narrative to my R script and generate a markdown version of the document that I can read on GitHub. So this might not make sense right now, or you may not know how we're going to do it. But this is what we're going to work on for today's episode. All right. So again, returning to the issues, we're going to open up a branch for issue 21. Get branch issue 21. Get checkout issue 21. And we're on branch issue 21. Returning to R Studio, what I'm going to do is I'm going to go up in the left corner and you'll see under that white file, white square with the green circle and white plus sign to create a new R markdown document. And for now, I'm going to leave all the defaults here. I will be updating this as we go along here. So you don't actually have to do these steps to create a new R markdown file. R Studio has some bells and whistles to make things easier. What I'm going to do then is open up my readme.md file that you'll see down here in the bottom right corner. This is my title for my project and also the title that I want to be including in here in our markdown document. I'm the author. This date is good. This is the date that I'm recording this video on. I think it'll come out on the 14th on Monday. And I'm going to go ahead and delete all this other code. We'll talk about many of these things as we go along today. And I'm going to go to my readme file. And I will then highlight all this text and copy it, paste it into my untitled one file, which I need to save. And I'm going to save this as readme.rmd. rmd is the extension people use for our markdown documents. I'll save that. Now, this basically has the same information that was here in my readme.md file. I'll go ahead and close the md file for now. And what I can do is I can click knit. And when I click knit, by default, it's going to generate as output an HTML document. So I'll click knit. And you'll see the output has my title. And if I make the screen a little bit bigger, my name, the date, and a variety of different bits of information that were all in my readme file, which is excellent. So one problem with this is that HTML is great because it's viewable on the web. But if I put a HTML file on the GitHub, it's going to render it as HTML code, not the output of HTML. So instead, what I'd like to do is render it as markdown. And so what I can do, instead of HTML document, what I can do is GitHub document. And if I save that and knit, it will actually generate the GitHub document, as well as a preview here in HTML. And in fact, if I look down here in my lower right corner and I sort these by the date modified, I'll see that I've generated an rmarkdown file and an md file. I thought it should have produced the HTML file, but it doesn't seem to be producing that as the preview. So maybe we'll see that pop up later. Anyway, so now this is outputting it as a markdown file, which then wrote over our previous markdown file. All right. So so far, we haven't really gained anything over our markdown file that we previously had been working with. Down at the bottom here, I'm going to create a new heading. And I will call this my computer. And in here, I'm going to create what's called a code chunk. So in this readme file is a bunch of text, right? It's all text. There's no code. But if I wanted to put in put a chunk of code in here to produce output, I can do that with three back ticks. And the back tick is the key above the tab on your keyboard. It's on the same key as the tilde. And then I can do the curly braces. And inside of that, I can put r. And then I can close the code chunk with another set of three back ticks. I can then do session info as a function and put the r code into this code chunk, right? If I run this, save it and knit it, and then scroll to the bottom, I will see that it generated ran the command session info. And it inserted the output of running session info on my computer, right? And that way then, if it's in the readme and you're wondering what all I have or what operating system I'm using or what version of R I'm using, it's here in this output from session info. Now, something I noticed that's not in here are the tidyverse, right? And so these are packages that we know that we've already used because we have them listed here in our dependencies. What I could do is up here, I could say library tidyverse and library data.table. And obviously I'm using library r markdown. So one of the nice things that RStudio does is that it automatically runs library r markdown for you and it comes installed with it for you. It installs it for you. So if I click knit, again, I will see down here at the bottom that it ran the library tidyverse and then it gave us all that output that it normally gives us. Again, for library data table and then it runs library r markdown. And then what you'll see down here are additional packages that have been loaded. So now we see data table and we see tidyverse along with all the packages that are part of the tidyverse. Now, I'm not such a big fan of all this output being spat out to my document and what we can do inside of this curly brace with the R is put a comma and then do message equals false. And what that will do is that that will tell our markdown don't output any messages that come to the screen. If I again knit that and scroll down here, what you'll see is all that information about conflicts and whatnot from tidyverse and data.table are gone and that they are now, it's much cleaner and that we again see those packages are attached and enlisted in here, which is great. So this is a code chunk and we'll see code chunks more when we move on to updating our R script. Another cool feature of our markdown is the ability to put inline code. So you'll see here that I have R version 4.0.2 but perhaps I'll update my version of R and I would like to get, have this number updated or if I update tidyverse or data table that I'll update these other things, right? I also mentioned that I'd like to include in here our markdown, but I don't know what version it is. I guess I could dig through it, but that's kind of tedious, right? I'd like to do this programmatically. What we can do is we can make inline code and so if I replace that line where I had R I can put in single backticks R and this tells our markdown, run what's in these backticks as code and then insert it as text. And so I can type R version.string and if I came down to my console here and did R version.string, I would get this string as output and so it would then insert R version string here again because it sees the backtick R. If I knit that, you'll see down here in my dependencies I have R version 4.0.2, the same output I had here. We'd like to do the same thing here for tidyverse data table and R markdown. I can do that using the package version function but I messed up here. I've got to put this in backticks and I've got to include the R, so package version and then as the argument to package version I'm gonna put the name of the package and I need to put that in quotes. So I'll do tidyverse in quotes and I'll copy this down and I will instead of tidyverse I'll put data table and R markdown. Now if I save this, what we should get is in the parentheses and after the V the version number for these three different packages. Let's see, what do we get? So we get 1.3, 1.13, 2.3 for our version of R markdown and that's pretty slick. I'm kind of surprised as I look through the R markdown document that didn't yell at me and that's because these packages I guess they're installed but they're not quite loaded and so what I'd like to do to be a little bit cleaner is to instead move these library calls up ahead and I will make a code chunk to do that. I'll do that up here before dependencies. So I'll do R and then the three back ticks and I'll include those three library calls. Again, I was surprised it didn't complain because those packages weren't loaded but perhaps package version doesn't require the package to actually be loaded with a library function for it to get that value. I also, so I can move that message false back up to here and so I can name my code chunks after the R but before the comma, I could put library calls to give my code chunk a name. I could call this session info. These are pretty basic names for these chunks but it does help to organize because you can see down here, RStudio is nice in that it gives you the outline and structure of your document along with the different code chunks that are in there. Makes it easier to navigate. I also don't necessarily wanna see that I've called the libraries, the library commands. What I can do to, so let me just run this to see what it looks like, knitting that. So I see these library calls. Again, message equals false was called as part of defining the code chunk so I don't see all that output from tidyverse and data table. I do see the good versions and then I see session info as cleaner down here. What I'd like to do is because I don't really wanna see these library calls, what I could add is echo as an argument and say false and what echo does is it will echo as the code. So if I say echo equals true, it shows the code, echo equals true is the default, echo equals false would turn that off. So if I go ahead and knit that and scroll down here in my document, I see that that library section has been muted and it runs the code, but it doesn't show the output. It doesn't output the code itself, right? And we could do the same thing here to not show session info. Maybe I'll go ahead and do that so I can say echo equals false. And again, scrolling to the bottom, I see all the output from session info but I don't see the command call. I think it gives it a little bit nicer, cleaner look. It all depends on how transparent you wanna be in showing people how you generated the output. And I think for this readme file, what we've got here is perfect. And we'll go ahead and save this. Now I'd like to keep this under control of my make file and at the bottom here, at some point I should come through and organize things. I'm gonna have readme.md as the target and readme.rmd as the dependency, obviously. And to run this, the recipe will be r-e and that hyphen e tells our run what follows from the command line. And we'll do it in quotes. So we'll do library r-markdown because we have to load that library before we run it. And then we wanna do render and we'll do readme.rmd. Now something that occurs to me as I do this is that I tend to use double quotes a lot but I've got double quotes nested inside double quotes. So what I probably wanna use instead on the inside would be single quotes. And so what this will do is do library r-markdown and then render that. And if I go ahead and save this and I can then do make readme.md, tells me it's up to date. So I need to force it, so maybe I'll remove a line here, save it and then run that. So it's gonna run it and generate the files and let me look at what has changed. And I see that I have this readme.md file and that it also generated this readme.html file which it says is the preview file. I'd like to turn that off because I don't wanna keep track of the HTML files as we go through. And what we can do is we can change this header and this header is called yaml, y-a-m-l. And I can then do preview underscore html equals false save that and that should turn off the production of the HTML file, run this again so it's not happy and I realized that I used preview html and what I really want is html preview. Let's give this a shot. Great, that went through, that worked. Again, if I do lslth, I can see that the html file is older than the markdown file so it didn't get created. I'm gonna go ahead and remove that and I'll go ahead and that's all the changes I wanna make to my readme file. So I'll go ahead and git add make file because we added that as a target. Readme.md and readme.rmd, git commit and then we'll say make readme a dynamic file and git checkout master, git merge, but it's complaining at me. Yeah, so it thinks I deleted the files or something or been moved. So I wanna close that and let's see, we'll do git merge issue 21 and we see we've got these three. I forgot to say that this closed issue 21. Those of you that have been around a while know that I do that frequently. Let me see if I can do git commit amend. And then I'll say closes number 21. Get status, we're good and I'll do git push. All right, and if I look at my issue 21, I did both of these things, I'll check them off. What's the point of having a checklist if you're not gonna check things out? All right, so let me come back to code. And now if we look at my readme file, which is here actually, I see that I've got my updates and then I've got my output here. And so GitHub does some nice formatting for output of code. It gives it this kind of subtle gray background and we can scroll to the right if things go off the side of the page. I think this episode's gonna go a little bit longer than the past few episodes, but I think it's important to show the evolution of these R Markdown documents and get to a point where in the future, when we do these types of exploratory data analyses, we can write them in our Markdown documents rather than as a kind of vanilla R script. Okay, so back here at my terminal, I'm gonna go ahead and create the issue for issue 20, the branch for issue 23. So I'll do git branch issue 22, git checkout issue 22, great. And you'll recall that if I do git status, the one thing we haven't tracked is this R script. What I'm gonna do before we go too much further is go ahead and rename this to be an R Markdown document. So I'll do git move and then I'll change it from a .R to MD file. And if I look, I'm gonna go ahead and open up both my readme file as well as my new R script, which isn't much of an R script, it's an R script and an R Markdown document, I mean. What I'll do is I'll go ahead and copy this header and put that at the top of my code here. I'm gonna change the date to the ninth and I'm gonna change the title to analyzing the sensitivity for discriminating between genomes. That all can stay the same. And I'm gonna go ahead and make an R code chunk for my library and I'm gonna turn off messages. So that will generate. And so what I'm gonna do is I had comments in here and I'm gonna turn those into narrative text as well as headers. And I'll use three pound signs to make it kind of a third level hiding, not ginormous. And so we'll ask need to determine the number of RN operons across genomes. And so we'll say our analysis will use full length sequences. And so we'll read this in. And so that's the code chunk we'll use to read the full length. And so then what we'll say is can we, if we want it to count and plot the number of copies per genome. We'll make that a code chunk. Again, we define the code chunk with these three back ticks and the curly braces for R. Maybe I'll give it a title of N, RN. And then we'll close it with those three back ticks. And I think that actually looks pretty good. And I'll go ahead and knit this and we'll see what the output looks like. And it tells us that this does not exist in our current working directory, which is exploratory. Now, if I go to my console and do get WD, we saw this before that we're in our project root directory, but unfortunately a quirk of our markdown is that it uses the directory that the file is in as the working directory. So to solve this, we need to use a package called here. And I'm gonna do library here. And one of the things I've noticed is if you put library here here and you don't have here installed, you'll get a message across the top here saying, it's required for this script, you want to install it. And so that's a really nice feature of RStudio. Of course, you could also come over here to packages on the lower right here and install it there as well. So that's great. What here does is that here, the package, will look for the Rproge file in your project. It will use that as the root for your project. And if you look at files, there's no Rproge file here. So if I go up a level, it says, ah, your Rproge file is here in Schloss R and analysis. And so this is gonna be what it considers the root of the project. Now, I can go ahead and wrap any of my paths with the here function. And if I run this and knit, that error message will go away. And what we'll see is that it ran library here and then it reads this in, right? And so everything looks good to this point. We see that it's inserting markdown for the plot. So maybe I'll turn that off so we can see what it looks like. But we see the type of output we had before from the last episode. Let me go ahead and turn off this HTML preview part and re-knit it so you can see what it looks like inside of the document. When it's rendered up on GitHub, it will show the plot. And so we can see the plot here as well in the output. Which is great. And I'll add some commentary. Sometimes people will send me the output of their R Markdown documents and it's just code, right? It's a wall of code and output. It gives me no context of what's going on. I'd like to get some text in here to give me some interpretation. So we see that most genomes actually have more than one copy of the RN operon. I wonder whether those different copies are the same sequence slash ASV. Which is the next section, right? So determine number of ASVs per genome. We'll say considering most genomes have more than one copy multiple copies of the RN operon. We need to know whether they all have the same ASV. Otherwise, we run the risk of splitting a single genome. I'm gonna just ignore my typos for now. A single genome into multiple ASVs. I can't ignore it. All right, so again, we're giving some text and context the results. And again, I'll use my code chunks. You could name them if you want. I'm trying to kind of quickly go through this so we're not spending too much time watching me type and insert all sorts of crazy typos. And what we found in running this, again, if we scroll down, was that that's the operons, that we found that the number of ASVs actually increases out of a rate of about two ASVs per three RN copies. So surprisingly, or not, the number of ASVs increases at a rate of about two ASVs per three copies of RN operon in the genome. RN operon in the genome. Again, it gives us some context and some interpretation of the data. The next question we were interested in was determine whether an ASV is unique to genomes they are found in. So we might have multiple ASVs per genome, but are those ASVs found across genomes, right? Are they specific to those genomes, right? And so instead of looking at the number of ASVs per genome, we want to see the number of genomes per ASV, right? So we're gonna flip the question, if you will. Add that R. It's really nice that RStudio throws in spell checking for us. And what you'll recall from last time this number sticks out is that we see that with full length sequences, the number of genomes that an ASV, that as long as 82% of the ASVs, were unique to a genome. Now this number 82 kind of bugs me from like an aesthetic perspective. We won't do this here, but in future episodes, what we could do is replace that 82 with actual R code to calculate the value from this code chunk. And that turns out to be really nice, say when we're writing papers, or if the data changes, we don't have to worry about updating every single number. And again, if we look at the output here down towards the bottom, we again see 0.824 and our text is there, okay? So we'll then say, does the sensitivity and specificity change if we at a shorter region? So we suspect or we know that the V4 region is actually less diverse than the full length sequence. So does the number of ASVs per genome differ than for full length? And does our ASVs as specific when using the V4 region compared to full length sequences? Okay, so those are two questions. And so we'll take on this first one in this first code block that was much like what we had before. And so what we find is that the number of ASVs per copy of RN operon is lower than for full length. We find 1.5 ASV per copy of the RN operon. Let me just double check that's right. Again, we could insert codes. We don't have to worry about, are we getting the right value? So rendering this, ah, we get the same problem again where it can't find the file. So I forgot to wrap my path in here. That here function is really slick. I'm really grateful for it. If you didn't use here, you'd have to use these dot dot forward slash and the relative paths would just get really, really messy. All right, so if we scroll down to the bottom, we find that, let's see. Yeah, so was it one and a half? I guess it was like one and a half per 10, right? 15, two? Yeah, so maybe per 10 copies of the RN operon. Great, so we can update that. Again, it would be much better to insert the code to calculate those numbers, but that would get us deeper into R than I really wanna go today. So next, let's look at the specificity of an ASV for a genome. We'll create that code chunk, output that. And I believe what we found was that it was about 76% of the ASVs were unique to a genome. And so yeah, so we found that 76% of the ASVs were only found in one genome. All right, so about at the V4. Now we talked about a few things in the last episode that we could do with this. And so I'm gonna call this as like to be determined. So can we correct for overrepresentation? We might do this by consider analysis at species, genus, family, ETC levels. So instead of looking at the genome level, which isn't really the level that I think even like the hardcore proponents of Amplicon sequence variants claim, they think it's more at like a species level. We might also consider looking at a more broad definition of an ASV. So what we've done here is really exact sequence variants, but again, even the hardcore proponents of ASVs or ESVs aren't really looking at exact sequence variants. They allow a little bit of slop in their definition of an ASV. So these might be things we do in the future. So I'll go ahead and save that and knit it. And this then results in our final document as we have it. And so again, what's really nice is that we've integrated text and code to give a pretty pleasant output of what we're doing. And you could take the HTML file, share it with your advisor or with your collaborator and they can see the code, they can see what you're doing. But for my purposes, I'm gonna put this up on GitHub. And so I'm gonna say HTML preview false and save this. And what I'm gonna now do is put this into my make file. And let me get my paths. That's the RMD file. And the output of this will be an MD file. So that's one dependency, that RMD file. The other dependency is gonna be data V19, RMDB, countTibble, as well as data V4, countTibble. And we're gonna largely do the same thing we did above here. Again, remember that we need tabs here. So maybe I'll put another tab in to make it clear that that's not the recipe. And I'm gonna replace readme.rmd with the name of my readme, my RMD file. So that looks good. And I can now do, if I've saved this and I've saved my, our markdown document. Yep, that's saved. I can then do make that and it complains. Why is it complaining? Let me see if I've got, yeah, I've got a tab there. I think the YAML prefers spaces. Is that the problem? Yeah. So sometimes when you're copying and pasting, tabs get inserted instead of spaces. This looks good. It creates our output, sendSpec.md. And indeed, if I do get status, I'll see that I've got an RMD file, the output MD file. I don't have the HTML file, which is good. And then I've got a new directory that actually has the images that are being generated through this for the plots. So let me show you what that looks like. So LS that, and I see that there's a figure here, GFM. That's a directory also. And so you'll see that there's three directories in here. And if I had named my code chunk, then you'll see that we would give the PNG files, these image files would have those names, but big deal. If I go to Atom and I look at my exploratory file, the MD file, with the markdown packages installed, I can do control shift M. And sometimes it gives this error, but if I hit control shift M again, it will then give me a representation of what the output looks like, right? So this is largely what it will look like on GitHub, which is pretty slick. Control W to close that window. And I'm gonna go ahead and commit all this stuff. So I will do get add make file, and then exploratory, and then 2020, zero nine. And I'm gonna put a star at the end of that line to commit all three of these files and directories. Get status. I see my various files, the new files, as well as the modified make file. And I can do get commit dash M, and I can say convert R script to RMD, to view on GitHub, closes number 22. Get checkout master, and then merge the issue, and get push. And if this worked, again, I'm getting this complaint. I don't wanna, yeah, I'll close that file. I'll look at and see that it's closed my issue. And if I come back to code, exploratory, and then my MD file, I see it's rendered with images in tow as we saw it from our studio, which is pretty slick. Again, we've got the code, we've got the context, we've got the output. This is just really wonderful, and ultimately leads to a much more reproducible analysis where you can see the code that was used to generate the data. If I update the data, all this will also get updated because it's under control of my make file, all right? So this is where I want to stop us for the day for today's episode. Thanks so much for hanging out with me. I know this one went a little bit longer than the previous episodes. I've been trying to keep them short, but there's just so much good stuff, and this is such a powerful tool using R Markdown. I've really been eager to share it with you. If you like this content, and you want to see where we take this going forward, please be sure to subscribe to the channel, click on the bell to like it so you're knowing you know when the next episode is released, probably in a couple of days. Also, please tell your friends about what we're doing here on Code Club. I'd love to get more people involved. I'd love to get more input. So please leave a comment below if you've got any questions. I'd really encourage you to think about could you create an R Markdown document using what I've shown you already today to generate a figure from your latest paper. I think that would be a really powerful and compelling use case to see why you might want to use R Markdown. So give that a shot. If you run into any problems, feel free to let me know, we'll see if we can work through them together. Until next time, keep practicing, and we'll see you for another episode of Code Club.