Hey folks, last year I was working on a paper, now posted as a preprint, that looked at the effect of removing rare sequences from data sets and what that would do to the inferences we might draw from microbiome data. If you don't know what that means, don't worry, stick around; I think you'll still get a lot out of this episode. Now, fast forward more than a year, and I want to recreate a figure that I included as a supplemental figure in that paper. I'm talking about this figure here, in which I have 12 different data sets arrayed along the y-axis, and then, on a log10 scale along the x-axis, the number of sequences in each sample from those data sets. You'll notice that some of the points are black and some are red; actually, they're transparent black and transparent red. If a sample was red, it was removed from that study when I analyzed it. I want to recreate this figure because I have another project in the works that uses these same 12 data sets. Sure, I could copy and paste over the code, but I want to see: if I were to write the code to make the same figure twice, more than a year apart, how similar would my code be? So that's my goal for this episode. I want to recreate this figure and then look back at the original code and see how much has changed; again, to think about how my understanding of how I code to make figures has changed over a year and a half. All right, I'm doing this work in Visual Studio Code, on a high-performance computing cluster here at the University of Michigan called Great Lakes. I can use RStudio, but it's a little clunky to do that on a remote server, so I typically work in Visual Studio Code when I'm working up on the cluster.
And so you'll see that I've got my code in the top part of the window, and R running in the bottom part. One of the nice things about this is that I can do something like 2 + 2, hit Command-Enter, and that will run the code down here in R, right? So that's pretty cool. It's not quite as slick as RStudio, but it's not bad. Thinking about the figure I want to generate: on the x-axis I have the number of sequences, and on the y-axis I have those data sets. So if I come over to my bash shell, I have my data in a directory off my project directory called data, and if I do ls data, I see my 12 different data sets in there. If I then look in each of those directories, there are a bajillion different files. The first of the two files I'm interested in, let's see, let's do ls data/mice/data.group_count. This has the number of sequences in each of the samples. If I do head on that file, I see my mouse samples right in the first column, and in the second column I have the number of sequences. This pattern, data/ followed by the sample or study name like mice, followed by data.group_count, is going to be really important to me, because that's going to tell me the number of sequences in each of the samples for each of the data sets. So I'm going to copy and paste that up here into my R code. And if I do ls data/*/data.group_count, that star will match all of the directories within the data directory. This gives me 12 different group_count files which, for each of those 12 studies, have the names of each of the samples as well as the number of sequences in each sample. Cool. The other thing I need to do is track down the file that has the removed samples, right?
So if I do ls data/*/ and then, I know it has accnos in it; I forget exactly where the accnos falls in the name. What you'll find is that I now have those 12 files, and it's data.remove_accnos. Great. And again, if I look here, I've got this mouse sample, so let's see what that looks like. If I do head on that, I see, what, six... it's showing me 10 lines, because head only shows the first 10 lines in Linux. If I do cat, it shows all of them, and I think there are maybe 12 samples getting removed here. Those sample names are in a column, which is really good to know. So I want to get a data frame that has the group counts, as well as the names of the removed samples, and then I'm going to want to concatenate those 12 data sets together, one for each of the different studies, right? Cool. So let's start by getting the group_count data read into R. I can use glue to make the paths to these files, and I need to make sure I also have library(glue) loaded here. What I can then do is grab this, and I'm going to call it group_count_files, right? I'll plop that in there, and instead of mice I'm going to put datasets in curly braces, and then I need to put this inside of the glue function, right? Now it's complaining because I haven't loaded any of this stuff, so let me load all that. If I look at group_count_files, I see the 12 different group_count files, which is great. Now I also want to get the removed files, so I'm going to call these remove_files, and here the file name ending isn't going to be group_count, it'll be remove_accnos. And again, if I look at remove_files, I see those different paths that I want.
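A minimal sketch of those two glue() calls; the three study names here are a hypothetical subset standing in for the 12 real directories under data/:

```r
library(glue)

# Hypothetical subset of the 12 study names
datasets <- c("bioethanol", "lake", "mice")

# glue() is vectorized: {datasets} is filled in once per element
group_count_files <- glue("data/{datasets}/data.group_count")
remove_files <- glue("data/{datasets}/data.remove_accnos")
```

Each result is a character vector of 12 paths in the real project, one per study.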
So I have these two vectors of file names: the group counts, and the samples or groups that need to get removed. Actually, I think what I'll do instead is think about this a little differently. I'm going to comment these out for now and just start with a vector that I don't have to name; I don't need to name the vector of values that contains the file names, right? Again, if I run that glue function, I get all of the different group_count files. And if I create this as a named vector, where each element in the vector has a name, then I can use map_dfr to iterate, or map, read_tsv over each of the values of my count files here, right? I can use some of the cool aliases from magrittr to do that: I can pipe into set_names and give it datasets. But to use that, I need to add library(magrittr) and make sure I've got that loaded. Then I can run this, and that sets the names. Now I can pipe this into map_dfr. map_dfr takes a function and maps it over the values of the vector coming in; each value outputs a data frame, and map_dfr concatenates those together into one big data frame, row-wise. The first argument is going to be the period, the incoming data, and the argument name there is .x, right? So I'll leave that in to make it explicit what's going on. Then I'll do tilde read_tsv and give it .x in the file name position. Now, my files: if I take one of them, like data/stream/data.group_count, and do read_tsv on that file, what I find is that the columns don't have names, right? So I need to give them names, and I can do that with col_names, assigning a vector: I'll say group and n_seqs. And I'm also going to assign the col_types.
So I'll do col_types, which takes the cols() function as its argument, and I can do group = col_character(). I know that in one of my data sets the group names are all numerical, and if I don't set group to col_character(), I'll have some data sets where group is a character and some where it's a double. If I say group = col_character(), then for all of the files that group column will get read in as a character. Then we'll do n_seqs = col_double(). map_dfr will also take an argument, .id, to create an ID column corresponding to the name of each element in the vector, right? So, the data sets. I'll set .id = "dataset". Now if we run this, we should get a three-column data frame, and sure enough, we get dataset, group, and n_seqs. That's great. If I do count on dataset, this should give me a 12-row data frame, and sure enough, we get our 12 rows. Now we have read in all of our counts. I'm going to call this group_count, and we'll run this so that it's stored in memory. I'll go ahead and delete this temporary file. I think I'm going to do the same thing for the remove files, right? So let's copy most of this down, and instead of data.group_count, we'll put in remove_accnos. We don't have n_seqs here, but we do have group, so we'll leave that, and I'll remove n_seqs from col_types. And yeah, that should look pretty good. Now we should get a two-column data frame with the name of the data set as well as the groups that need to be removed, and sure enough, we now have the dataset and the group. You'll see we're not starting with bioethanol, because bioethanol didn't have anything that was getting removed, if you recall, right?
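Putting those pieces together, here's a self-contained sketch of the set_names() / map_dfr() pattern. The two study names and the tiny stand-in files written to a temp directory are invented for illustration; in the real project the paths point at the data/ directories:

```r
library(tidyverse)
library(glue)

# Hypothetical two-study stand-in for the 12 data sets
datasets <- c("mice", "lake")
root <- tempdir()

# Write tiny headerless stand-in group_count files so this sketch runs
walk(datasets, function(d) {
  dir.create(file.path(root, d), showWarnings = FALSE)
  write_tsv(tibble(g = c("s1", "s2"), n = c(1500, 98000)),
            file.path(root, d, "data.group_count"), col_names = FALSE)
})

group_count <- glue("{root}/{datasets}/data.group_count") %>%
  set_names(datasets) %>%                 # names become the .id column
  map_dfr(~ read_tsv(.x,
                     col_names = c("group", "n_seqs"),
                     col_types = cols(group = col_character(),
                                      n_seqs = col_double())),
          .id = "dataset")
```

Because the incoming vector is named, .id = "dataset" adds a first column holding each file's study name, giving the three-column dataset / group / n_seqs frame.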
So Lake was the first, if this is alphabetical, to have samples getting removed. Great. And so now I will call this remove_groups, and we'll go ahead and run it. I'm going to modify remove_groups to add a third column that I'll call state. Okay. So we'll do mutate on state: any groups that are in this data frame are getting removed, so I'll set the value to "remove", right? And so again, if I look at remove_groups now, I should have those three columns, right? Now what I can do is join remove_groups with group_count. Normally I do an inner join, which keeps only the rows that are shared between two data frames; a full join keeps all of the rows from both data frames, right? So I could do full_join on group_count and remove_groups. I guess I could do a left_join, which would get me essentially the same thing here, since every row of remove_groups should already be in group_count. Let's do that, because I don't know that I've done a left_join in a long time. We'll join by the dataset and the group; since we have two things to join by, I need to define that as a vector, right? So we're joining by two variables, the dataset as well as the group. Joining that together, we now see, for bioethanol for example, that all of these groups have a state of NA. So I need to go ahead and fix that. We'll do a mutate on state, and I'll say if_else with is.na on state: if state is NA, then I'm going to call that "keep"; otherwise, it's going to be "remove", right? Now we can run that, and it fills those NAs with "keep" values. I can do a count on state to see whether we're in the ballpark for the number of samples to keep versus remove, and we see there are 39 to remove and 2,225 to keep. Again, if we look at those red dots in the original, I'm reasonably confident there are about 39 there.
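The join-and-recode step might look something like this sketch, with toy stand-ins for the two tables:

```r
library(tidyverse)

# Toy stand-ins for the real group_count and remove_groups tables
group_count <- tibble(dataset = c("bioethanol", "lake", "lake"),
                      group   = c("s1", "s1", "s2"),
                      n_seqs  = c(10000, 55, 12000))
remove_groups <- tibble(dataset = "lake", group = "s1", state = "remove")

# left_join keeps every row of group_count; rows with no match in
# remove_groups get an NA state, which we then recode as "keep"
counts <- left_join(group_count, remove_groups,
                    by = c("dataset", "group")) %>%
  mutate(state = if_else(is.na(state), "keep", "remove"))
```

Only samples listed in remove_groups end up with state "remove"; everything else is "keep".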
Okay, so now we have that state column to color our points by, and we are ready to do some plotting. So let's do that. We'll do ggplot on this, with aes: on the x-axis we'll put n_seqs, on the y-axis we'll put dataset, and for our color we'll put state. I am going to do a geom_jitter, and that reminds me: because the jitter in this case is going to place points randomly within each data set along the y-axis, I need to set my seed. So I'll come back up to the top here and do set.seed with 19760620, the best day of the year, don't you know? I'm going to write the output of this pipeline to a variable that I'll call jitter_plot, right? Then I can save that using ggsave, and I will say figures/seqs_per_sample.tiff; I'm going to save it as a TIFF, and the object we're saving is jitter_plot, right? I'll do width = 6 and height = 4, and believe it or not, that looks good. We notice that there is quite a spread in the number of sequences: some things like mice are right up against the y-axis, while other things like bioethanol and human are pretty spread out. That's why we put things on a log scale, so let's go ahead and do that with scale_x_log10. That puts our data on that nice log scale. Again, thinking about what we had originally, we went from one to a million, whereas here we're going from, labeled at least, 10 up to maybe 200,000. We can adjust that log scale with breaks, giving it a vector of values: I'll do 1, 10, 100, 1000, and then in scientific notation, 1e4, 1e5, 1e6. Then I'll do labels of "1", "10", "100", "1,000", "10,000", "100,000", and "1,000,000". It's losing the million off to the right there, so I need to be sure to include my limits here, I think: I'll do limits = c(1, 1e6). So that looks right; we now go from 1 to a million.
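Here's a hedged sketch of the plot as built so far, with simulated counts standing in for the real joined data frame (the alpha and jitter settings from later in the episode are folded in):

```r
library(tidyverse)

set.seed(19760620)  # seed from the episode, so the jitter is reproducible

# Simulated counts standing in for the real 12-study data frame
counts <- tibble(dataset = rep(c("mice", "lake"), each = 50),
                 n_seqs  = 10^runif(100, 1, 6),
                 state   = sample(c("keep", "remove"), 100,
                                  replace = TRUE, prob = c(0.95, 0.05)))

jitter_plot <- counts %>%
  ggplot(aes(x = n_seqs, y = dataset, color = state)) +
  geom_jitter(width = 0, height = 0.2, alpha = 0.2) +
  scale_x_log10(breaks = c(1, 10, 100, 1000, 1e4, 1e5, 1e6),
                labels = c("1", "10", "100", "1,000",
                           "10,000", "100,000", "1,000,000"),
                limits = c(1, 1e6))
```

With y mapped to a discrete variable, geom_jitter spreads the points vertically within each data set's row, which is why the seed matters for reproducibility.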
Again, if you're using a log10 scale, you can't have 0 as one of your values, so it makes sense to start at 1 and then go out to some further limit. Now I'm going to turn my attention to those points. We can come in and add another scale: scale_color_manual, with name = NULL, because we don't need a name on the legend. The breaks here are going to be the values in the data, "keep" and "remove", and my labels are going to be "Kept" and "Removed", which I need to stylize with capital letters. Then we'll set the values for the colors; I think it was red and black, with the kept points black and the removed ones red. I notice I misspelled "removed", and I need to turn down the alpha: we've got the right red and black, but the points in the original plot were transparent, so you could kind of see some of the overplotting. We can do that back up in geom_jitter, where I'll do alpha, let's try 0.2, and I need to fix "removed" there. So we'll try this, and that looks pretty nice. I'm not sure I've got the exact same alpha I had before, but I think it looks okay, and I'm not going to fuss with the alpha too much more until I make the background white like I had in the other version. So let's go ahead and do that now. We can come back up into our pipeline for making jitter_plot and add theme: we'll do panel.background = element_blank(), and then panel.grid.line... I can't remember if it's grid line or grid lines. It's one of the problems of not working in RStudio: I don't have the inline help. So I'm going to pull up the documentation for the theme function so I can remember what those arguments were. Let me maximize this and scroll down. So, panel.grid; "grid line" wasn't even close. There's panel.grid.major, or let's just do panel.grid, because I want to get rid of all of those grid lines.
Actually, you know what, maybe I will keep the major y grid lines, right? So maybe I'll turn the major and minor x grid lines to element_blank(). Again, I can do panel.grid.major.x = element_blank() and then panel.grid.minor.x = element_blank(). Then I'm going to make the horizontal ones, panel.grid.major.y, an element_line(), and I'll do color = "black". Wonderful, we got rid of those vertical grid lines, and we have the horizontal grid lines. If I look back at the original, however, the horizontal grid lines were gray, so let's turn those back to gray; we can do that by changing "black" to "gray", and let's rerun all of this. That looks good. One other thing I'm noticing we can change with the theme is the legend background. So let's do that: we'll do legend.background = element_blank(). That didn't change the color; I think it's legend.key, not legend.background. So let's do legend.key and try that again. Again, without the nice helpful tips of RStudio I'm a little bit lost at times, but sure enough, that did get rid of the background in the key. Now let's turn on our axes, right? That's something else we can do within the theme function. We could do axis.line.x, but let's just leave them both on: axis.line = element_line(). Good, we've got our nice solid axis lines there; I'm happy with that. One thing I'm noticing is that we keep getting this warning message that it removed two rows containing missing values. That generally happens when I set a limit that is too constrained for the data we have, right? And I'm noticing that for Lake here we have one point, and up here we have two points, actually, for Lake and one for mice.
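The color scale and theme tweaks arrived at so far could be collected like this; a sketch, with the gray gridline color and the legend.key fix included, meant to be added onto the jitter plot with +:

```r
library(ggplot2)

# Styling layers as described in the episode; add to a plot with +
styling <- list(
  scale_color_manual(name = NULL,
                     breaks = c("keep", "remove"),
                     labels = c("Kept", "Removed"),
                     values = c(keep = "black", remove = "red")),
  theme(panel.background = element_blank(),       # white background
        panel.grid.major.x = element_blank(),     # drop vertical grid lines
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = "gray"),
        legend.key = element_blank(),             # no box behind legend points
        axis.line = element_line())               # solid x and y axis lines
)
```

Naming the values by break ("keep", "remove") avoids depending on the order in which ggplot2 matches colors to breaks.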
So I wonder if there's something weird going on where we have the limit set to one, and points are just falling below that limit for whatever reason; maybe it's applying some small amount of x jitter that we just don't appreciate. So what I'll do is come back up here, and instead of starting the limits at one, let's do, say, 0.8. Let's run this again and see if that warning message goes away. Sure enough, we now have the point for the mice and the two for the lake, and we no longer get that warning message. I am noticing that the clouds of points are pretty thick, so what I'd like to do is come back up to geom_jitter and set the height. I think the default height is something like 0.4, so let's do, say, 0.3, and let's do width = 0, because there shouldn't be any left-right jitter. I think we might want to make that a little more narrow still, so let's come back to our jitter here: our height was 0.3, let's do 0.2, and I think that'll probably be good enough. I think that looks a lot better. Now what I'd like to do is flip my y-axis: you'll see that it starts at stream and goes down to bioethanol, whereas in the original it starts at bioethanol and goes down to stream. The other thing you'll notice is that the original's data set labels are capitalized, whereas mine are all lowercase. To flip the order, we're going to set dataset to be a factor. Okay. Again, what we have at this point in the pipeline, through lines 32 and 33 here, is a table with dataset, group, n_seqs, and state; that dataset column needs to be mutated to a factor. So I'll add that to this mutate line: dataset = factor(dataset), and for the levels I'm going to use the reverse of datasets. Again, datasets holds the 12 data sets, and if I do rev on those, it flips the order, right? This won't solve the problem of capitalization, but it should flip the order we're working with. Wonderful.
We now have bioethanol at the top and stream at the bottom; I think we're in good shape there. Now I want to turn our attention to capitalizing those different data set names. I think what I'm going to do is copy these and capitalize them manually. I'll call the result pretty_datasets, right? And then I'll just capitalize the first letter of each. We could write a function to do this, but this is already getting a little long, and I'd just as soon not worry too much about that. So we'll go ahead and change all of these. All right, so then we have our pretty_datasets. Great. Now, down here in our factor statement, I'm going to add labels = rev(pretty_datasets), right? If we look at this, lowercase bioethanol should now be uppercase Bioethanol, and sure enough, we now have the right labeling, right? That's pretty cool. I don't know that I've shown that in a previous episode of Code Club, but that's another way we can change the appearance of a factor's values, right? We now have our capitalized data set names. Let's clean up our x- and y-axis names. We'll do that back here with labs, setting y = NULL; let's see what we had for x before. It was the number of sequences per sample. All right, great, we've updated the labels on our x- and y-axes. I feel like the original version is a bit wider than what I have currently, so maybe what I'll do now is come down to my ggsave and make our width eight. That looks good, but it's almost too wide; it's probably seven, so let's do seven. That's cool. This looks pretty good, pretty similar to what we had originally. I'm pretty happy with that. One thing I do want to take a look at, though, is the size of this file.
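A small sketch of the levels-plus-labels factor trick, using a hypothetical three-study subset:

```r
library(tidyverse)

datasets <- c("bioethanol", "lake", "mice")          # hypothetical subset
pretty_datasets <- c("Bioethanol", "Lake", "Mice")   # hand-capitalized

d <- tibble(dataset = c("mice", "lake", "bioethanol", "lake")) %>%
  mutate(dataset = factor(dataset,
                          levels = rev(datasets),    # flips the y-axis order
                          labels = rev(pretty_datasets)))  # display names
```

levels controls the plotting order while labels controls what's printed, so one factor() call handles both the axis flip and the capitalization.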
And so if I come back to my bash terminal and do ls -lth on figures/seqs_per_sample.tiff, I see that this is 7.3 megabytes. That's pretty big. I know journals like those of the American Society for Microbiology limit the size of the figures you can submit, so what I'd like to do is compress this. What we can do is add an argument here, compression = "lzw", to use LZW compression. If we rerun that ls command, we now see it's 261 kilobytes, so it's quite a bit smaller. So compression = "lzw" will compress the TIFF to make it much more palatable to those online submission systems. I'll go ahead and save that, and let me minimize that screen. So we can see we wrote about 59 or 60 lines of code here, and there are a few things we could probably still get rid of. Let's see what we had previously. This is the code as it stands in my GitHub repository for that project. You'll see that all I loaded was the tidyverse. I did set the seed. I got the removed samples by writing a function; I think these remove_accnos files were stored differently before. I think I had them as a series of sample names separated by hyphens rather than a column of sample names, so that's a subtle difference. Again, I did my thing with studies and nice studies. One difference I can see right off the bat is that previously I created a vector of nice studies and then named it with studies, and I think if we look down below at scale_x_discrete, I set the breaks and the labels there, rather than what I did this time around of creating a factor and setting the levels and the labels, right? So that's a bit of a different approach from the one I took today. We again did a map_dfr over the different studies to get the samples to remove. We again made a counts data frame, where I guess before I created a data frame with a column of the studies and a column of the file names.
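A minimal, self-contained sketch of saving with LZW compression; the throwaway plot and temp-directory path here are just for illustration:

```r
library(ggplot2)

# Throwaway plot so the sketch is runnable
p <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x, y)) + geom_point()

out <- file.path(tempdir(), "seqs_per_sample.tiff")

# compression = "lzw" is passed through ggsave's ... to the TIFF device,
# shrinking the file dramatically (7.3 MB down to ~261 KB in the episode)
ggsave(out, p, width = 7, height = 4, compression = "lzw")
```

LZW is lossless, so the compressed TIFF renders identically to the uncompressed one.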
And then I did a group_by, nest, mutate, and ungroup. I think I was on a real kick back then of doing nest, mutate, ungroup, whereas really all I needed to do was iterate over each of these values without the group_by, right? So I probably didn't need the nest and ungroup; I could have basically done a straight-up map like I did in today's episode, and then I wouldn't have needed to worry about all this nesting, right? And then again, I did something very similar with the states. I guess I could have done replace_na on state to change the NAs to "keep", whereas today I did if_else with is.na on state: if that was TRUE, I set it to "keep", and otherwise to "remove". Cool. And again, here we have a factor, and back then I sorted the studies in decreasing order rather than flipping the order, because I had entered them in alphabetical order already. So anyway, then we built out the plot. I'm noticing that I added vertical lines, and I was wondering why, since this plot is horizontal. One thing I noticed is that down here on line 48, previously I did coord_flip: I basically built a vertical plot and then flipped the coordinates so that x became y and y became x, whereas today I built everything as you see it, with the number of sequences across the x-axis, right? Back then it was on the y-axis, and today I put the study on the y-axis whereas before it was on the x-axis. And so you'll notice these geom_vline calls are the grid lines, right? Whereas today I did that within the theme function. So that's pretty cool, right?
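For comparison, a sketch of the earlier vertical-plus-coord_flip() approach as described; the data and the geom_vline positions are invented for illustration:

```r
library(tidyverse)

# Invented stand-in data
counts <- tibble(dataset = rep(c("mice", "lake"), each = 20),
                 n_seqs  = 10^runif(40, 2, 5),
                 state   = "keep")

# Earlier style: build the plot vertically, draw the grid as geom_vline()
# lines at each category position, then coord_flip() to swap the axes,
# turning the vlines into horizontal gray grid lines
old_style <- ggplot(counts, aes(x = dataset, y = n_seqs, color = state)) +
  geom_jitter(width = 0.2, height = 0) +
  geom_vline(xintercept = 1:2, color = "gray") +
  scale_y_log10() +
  coord_flip()
```

Mapping the discrete variable straight to y, as in today's version, avoids the coord_flip() and lets theme() handle the grid lines directly.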
And did I pick the same jitter amount? I used 0.33 back then, and today I think I used 0.2; I don't think it really matters. For the most part, a lot of this other stuff looks pretty similar. So again, I think it's interesting to compare and contrast how our coding styles evolve. One of the things I think is really cool about this is that although the code is different, the plot is basically the same: these two figures are more or less the same, with subtle differences, I think, in things like the alpha, but aside from that they're basically the same figure. So I think it's useful to chart your own growth by perhaps remaking the same figure at different time intervals to see how things change. I know back then I was really on this kick of playing around with group_by and nest and mutate with map and unnest and all that stuff, and I think what we did today, frankly, was a lot simpler, right? This approach, where in something like five lines of code for both the group counts and the removals we were able to read everything in, without having to create a temporary tibble or do the group_by, the nest, the mutate, all that stuff, I think is a lot more elegant, and I'm pretty happy with how this has changed over time. One of the reasons I'm coding differently today than I did back then is that I'm making these videos, right? We saw in a recent episode that we can use these magrittr aliases like set_names to do this all in a pipeline, so I don't have to make temporary variables like the group_count file-name vector I had started to create. Maybe take a plot that you made a year ago and remake it today without looking at the old code, and see how it's changed. Let me know what you find; I would love to hear your results. Anyway, keep practicing with this, and I'll see you next time for another episode of Code Club.