 When I'm analyzing my data, one of the things I frequently will run into is that I'll have different pieces of information stored in different files as separate data frames. My challenge, though, is how do I read those in, clean them up, and join them together? Well, that's exactly what we're going to talk about in today's episode of Code Club. Hey, folks. In recent episodes, we've been looking at a variety of ways that we can visualize a single variable across different categories. So we've been looking at, say, a diversity measure like inverse Simpson diversity index across three different disease groups, people with and without diarrhea, people with and without a Clostridioides difficile infection. And so I'd like to take a step back from the pretty images and cool graphs and animations and whatnot, and think about, how do we get the data into R? How do we clean it up? And how do we join it together? These are all functions that we can do within the de-plyer package from the tidyverse. Now, five, six years ago, before I saw the light and came to the tidyverse, I would do this all in base R using the built-in functionality in R. And it worked, but it was pretty messy. And it was pretty error prone, I found. One of the things I really like about de-plyer is that it makes things really safe and gives me a lot of confidence that things are working the way they should, not like I thought they were and often weren't when I was rolling my own functions with base R. So we're going to go ahead and dig into that in today's episode of Code Club. I'm starting here with a pretty minimal script. You can get this script, if you want, from a link down below in the description. There's a blog post that, again, will have these five lines of code. If you can't type it yourself, that's cool. If you want the files that I'm working with and let's see how I set things up, again, there's a video that I'll link to across the top. And if you find the content that I'm presenting here interesting and exciting and you want to learn more and dig deeper into it, also down below in the description is a link to my materials that I call Minimal R, which is the basis for a three-day R workshop that I teach that really digs into this type of content in a lot more detail. All right, so let's get going. Here, again, we are reading in the tidyverse and we're reading in Read Excel. Read Excel is installed with the tidyverse, but we need to load that into our set of libraries. And that will allow us to read an Excel spreadsheet in as metadata. Read underscore TSV will allow us to get a shared file. And a taxonomy file. So let's read these in. So if we look at metadata, again, this is a data frame. And one of the easy ways that we can look at the data frame is by typing the name of the variable at the prompt. Another thing that we could do is say View Metadata. And that will open up a interface that looks a lot like a spreadsheet. But you can click on it, but you can't edit it. It's Read Only. And you can sort the columns and whatever in this view mode in this tab that's created in the upper left panel. And so in this mode, at least, you can scroll across, see all the different columns. There's a lot of information here. This is a file that we uploaded with the data when we submitted the data to the Sequence Read Archive. This is called a MIMMarks file. And so you'll see things like our primers and PCR conditions and all sorts of different things. The data we originally generated from 454. And then you'll start seeing things like the people's age, race, gender, whether they'd take an antibiotic, proton pumps, so forth. There's a lot of other metadata in here. Clinical data about the patient. The metric we're interested in is this disease stat column. And again, you can kind of scroll down and see all the entries. So both the view and typing the name of the data frame at the prompt will help you to get a sense of what the data look like. So the metadata has the data, the data about the data, the clinical data on each of these patients. We can also look at OTU counts. This is a data frame that has counts of different OTUs, operational taxonomic units. If you're new to this field, think of these as bacterial taxa. If you want, if you have to, think of these as bacterial species. They're not really, but that's not the topic of this conversation. This label is how the OTUs were defined. The group column is the subject ID that corresponds back to this sample ID from our metadata. And then we have the number of OTUs. These are the number of columns that follow the data frame. And then we have the names of the OTUs going across the columns. The values in each of these cells then, this 365, says that OTU 2 showed up 365 times in this patient, whatever the third patient is in the study. So we see now that we've got a table with counts of taxa by person, by subject. We also have the metadata for each of those subjects. The third piece of information that we don't have is an identity for those OTUs. And if you're looking ahead, you might say, well, maybe that's in the taxonomy file, Pat, right? And sure enough, if we look at the taxonomy file, we see a column for the OTU. We see the size, which is the number of sequences in the study that were in that OTU. And then the taxonomy, a string indicating the taxonomy for that OTU. So we have these three data frames that we would like to join together. And perhaps you're noticing, like, you know, I could use the sample ID and the group name and join those two data frames together. But those column names are different. So I'm not quite sure how we might do that. Or you might notice, well, in the shared file, the OTUs are across the columns and in the taxonomy file, they're across the rows. In addition, the column names are weird, right? And they just perhaps don't line up. Also, the taxonomy column in my taxonomy file isn't formatted in a way where it's really easy to get access to, say, the phylum or the family or the genus or whatever taxonomic level I might want to look at. Finally, you'll notice there's also columns in these data frames that I don't need, right? Like, I don't need the size. I don't need the total number of sequences for that OTU across all samples. I don't need the PCR adapters. I don't need all that extra information. And it would be nice to clean things up and remove those pieces of information. Again, that's exactly what we're gonna work on in today's episode of Code Club. I'm gonna start with my metadata data frame. And again, to remind ourselves, this gives us a data frame that has a sample ID and also a disease stat column. I really only need those two columns for the analyses that I foresee myself doing. So to get the sample ID and disease stat column, I can use the select function. So I'll pipe the output from read Excel, which is currently this data frame metadata into a select function, where I will then say sample ID, sample underscore ID, comma disease underscore stat. Run that. And now I can look at metadata and I can then see that I've got the sample ID and disease stat for everybody in this study. One of the other things I might want to do with metadata is then to do metadata and then pipe that to account on disease stat to see what the distribution looks like. So the output of this shows me that I have an NA value. So that's NA 14. So there's 14 rows in this data frame that have an NA. Again, I could do metadata and pipe that to filter and then say disease stat equals equals quote NA. One of the things I notice is that NA is not read. In this case, it's black here in our studio and it's in a column of type character. So I could then disease stat equals equals NA the string. I don't really want NA stored as a string. I want it stored as an NA. And so we'll come back and see how we can fix that. And sure enough, what I'm seeing here is that these NAs are actually correspond to MOCs and GD, which is what we call a generous donor sample. So I want to go ahead and get rid of those. To do that though, I need to go ahead in my read Excel statement and do NA equals quote capital NA. And so if I run that and then I repeat my count, I now see that my NA here is red. And so that's in good shape. And now what I can do is I can do drop underscore NA and put in disease stat. So now metadata will lose all of those rows that have an NA value in the disease stat column. So now I'm happy with the way metadata looks and we're ready to move on to our OTU counts. So again, our OTU accounts, you'll recall has different samples in our rows and our different OTUs in the columns. There's two extra columns in OTU counts that are at the beginning. One is label and one is numOTUs. And so there's two ways to get rid of those. And so let me show you those quickly. So I can do select and I can do minus label and minus numOTUs. And so whereas up here we used select to get the columns we wanted, if I include a negative sign before the column name, I then drop that. Takes a moment or two to run because the data frame is really wide. In R, frankly, hates really wide data frames. There's other packages we could use to read it in that'd be a lot faster like the F-read package. But for our purposes today, this is good enough. We see now that we lost that label and numOTUs column. Another way that we could do this, if I comment that out and I could do select group and then I could do starts with starts underscore with as a function and then OTU with O being capitalized to you paying lowercase. And then if I run this, and so I get the same output, right? I get the group and then the OTU columns. This is the same output that we got if we use select with the negative sign. I'll leave both of these in here. I personally prefer to be positive and telling R what I want, not what I don't want because if the output were to change or the input were to change somehow, it's I think safer to be positive indicating what you want to get out rather than what you don't want to get out, right? One thing that I would like to do perhaps is change the column name of my group to be sample ID to match what I had up above in metadata. So I can do rename and then I can do sample ID equals group. And now when I look at this data frame, I see I have sample ID in that first column. Looking ahead to my taxonomy file, my taxonomy file had those OTU names in rows under a OTU column. Here I've got my OTUs in columns. So this is not tidy, right? I would prefer to have my sample ID, my OTU number, and then the count for each OTU in each sample. So a three column data frame rather than a data frame that as we see here has close to 2,500 columns. Again, R really struggles with wide data frames. And so we can tidy this using the pivot longer function. So I can do minus sample ID and that will pivot longer all of the columns except for sample ID. And the names will go to names to names underscore two equals and I'll say OTU and values to alcohol count. And so as we can see now in OTU counts, we have the sample ID, the OTU and the count. So we can see that OTU three shows up in DA 00006 250 times. And that's now a very long data frame with over 840,000 rows and three columns. And so R just loves working with these tidy data frames. And again, this is the plier, right? We're part of the tidy verse and we're working with data in this tidy format. Now we wanna go to our taxonomy data frame where again, we've got our OTU column, our size column and our taxonomy column. And I can start to clean this up by doing select and I will do OTU and taxonomy. Again, you could say minus size, whatever works for you. And so we've dropped that. My column names are all uppercase at this point. I really prefer to work in lowercase column names so I don't have to wonder like, was that capitalized or not? If I just know everything is lowercase and I don't have to think about it and there's a lot less kind of cognitive burden on me. So I can do rename all and then the argument to rename all, I can say to lower. And what this will do is this will apply the to lower function to all of my column names. And we see now that I have OTU and taxonomy being lowercase. And so now I can see that I've got sample ID that I can use to join metadata and OTU counts. And I now have the OTU column that I can use to join OTU counts and taxonomy. Before I do those joins though, one last thing that I would like to do is get this taxonomy column into a format that's easier for me to work with. And so I might want like the phylum name here Bactroides or I might want the genus name Bactroides, right? And so let's go now and see about how we can clean this up a little bit further. Because this is gonna take a few steps, I'm gonna go ahead and remove that taxonomy variable name so that when I run this, it outputs it directly to the screen without me having to write taxonomy over and over again because I get kind of tedious. And so here we're gonna go and use some functionality from a package called string R. And the first thing I wanna do is remove those parentheses with the number in the side of that. That number indicates the percent of those sequences in that OTU that had this taxonomy. And for my purposes, I don't really care about that for here. So I'll do mutate on taxonomy and I'll then say str replace all. So I'll remove every instance of what I'm gonna be looking for and replacing in the taxonomy column. And I will then say the pattern I want to find is something that starts with a open parentheses. And so to match an open parentheses, I need two backslashes. And then I wanna match a digit. And so to match a digit, any digit, I need back back D. And if I wanna match zero or more, I would use a star. And if I wanna match one or more digits, I'd use a plus. So I'm gonna go ahead and use the plus. And then as you can see in my pattern, I need a closing parentheses. So I'll again do back back parentheses. And so this argument is a regular expression. And there's so much you could do about regular expressions, but we'll save that for another day. But you can hopefully see that this pattern will match any number in a pair of parentheses. And because I'm using that back back D, if it ran into a letter, it's not gonna match it because a letter is not a digit, right? And so that will work. And then I'm going to replace that actually with nothing. And you'll notice I'm not matching the semicolon. And that's because I'm gonna use the semicolon for something here in a moment. So let's go ahead and run this and see if it does remove those numbers in the parentheses. And voila, we got rid of those numbers in the parentheses. And that's very nice. Okay, so the next thing that we're ready to do is to separate our columns into the different taxonomic levels. And I can do that using that semicolon that I left in here. So I will pipe this into the separate function and I will separate taxonomy and I will then say into, and I'm gonna give it a vector of column names that I want. And so again, King Philip came over for good spaghetti. That's how I remember a kingdom phylum class order family genus species. So we'll do a kingdom phylum, King Philip came class order family genus. And we don't have species information. So I'll leave that there. And then we will do a sep equals and then in quotes I'll put a semicolon. So we'll go ahead and run this. So we get a warning message, expected six pieces, additional pieces discarded in 5,445 rows. That's like every row there is something missing. So let's go ahead and let's look at what is in that taxonomy column for say, O2U1. So I'm gonna go ahead and add to this, filter O2U equals equals O2U001. That's not getting me the full thing, unfortunately. So let me then do pull taxonomy and that will give me the value that's in that taxonomy column for O2U1. And what I can see is there's a semicolon at the end of my string. And so what's happening I think with separate is that it sees that semicolon and it creates an extra column at the end. So I need to go ahead and also remove that final semicolon from all of my strings. So I'll leave that there actually for now to test things and make sure things work. So I'm going to add another string replace function called on taxonomy. So I'll do taxonomy equals strreplace. And I'm gonna use replaced because I'm only gonna make one replacement on taxonomy. And here we're gonna use another regular expression which is a dollar sign. And what the dollar sign means is match the previous character occurring at the end of the line. If I used carrot B that would mean match the B at the beginning of the string. But if I use a dollar sign that means match it at the end of the string. And I will again match that with nothing. And I think I need a comma there, not an M. So let's go ahead and run that. And we now see that we lost that semicolon at the end of our string and we're in good shape. So now if I remove that filter line put the separate back in there and run it. We're good to go. We get all the columns that we expected to get. And we are now ready to join everything together. So I'll reassign this to taxonomy. So how do we do all this joining? Well, I thought you'd never ask. So we've got metadata, we've got OTU counts and we have taxonomy, right? And so these are the three data frames that we'd like to join together. Well, to join those three data frames we're going to use a function called inner join. And what inner join does is it will take two data frames and join them together on a column. And you can give it one column, you can give it two columns, you can give it multiple columns, whatever. And it's only gonna output those rows from the two data frames where the column you're joining on is found or the value from that column is found in both of the data frames. So if I do inner join metadata and OTU counts. So now I see in this output I have the sample ID, the disease stat, this is coming from metadata. I have the OTU number and that's coming from my OTU counts data frame and then count, which is also coming from the OTU counts data frame. So that's pretty slick. I've joined two data frames together. If I was concerned about what might be missing from one data frame or another, I could do anti-join metadata OTU counts. And this would tell me what's in metadata that's not in OTU counts. And I see that it says tibble zero by two, nothing is missing. I could also then flip the order here to do OTU counts metadata and see that nothing is missing there as well. And again, running that, one of the things that I see is that it says joining by sample ID. I prefer to be explicit in telling inner join what I'm joining on. So I could do by equals sample ID. Again, to make that explicit, it's pretty smart at figuring these things out for you. But again, I find it safest to be explicit. So now when I run that, I no longer get that message. Okay, I'm gonna move this anti-join up ahead here. And what I need to do now is I need to join in the taxonomy information so that I can add that to my OTU information. And to get that, I can now pipe this to another inner join. Right, and so now you're saying, well, I've got taxonomy and I wanna join by OTU. So I'm gonna join by the OTU column, but what am I gonna join to taxonomy? Well, in the plier, when we are using the pipe that we get from MagRitter, you can use a period comma and the period will indicate the data flowing through the pipeline to that point. So I can put period taxonomy, again, the information from this first join in the left side of the output a data frame and then joins to it the taxonomy using the OTU information. And if I were to kind of flip this order, so the taxonomy comma period, then what we'll see is that we get the taxonomy information first and then the metadata second. I tend to put the period first. I don't know why. I think it's easier for me to read that the information from the pipeline is going into that first slot. So we now have this data frame where we have the subject sample ID, their disease status, the OTU for each subject, the count, and then the taxonomy information for each of those taxa. This data frame looks great. One last thing that I would like to add to this, however, is a column for the relative abundance. We have the count, but I'd like to convert that to a fractional number. So how do we do that? Well, again, we can use more functions from the plier. So I will say group by group by allows me to take my data frame. And in this case, I have 841,000 rows and it'll allow me to chunk those rows of the data frame according to some variable that I have represented as a column. So I want to group by my sample ID because I'm gonna wanna get the total number of counts for each sample for each subject. And then I'm going to count the total. So I'm gonna get the sum across all the counts and then use that to divide as a denominator to get a relative abundance. So we'll group by sample ID and we will then do mutate relabund and we will then say relabund equals count divided by sum of count. So it's only gonna do the sum within each subject and we run that and we now see we've got this relabund column off to the side. One last thing about this data frame is that it's still grouped by sample ID. I find that this grouping tends to cause problems down the road and just to be safe, I always like to keep things ungrouped as much as possible. So I'll add ungroup to this pipeline. And now when I look at the output, I see it's no longer grouped. I will then save this data frame as O2U underscore relabund. I could probably even go ahead and then select minus count to get rid of that count column. If I wanna double check that my relative abundance is for each subject add up to one, which is always a safe thing to do. I could again do O2U relabund and then I could do group by and I could group by sample ID and I could then pipe that to summarize and I could say total and I could then do some relabund and then all of the values should add up to one. And so we'll run that and we see that we get one for everything. One last thing to comment on with this data frame is that for some purposes it might actually be considered wide because we have all these taxonomic levels as separate columns. It might be preferable to kind of make that tidier still by making a column for say taxonomic level and then taxonomic name. So let's go ahead and do that because we never know where we might need to use this. And so what we can do is we can again do pivot longer. So I'll do calls equals because my typing as you all know is horrible, I'm gonna copy these down and I'm also going to add O2U as another taxonomic level type and then I will add to that names two equals level and then values two equals taxon, right? And then we need to close in parentheses to close that out. So now we see that we've got a tidy day frame with our sample ID, disease status, relative abundance, the taxonomic level and the taxonomic name. This will serve us very well as we go into the future and thinking about doing analyses at different taxonomic levels or with different taxonomic groups. And so again, coming into subsequent episodes here, we're gonna look at a bunch of different ways, some good, some not so good, ways of visualizing this relative abundance information and leading us forward to looking at more types of data. And instead of looking at a single variable like the diversity measure, looking at many measures of a community and looking at these different taxonomic groups and the relative abundances. So I hope you stick around with us. Please be sure that you've liked this video and that you've subscribed to the channel so you know when these upcoming videos are released. I'm trying to do two or three episodes a week. Things do get busy, but you know, I think this is important. I really have a lot of fun doing it. If you like this type of content and would like to learn more, please be sure that you check out therifomonas.org website. On there I have two tutorial series, one called Minimal R, which is this type of material directly for microbial ecology. There's also a general R material that's up there that covers this material, but not with microbial data. Also know that I teach this content in three day workshops. I have a couple of workshops coming up this summer in June and July. Would love to have you participate. And again, you can check that out atrifomonas.org. Anyway, keep practicing. Please tell your friends about Clodeclub. It's been great to see the growth of this channel. We'll see you next time for another episode.