 Hey folks, in the last episode I mentioned that we are going to start a series of videos talking about how we can use machine learning tools to analyze microbiome data. We're going to use the micropml package that my lab developed, but before we can do that, we first need to get the data into the shape that it needs to be so that we can analyze it using micropml as well as some other tools. We did something like this previously with data from a paper that my lab generated from Alexandria Schubert looking at Clostridioides difficile. So we're largely going to repeat what we did in that episode again, but with data from Neil Baxter's study, again looking at data collected from stool samples of people with different degrees of lesions in their colons. This is such an unheralded part of an analysis of getting your data into the right format, making sure everything looks good before you go on with the cool machine learning stuff. So that's what this episode is going to be about. I'm going to talk to you about how we can reformat the names of the columns, how we can reformat the data in those columns, how we can make sure everything looks good, and then how we can join together all those data frames to have a single data frame that tells us the number of times each taxa, each genus shows up in a community, and whether or not that community is from somebody that has something wrong with their colon or not, right? Whether or not that person has colon cancer or an adenoma, or whether it's perhaps what we'll call a screen relevant neoplasia or just a lesion, right? So that's the goal of today's episode. And so we're going to head over to our studio now and I'm going to show you how we can do this. I'm here in my micropml RStudio package. I'm in my working directory off the desktop. Again, micropml demo over in the lower right corner, you can see the various files. Again, I have this project being watched, so to speak, by version control. And so up in my git tab in the upper right corner, you see nothing has changed. And so everything is current and up to date. I'm going to go ahead and save this new R script that I'm going to be working on in my code directory. And I'll call this genus analysis.r and we'll save that. And so now what we see again in git in the git window that we have a new file that's not being watched. And before we're done today, we'll go ahead and commit our changes and then push that up to GitHub. But we need to get going writing code in our genus analysis.r script. And as we always seem to do, we'll start with library tidyverse. We now have all that great tidyverse functionality loaded into our studio session here our session. And the first data frame that I want to read in is my shared file, the shared file is generated by mother. And the rows are the different samples and the columns are the different otus or the different taxa and then the cells are the number of times each otu is seen in each sample. We can do read tsv because it's a tab separated values file. And then we'll do raw data because it's living in the raw data directory backster dot sub sample dot shared. This takes a moment or two to load, because the data frame is very wide. r does not like working with really wide data frames. It much would rather have a much longer data frame and a very narrow or what we'll call tidy data frame. And so if I look at this output here, you'll see that we've got 490 rows and almost 5500 columns. So it's a really wide data frame. What I'd like to do is tidy this so that I have three columns. So I'll have a column for the sample or the group, a column for the otu designation, and then a column, a third column for the number of times that otu showed up in that group. So the first thing I would like to do is I want to dictate to read tsv what my different column types are. It does a very good job of guessing. And so if we look at the output here, it tells us that the default for all the columns was called double. I can specify it by adding an argument called types equals calls, and then I can say group equals call character. And so that will make the group column of type character or sort of like a string. And then I can say dot default equals call double, which is what it had already been doing. Now when I read that in and look back up at the data frame, I see that group is of type character and everything else is of type double. You'll notice that actually loaded a bit faster, right? Because it didn't have to go through the algorithms of trying to figure out what all the different columns data types were. It just runs a lot faster. The next thing I want to do to clean up my data frame is to make all of my column names lowercase. I like to do that because then I don't have to think about the capitalization of my columns. So I can do rename all, rename underscore all, and then you give it a function that you want to apply to all those column names. And so I can do two lower. And now I see all my column names are lowercase already. And so that is good. I'm going to go ahead and add a select to keep the group column, as well as all of the columns that starts with O2U. This should get rid of that label and them O2U columns. And again, if we look back up here, sure enough, we have our group and our O2U columns. Now we're ready to tidy our data frame to make it long rather than wide. And to do that, we can do pivot longer. And we will do minus group. So we want to pivot longer all of the columns except for the group column. And we'll give names to a value so that column names need to go to a column. And so I'll say O2U and values. So the values in the cells need to go to a column. And that will be count. Now we see we have a three column data frame with a group column, O2U column and a count column. And it is tidy, right? You'll notice that it's about 2.7 million rows and three columns. R will just love this, right? It far prefers long to wide data frames. So I'm pretty happy with the way this is structured. I'll go ahead and save this to a variable that I'll call shared. And now we are ready to move on to thinking about the taxonomy data. And so to do that, that's also a tab separated values file generated from mother. And so we'll do read TSV. And we'll say raw data forward slash backster.cons.taxonomy. And we'll load that and see what it looks like. So I have three columns in this data frame, O2U size and taxonomy. I'm pretty happy with the data type for each of those three columns, a character, double and character. I'm actually going to get rid of that size column here pretty soon. The first thing I'd like to do though is take those column names and make them all lowercase. Yes, I realize the irony of the fact that I wrote mother. And I have these odd column types, column names that I don't like the formatting of. It's okay. So we'll do rename all again on and do to lower on those. And so we see we've got O2U size and taxonomy. That's great. I'm going to go ahead and get rid of size. So I'll do select on O2U and taxonomy. I could have also done minus size rather than specifying O2U and taxonomy. And now I'm going to do a mutate line. And mutate allows us, as you've seen in the past, to mutate or change columns, we can change an existing column by writing back over that column, or we can add new columns. So first, let's write over the O2U column. And I want to make all those O2U names lowercase, because way up ahead here, where I had that three column data frame, right, all those O2U names are lowercase, which, you know, maybe we shouldn't have done rename all on the shared file, but whatever, it all works out. So we'll do O2U equals to lower on O2U. And then this column, after I run this should have all those O's being lowercase. And sure enough, we see all those O's are lowercase. The next thing that we want to do is to work with this taxonomy column to clean it up a bit so we can get genus level names. So this is going to take us a few steps. And at some point, I should do an episode on regular expressions and using things like strReplaceAll. I'll kind of give you a brief demo here on how to do that. So again, we're going to modify the taxonomy column with stringReplaceAll. And we will use as input to that the taxonomy column, right? And the first thing I want to remove are these numbers in the parentheses. The numbers in the parentheses tell us what percentage of the sequences in the O2U had that classification. Ultimately, I want to get a genus level name so that I can pull all the O2Us that are from the same genera, okay? To do that, let's start by giving it a pattern that we want to match and that we then want to replace. And so I can do back, back, slash, two backslashes and then an open parentheses. So an open parentheses and closed parentheses are special characters for regular expressions. But if I use two backslashes, that means match an actual parentheses. Then I can do back, back, D, which is a special meta character that matches a digit character. So like a number, right? And then if I do plus, that means match one or more of the preceding character. So match one or more of a digit, right? So this should match like 153, 26, 83, right? Any number that has one or more digit to it. And then I can close that out with a closing parentheses. I want to then replace that with nothing. So now if I run this, I get rid of all of those numbers in parentheses as well as the parentheses, and we're in good shape. One thing I noticed, however, is that the data that I got out of whatever version of mother I was using has unclassified as a, in this case, a family name and the genus name for this O210. And so if I cluster things by the genus name, then at least O210, the genus name is going to be unclassified, right? I'd rather it to be unclassified. Cluster the alleys, right? Or cluster the alleys unclassified. If you look at output from newer versions of mother, we do that automatically for you. But we're going to do that here in R because hey, we can why not, right? So it's going to take a little bit of trickery to figure out how to do this though. And this might take a few us str replace all commands. But again, we'll be, we'll be able to figure it out. I'm going to go ahead and do another taxonomy. And we'll do str replace. I am going to look, I'm going to use str replace will to match the first instance of something. And so again, it's going to be taxonomy. And I'm going to match that semi colon unclassified that we have here. And I want to replace that with an underscore unclassified. And so if I do semi colon unclassified, unclassified, make sure I spell that right. I then want to replace that with an underscore unclassified. Right. And so now I see you've got cluster the alleys unclassified, semi colon unclassified. I also know that if I do a tail on this, that there's another one at the end here, right? Or I've got room in a CACA unclassified as well. And so let me do a select on taxonomy. So we get more information from that taxonomy column. And so again, I've got cluster the alleys room in a CACA unclassified. The next step is that I want to remove everything except for that final taxonomy name. But even if it's like cluster the alleys underscore unclassified, I want to get if there's a unclassified after that, I want to get rid of that as well. So I want to kind of focus in on the last classified name that I have in my string. To do that, we'll add another line here to our mutate. And I'll do taxonomy equals str replace all on taxonomy. Again, and I will then do unclassified. So if I replace that unclassified with a nothing, then that will kind of work. But that will turn this like room in a CACA unclassified into room in a CACA underscore semi colon, which isn't what I want. So I want to I want to find anything that has a semi colon unclassified, and I want to remove all cases of that. So let's go ahead and see what this gets us. Again, this might take some trial and error, but we'll get there eventually. I think one too many. Oh, I forgot a comma at the end of this line. So we didn't lose our room in a CACA unclassified after running that. If we look at the first 10 rows, let's see what we get here. Again, we've got cluster the alleys unclassified. And that seems to have worked pretty well as also. Now what we need to do is let's get rid of that final semi colon. And that would that allow us to do that is to kind of chew up everything to the last semi colon, the semi colon before the name that we want, right? So for example, with this Rosberia line, I'm going to get rid of this final semi colon. Then I can do another str replace all where I then get rid of everything going up to that semi colon. Okay, so again, this can get a bit tedious. But as long as we get the right thing in the end, we're in good shape. So we will take that final semi colon, and we'll get the final semi colon by putting a dollar sign at the end of it. And so we can then also get everything up to a semi colon by doing period star semi colon and replacing everything up to that with nothing. And let's make sure that we've got all that. And so now what we see is that we have our genus names or the last name that was named and nothing else upstream or downstream. And that's in good shape. So let's go ahead and add a tail. And we can then also see that we've got that room in a caucasia unclassified there as well. And so the taxonomy stuff all looks good. I'm pretty happy with that. We can then remove this and see that we now have our otus and our taxonomy names. And I think we're good. So I will go ahead and save this as taxonomy. And we'll sign that data frame to that name taxonomy. Again, these five steps here get a bit messy. And if we'd have used more modern version of mother to generate these data wouldn't be such an issue. But at the same time, hopefully this kind of shows you the logic of working through using string replace string replace all to kind of force a string into a certain pattern that you can work with to get output the way you want it to look. Next, what we want to do is join together our shared and taxonomy data frames. We can do that with an inner join on shared and taxonomy and then by equals otu. We now have our group or otu account in our taxonomy. What we'd like to do now is collapse together our otus that have the same taxonomy from the same sample. And so I can pipe this output to a group by group and taxonomy. And we can then do summarize count equals some on count. And I'll go ahead and do that groups equals drop. And now what we get is we have our group or taxonomy and the count. We do have zeros in here indicating that like this ABO trophy was not found in this group. 23650, but a city or micro car Casey unclassified was found in this groups 23650. Okay, what I'd like to do now is get a relative abundance, and I can get a relative abundance by grouping by the group, and then getting a total count for each group. So to do this, we can do mutate relabund. So we'll create a new column relabund that equals count divided by the sum on count. So you'll notice we're not using summarize we're using mutate instead, because I want to get out as many roses I'm putting in. And so the data are grouped by the group right this this patient ID. And so this sum count will be the sum of all the counts within each of the groups. And we run that. And now we get relative abundances within each of the groups. And again, this is grouped. So to kind of show that this worked I could do summarize some on relabund. And let me call this total. And so now all of the groups have a total of one, because the relative abundances within each group, again, add up to one. That's not what I want. I'm going to go ahead and feed this into an ungroup. So that I remove that grouping variable on the subject ID. I'll also go ahead and get rid of the count column. So I'll do select minus count. And now I have my group, my taxonomy and my relabund. I'm pretty happy with that. And the next thing I want to be able to do though, is I need to know for each group, or each subject, what the metadata is. So is patient 2003 650. Do they have a screen relevant neoplasia or not do they have cancer they have an adenoma? Do they normal? Are they male, female? How old are they right that type of information to get that I need to read in the metadata. Again, metadata is a tab separated values file, and I can do read TSV. And I can then say raw data forward slash backster dot metadata dot TSV. There also is an Excel workbook version in the raw data directory, but let's let's stick with the TSV. This then gets us our data frame and you can see our data frame columns have all sorts of different capitalizations. I'll go ahead and do rename all to lower. So if nothing else from this episode, you should now be familiar with that rename all to lower step, where now we see that all of our column names are lower case. I'll also go ahead and make sample a character column. And again, we can do that by doing call types equals calls sample are actually it's going to be upper. I forget if it was uppercase or lowercase originally. Let's look back here. And we see that sample was lowercase before. So this should be sample lowercase equals call character. And now we see that sample sure enough is of type character. And I'm going to leave everything else the way it currently is. There's variables in here columns in here that aren't properly formatted. So this HX pre as previous history is zeros and ones and it's coding it as a double when it should be a logical. I'll leave that for you to figure out. And again, if you want to learn more about working with this data frame and getting the formatting right, what you could do is go to my minimal are instructional materials on the rifflemonis org website. And you'll see all sorts of instructions or kind of tutorials on how we can format this data frame to get it to be kind of the way you'd officially want it if you're working on a paper and really wanted to be working with all of these different variables. But for our purposes, the data frame is in pretty good shape. I do need to add a couple columns. And so I need to get a screen relevant neoplasia column as well as a lesion column. Because down the road, I'm going to want to use the microbiome data to predict whether or not somebody has a screen relevant neoplasia or any lesion at all. If I do count DX bin, this will then tell me what is in my diagnosis bin column. And I can then use that to mutate and make a variable that we'll call SRN so screen relevant neoplasia screen relevant neoplasias are diagnoses where we have advanced adenomas or cancers. So I can then do screen relevant neoplasia equals DX bin equals equals ADV adenoma or DX bin equals equals cancer. And again, if I then pipe this to count SRN, I now see that I've got 261 non screen relevant neoplasias and 229 screen relevant neoplasias. That is good. And then I can also make one called lesion, which will be largely the same formula, same expression. Except I'm going to go ahead and add adenoma here, DX bin equals equals adenoma. And then let's count lesion. And so now we have 172 falses and 318 truths. Okay, so I'll remove that count lesion. I'll then call this metadata. And that is now saved to that object. And I can then do another join here to do an inner join on sample or group. So I'm recalling that to this point in my pipeline, I think, I think, yeah, so I've got group rather than sample and sample is what I had in metadata. Right. So yeah, sample. So I need to go ahead and rename sample to become group. So I'll do group equals sample. And then let's look and make sure that that renames the column to be group. Sure enough, we're group. So I'll go ahead, I'll go ahead and save that to metadata then. And then in my join, I can take that inner join I can take the data coming through the pipeline and join that with metadata. And I'll say by equals group. And I couldn't I didn't have to change the name of the group in my metadata I could have left that as sample. And in previous episodes, I've shown that you can use by equals and then use the C function to say like, in this case would be like group equals sample to tell it what columns to use in the two data frames. And I always said, well, you can also rename the columns and do it the simple way like this by equals group. So that's, that's what I'm trying to do here. Now what I have is I've got the group column, the taxonomy and the relative abundance, as well as all of the patient metadata. Again, for all those comparisons, which now allows me to do things like looking at whether or not each genus is different across the different disease status groups. I'm going to go ahead and call this composite. And I'm happy to have composite be my ending point for today. Really what we've done in this episode is take those three different data frames, clean them up a little bit. We had to do a little bit of work on that taxonomy data frame to get it to be properly formatted so that we're not, you know, creating a unclassified genus that we're pooling across. Instead, we renamed, you know, those things like Rumenokakeca to be Rumenokakeca underscore unclassified to make that its own genus because the family level in that case, or maybe the order was as far as it could be classified. The last thing that I need to do is save my R script. And I've got this code genus analysis dot r file that I need to go ahead and stage and add to my repository. I can go ahead and do commit for my commit message. I'm going to have kind of a pithy direct statement about what I did in this code chunk. And so I'll say add code to clean up and join data frames to get relative abundance for each genus. Okay, and I can then say commit that went close and then I can push. Now this is up on GitHub, and we have a permanent backup for it up there on the cloud. And anybody that's trying to follow along in a future episode will be able to grab that commit to get going from where we start on the next episode. So again, this isn't the glamorous data science work of working with machine learning algorithms are doing amazing statistics are making beautiful visuals. But it's very important, right? It's very important to be able to work with data to massage the data to get it to be in the right format that we can join different data frames together to get, you know, a data frame that we can then use as input to downstream analysis steps. And that's exactly what we're going to do in the next episode. So be sure that you're subscribed and you click that bell icon so you know when the next video is released. In the next video, we're going to kind of motivate the need for using machine learning algorithm tools. What we're going to do is we're going to look at each genus individually and say are there genera that are significantly different between different disease status groups people with or without a screen relevant neoplasia without without a lesion. This is stuff we've kind of done before with the C difficile study. But again, I want to see it in this context, so that when we go to machine learning, we'll see, you know, what is the difference between doing individual taxa level analysis versus what we can do with machine learning where we're looking at really the whole collection of genera or otus or families or ASVs or whatever level you want to look at. This is a good stopping point. And we'll see you next time for another episode of Code Club.