Something I find just super tedious is manually editing text that's going to appear in a figure or a table. Why is it so tedious? Well, if you're like me, you don't just do an analysis once; you might do it five times. If I'm manually editing that text five times, it gets overwhelming and invariably I miss something. That's why I like to code those modifications as much as possible. So how do we code these modifications to update text and make it look pretty? We'll use something called a regular expression, and today I'm going to talk about how we can do that in our R scripts for making really attractive visuals in ggplot2.

Hey folks, I'm Pat Schloss and this is Code Club. Today I'm going to let you in on one of my pet peeves. What I've been noticing in a lot of figures recently for microbiome studies is that they'll have a bacterial name, at any taxonomic level, and it's clear that the authors didn't know how to italicize it, because it'll be in a standard upright font rather than in italics. Now, journals vary in how they handle italicization of bacterial names. At the American Society for Microbiology (ASM), the journals' instructions to authors call for all taxonomic ranks to be italicized. At the same time, sometimes we might have "unclassified" and then, you know, Bacillus. What I'll see is that sometimes both "unclassified" and "Bacillus" are italicized, or neither of them is. And what's even worse is that sometimes I'll see an underscore between "unclassified" and "Bacillus". Come on, people!

As we proceed to working with data for operational taxonomic units, we might also want to indicate the OTU number, so you might want something like "Unclassified Enterobacteriaceae" and then in parentheses "OTU 23". There's a lot of formatting that has to happen to make that all look good, and that's exactly what we're going to do in today's episode. We're going to use something called regular
expressions, using functions from the stringr package. We'll also revisit our old friend the glue package. We'll ease into regular expressions because it's an area that I find really powerful. It's also very confusing sometimes, and it's very easy to screw things up. But don't worry, you won't really break anything; you'll just have to iterate a few times before you get it right. So I'm going to introduce you to a few of the more basic things within regular expressions, which are called quantifiers. If this all seems new to you, don't worry; I'm going to go through it with you. Let's head over to RStudio so we can get going.

I'm going to start with this vector of OTU names. In the actual data we've been working with in past episodes there are a few thousand different OTUs, and this gives us a general feel for what the OTU names look like in our data set. What I'd like is for Otu0001 to instead be OTU 1. So I want OTU 1, OTU 10, OTU 100, OTU 1000. I want OTU to be in all caps, I want a space between OTU and the number, and I don't want those leading zeros.

We're going to get this to work using functions from the stringr package. The stringr package is part of the tidyverse, so I'll do library(tidyverse). I can use str_replace. For str_replace I'll give it the string that I want to work on, so let's start with Otu001, maybe with one more zero in there, and then we give it a pattern and a replacement value. The pattern is what we want to match and the replacement is what we want to replace that with. If you've done find and replace all in something like Microsoft Word, it's the same idea, but this ends up being far more powerful than what you can typically do in Word. So, for the pattern I might use to get Otu0001 to be OTU 1,
I could use Otu0001, or rather, let me remove that 1, and for my replacement I'll do OTU and a space. Let's see if we get it. That outputs OTU 1. Good, we're winning, right? Well, what if we had the second value in my vector, Otu0010? It doesn't do anything, because it can't find this pattern in the string I gave it. We could try it with the other values of our vector as well. You could say, well, Pat, you could just rerun it with one fewer zero and then you'd get OTU 10 out. Definitely, that works. But I don't want to make a different regular expression for each value that I'm looking at.

So if I come back to the original pattern, something I could put in here would be Otu0 and then a plus sign. A plus sign means match the preceding character, that zero, one or more times. So I want to match that zero once, twice, three times, four times, all the way up to however many there are. Remember, Otu000 didn't match anything in Otu0010 up here. With the plus sign, we should now get out OTU 10, and sure enough we do. We could replace the string with Otu0100 and that works as well. So the plus sign matches one or more instances of the preceding character.

If I do Otu1000, will this work? No, it should not. Why not? Because we're expecting the pattern to match the zero character one or more times, and there's no zero right after the u. So that didn't work. What can we do in that situation? There's another quantifier besides the plus sign that we can use, which is the star. The star matches the preceding character zero or more times. So the plus is one or more times and the star is zero or more times. That works now. That's pretty wonderful.
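The difference between the literal pattern, the plus sign, and the star can be sketched in a few lines. This is a minimal example, assuming the OTU names are written like Otu0001 (the four names below are stand-ins for the vector used on screen):

```r
library(stringr)

# Assumed example names; stand-ins for the vector shown in the episode
otus <- c("Otu0001", "Otu0010", "Otu0100", "Otu1000")

# Literal pattern: only matches a name with exactly "Otu000" in it
literal <- str_replace(otus, pattern = "Otu000", replacement = "OTU ")

# "0+" matches the zero one or more times, so Otu1000 is left untouched
plus <- str_replace(otus, pattern = "Otu0+", replacement = "OTU ")

# "0*" matches the zero zero or more times, so every name is reformatted
star <- str_replace(otus, pattern = "Otu0*", replacement = "OTU ")

print(literal)
print(plus)
print(star)
```

Only the star version reformats all four names, which is why it is the one carried forward.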
So let's put it all together and run str_replace on our OTU vector. The pattern is Otu0 star, because we want to match the zero zero or more times, and the replacement equals OTU and a space. Ah, the names are written Otu, not OTU, so the pattern has to be Otu. And then we get the nice formatting of our four different OTU labels. I didn't have to manually go in and change those at all, so that was pretty nice.

One other quantifier that I want to briefly show you is the question mark. The question mark means match the preceding character zero or one time. Where this is typically used is for things like color. In US English it's c-o-l-o-r, whereas in, say, British English it's c-o-l-o-u-r. So you could match colour with a question mark after the u, and that would match both the US and British spellings. So if I put a question mark after the zero... well, let's see what this does, because I think it's going to give us some funny results.

The results are a little funky, so let's look at what it did. The question mark matches that zero zero or one time. For Otu1000 it matched the zero zero times, because there was no zero between the u and the 1, and it worked for Otu0100 because it matched that zero one time. It did the same thing for the other two names. It did exactly what we told it to do: it replaced Otu0 with OTU and a space. But it left the remaining leading zeros in place for OTU 1 and OTU 10. So what we clearly want is the star, to match that zero zero or more times.

Again, these are three quantifiers that are really powerful for telling your pattern how many times to match a preceding character, especially when you don't know in advance how many times you're going to see that character. I could give this pattern to any number of OTUs, right?
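The question mark behavior described above can be reproduced with a small sketch. The OTU names are again assumed stand-ins, and the use of str_detect with anchors for the spelling check is my own choice for illustration rather than something typed in the episode:

```r
library(stringr)

# "u?" makes the u optional, so both spellings match
spellings <- c("color", "colour")
both_match <- str_detect(spellings, "^colou?r$")

# Assumed example names; stand-ins for the vector shown in the episode
otus <- c("Otu0001", "Otu0010", "Otu0100", "Otu1000")

# "0?" matches at most one zero, so extra leading zeros are left behind
question <- str_replace(otus, pattern = "Otu0?", replacement = "OTU ")

print(both_match)
print(question)
```

The leftover zeros in the first two labels are the "funky" result: the pattern did what it was told, just not what was wanted.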
I might have 10,000 OTUs or I might have 10 OTUs, and this pattern would still work. So as we go through today's episode, try to remember these three quantifiers, the plus sign, the star, and the question mark, as we go about modifying our figure from representing genus-level data to OTU-level data.

All right, so we're going to go ahead and work with the code that we've been working with over the past episodes. Again, if you want to get a copy of this, I'd strongly encourage you to go to the link down below in the description, where there's a blog post associated with today's episode. You can get this starting chunk of code so you can work along with me as I modify it. You can then use that code to do your own experiments, and ultimately you can take the code that we work on together and apply it to your own data to really make it your own. I would say that if you can take this code and apply it to your own data to get the figure you want, that's perfect. That is exactly what I want you to be able to do, because not only will you have something that's useful to you, but you will also have demonstrated some level of mastery of the material, and that's what we want to see.

All right, so in this code we read in these libraries, we get the metadata, we get the OTU counts, we figure out our limit of detection, and we get our taxonomy information as well. Then we join this all together to generate our OTU relative abundance data. Further down in the code we see level equals genus, which means that we're filtering our data to only look at the genus rows. I'm going to want to change this to be OTU.
So we'll go ahead and change level to be OTU. This then joins all of our data together with the taxonomy data, and we are pooling so that we only show separately those taxa that have a maximum median relative abundance, across the three disease status groups, of greater than 1%. If a taxon has a median relative abundance lower than 1% in all three groups, we pool it with the others. To remind you of the three disease status groups: these data were collected from a study looking at people with and without C. difficile infections, and we're looking for biomarkers to indicate, you know, can we predict who has C. difficile? So we have people that are healthy, people with diarrhea who don't have C. diff, and people with diarrhea who do have C. diff. Okay, so we join all this together and then this builds our nice pretty plot. I'm going to change Schubert genus to Schubert OTU, and let's give this a run and see what we get. Great. We have a figure like we've been seeing, but instead of having the taxa names on the y-axis, we have our OTU names. We're going to use those regular expressions to see if we can't clean them up and make them look better.

So we'll start by modifying these labels to be OTU, space, whatever number. Remember what we did before. We'll come back up to our code where we ran the filter function, and running these two lines we see that we have sample ID, disease status, relative abundance, the level, and then the taxon name as Otu0001 or whatever. I think what I'll do is, after that filter line, add a mutate. I'm going to do taxon equals str_replace, where our string will be taxon, our pattern, which is what we practiced at the beginning, will be Otu0 star, and the replacement equals capital OTU and a space. Be sure we've got a pipe at the end there. And now if we look at it... it's not happy about something.
I forgot to close off the pipe. If you look at taxon_rel_abund, we now see that we've got our nice formatting of that taxon column. Let's go ahead and run everything else and see that our figure looks the way we want. Very good. We have our OTUs labeled OTU 3, OTU 2, and so forth, without that weird capitalization and those leading zeros. So this did exactly what I had hoped it would do.

One thing I would like to do, though, is back up here where we're pooling our data. We're looking for things that have a maximum median abundance greater than 1% to not be pooled. As we go to finer and finer taxonomic levels, the amount of the total data that we can represent individually with a 1% threshold is going to drop. So I'm going to reduce this to 0.5%, and we now see that we have a few more OTUs included. But the Other category is still probably around 50 or 60 percent of the data. That's kind of the breaks of what happens when you have a large number of features or OTUs: there are just so many ways to split the relative abundance data.

The next thing I want to worry about is that we've got our OTUs, but we don't have any taxonomic names to go with them. So I'd like to combine my taxonomy with my OTU information. I'm going to come way back up to the top here, and I might end up revising the code we just inserted so that we can combine the taxonomy information with the OTU. To remind you what taxonomy looks like: we have an OTU, and then the kingdom, phylum, class, order, family, genus, and so on, right?
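The mutate step that was added after the filter line, described above, can be sketched on a toy table. The data frame below is a hypothetical stand-in with made-up values; only the str_replace call mirrors what was described:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the long-format relative abundance table
toy_rel_abund <- tibble(
  sample_id = c("s1", "s1", "s2"),
  taxon     = c("Otu0001", "Otu0023", "Otu0100"),
  rel_abund = c(12.5, 0.4, 3.1)
)

# Rewrite the taxon column in place: upper-case OTU, a space, no leading zeros
cleaned <- toy_rel_abund %>%
  mutate(taxon = str_replace(taxon, pattern = "Otu0*", replacement = "OTU "))

print(cleaned)
```

Because the pattern is applied to the whole column at once, the same line works whether the table has three OTUs or several thousand.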
And we've got 5,445 OTUs represented. What I'd like to do is create a column that goes with all this for a pretty OTU name, one that has perhaps the genus name as well as the OTU. So in here I'll put a mutate to create a new column that I'll call pretty_otu, and this is going to be the code that I had down below. Yeah, this str_replace: I'll cut that out and move it up here, and do some more cannibalizing of the code. So we'll put that in here, and now if I look at this... what did I do wrong? Oh, I didn't want taxon, I wanted otu. If I look at taxonomy, I now have my otu column, all my taxonomic levels, as well as the pretty_otu column. So we're in good shape.

What I would like to have is my genus name and my pretty OTU merged together. Remember that down below, when we make the plot, the y-axis labels are taken from the taxon column, which we actually create further down. So I'm going to use mutate to create a taxon column up here, and I will use the glue function, which we saw a number of episodes ago. We can do glue, and then in quotes I'm going to put the genus column in curly braces, then a space, and then a pair of round parentheses. Inside of that,
I'm going to put pretty_otu in curly braces as well. I also need to make sure that I've loaded the glue package. Let's see what this all looks like if we go ahead and run our taxonomy data frame. I'm running mutate from within mutate, which is not right. Okay, we then get our otu, all our different taxonomic names, our pretty_otu, and then the taxon, which is kind of truncated. To clean up that output a little, I'm going to do a select with otu and taxon. That way I'll have the pretty taxon name, with the genus and the OTU label, associated with the original OTU names, so that when I do my joins with things like counts I'll be able to map those together. If I look at taxonomy, I now see I've got that otu and the cleaned-up name.

Now, one of the things I noticed right off the bat are those blasted underscores. I've got Enterobacteriaceae_unclassified. In a previous episode we did clean this up, but I want to go through it again because I'm spending a bit more time in this episode on regular expressions. I will add a mutate for my genus. Well, I don't need to run mutate and mutate again; for genus I'm going to do str_replace, and here our string is the genus column, and we'll need a pattern and a replacement. Our pattern, again, is _unclassified. One of the cool things we can do with regular expressions is match different parts of a string and save them to memory. I can save things by putting them in parentheses, so I'll do an open parenthesis, a period, a star, a close parenthesis, and then _unclassified. What the period means is match any character, and the star means match that zero or more times.
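To see what the parentheses actually save, it can help to inspect the match directly. The sketch below uses stringr's str_match, which is my choice for illustration and not a function used in the episode; the two genus strings are assumed examples:

```r
library(stringr)

# Assumed example values for the genus column
genus <- c("Enterobacteriaceae_unclassified", "Bacteroides")

# str_match returns a character matrix: column 1 is the full match,
# column 2 is whatever the first set of parentheses captured
captured <- str_match(genus, "(.*)_unclassified")

print(captured)
```

The second column holds Enterobacteriaceae for the first string, and the row for Bacteroides is NA because the pattern does not match it at all.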
And then we've got that _unclassified. What we're doing is saving the stuff before the _unclassified, so we can replace the match with Unclassified, then a space, and then backslash backslash one. The backslash backslash one means put in the stuff that was saved in that set of parentheses. Then we'll put a comma at the end of that, and now we see we have Unclassified Enterobacteriaceae and Unclassified Ruminococcaceae, and that's all good.

Looking at this, though, there's one more thing I'm worried about, and that's my italicization. I want the genus name to be italicized, but not the OTU 1. I want the Enterobacteriaceae to be italicized, but not the Unclassified. To fix this, I'm going to modify our genus mutate line a little. I'm going to do genus equals str_replace with string equals genus, and again we'll need a pattern and a replacement. What I'm going to do is take the genus name and wrap it in stars so that we can use ggtext to make it italicized. So again in parentheses I'll do period star, which matches the whole string, and the replacement is star, backslash backslash one, star. Then that will come into this next line, where the string will now start with a star and end with _unclassified and a star. So the pattern becomes star, open parenthesis, period, star, close parenthesis, _unclassified, star, where the first and last stars are the actual characters. Where this gets a little messy for patterns is that those stars are not being used as quantifiers. If I want to use the star as the actual character, I can put two backslashes in front of it. So backslash backslash star means match the actual star. Then in the replacement I can do Unclassified and put stars around the backslash backslash one. And now we see that we got it right: we have our genus name and OTU, but our genus name is in stars. And also, down here in
Unclassified Enterobacteriaceae, the Enterobacteriaceae is wrapped in stars, so that's going to be italicized. Now for our otu_rel_abund, let's run these two inner_join statements and see what we get. It looks like what we want, so we'll continue on with the pipeline. Here we get the relative abundance data, and then we have sample ID, disease status, OTU, count, taxon, and relative abundance. We can probably get rid of the count column like we had here. I don't need the pivot_longer, because I'm already at the taxonomic level I want, and I don't need to filter it further in the next step. So now if I look at otu_rel_abund, great, we have all the columns we were expecting. I'm not totally sure I need this otu column, but I'm going to leave it there just in case, because you never know what might happen. And if we look at taxon_rel_abund, this is where we did the filtering. I don't need that, so I can comment it out for now. One thing I notice we do for relative abundance is multiply it by 100 to get it into percent, so I'm going to put that 100 back up here where I calculate the relative abundance. Now I've got otu_rel_abund, which is exactly what I wanted.

Here, instead of taxon_rel_abund, I'll use otu_rel_abund; this is where we figure out which OTUs to pool. And here for the inner_join we still have taxon_rel_abund, so we want otu_rel_abund. Now we're getting a complaint: problem with mutate input taxon, false must have class character, not class glue/character. So where was I doing something up here? Up here where I'm pooling things, if pool is true, then the row gets the name Other; otherwise it gets the name in taxon. But taxon is of type glue, not character. So what I can do is wrap taxon in as.character, and that way, when it gets to these two lines, the output taxon will be of type character.
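Putting the pieces described so far together, a minimal sketch of the label-building steps might look like the following. The two-row taxonomy table and the pool flags are hypothetical stand-ins; the patterns, the glue template, and the as.character wrapper follow what was described above:

```r
library(dplyr)
library(stringr)
library(glue)

# Hypothetical two-row stand-in for the taxonomy table
toy_taxonomy <- tibble(
  otu   = c("Otu0001", "Otu0023"),
  genus = c("Bacteroides", "Enterobacteriaceae_unclassified")
)

labels <- toy_taxonomy %>%
  mutate(
    # Wrap the whole name in literal stars (markdown italics)
    genus = str_replace(genus, "(.*)", "*\\1*"),
    # \\* matches a literal star; (.*) saves the name before _unclassified
    genus = str_replace(genus, "\\*(.*)_unclassified\\*", "Unclassified *\\1*"),
    pretty_otu = str_replace(otu, "Otu0*", "OTU "),
    # glue() returns an object of class glue, not a plain character vector
    taxon = glue("{genus} ({pretty_otu})")
  ) %>%
  select(otu, taxon)

# Made-up pooling flags; as.character() strips the glue class before if_else()
pooled <- labels %>%
  mutate(pool  = c(FALSE, TRUE),
         taxon = if_else(pool, "Other", as.character(taxon)))

print(as.character(labels$taxon))
print(pooled$taxon)
```

Converting with as.character at the point of use keeps the glue call itself unchanged; wrapping the glue call in as.character when the column is created would be an equally reasonable alternative.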
So this looks good. We've got our genus name, or the family name, whatever we're best able to classify to, along with the OTU designation. Again, we got that by using str_replace as well as the glue package to do all the formatting and make it look pretty.

One thing I'm not totally a fan of is having the Other category in the middle. It also doesn't seem like there's any great ordering to the data here. So what I'm going to do is order it by the maximum relative abundance of that OTU in any of the three disease status groups. To fix the order, let's come back up here. I did have a factor reorder, fct_reorder, ordering by the median in a non-descending sort. I now see I have median up here and I'm using the minimum; I think I'd rather have the max of the median. Let me look at where I'm defining median up here: I'm taking the median of the medians. So here I'll go ahead and put the max of the medians. And now we see our OTUs are ordered by the maximum median relative abundance across our three disease status groups.

For those of you that haven't been watching, I realized just now that I haven't told you what we're looking at here. The ball indicates the median across all subjects in the study for that disease status group. You can then see we have a nice curve across our disease status groups, descending in terms of the median relative abundance for whichever group is largest, and then Other gets positioned at the bottom. So we're in good shape there. I like the ordering here. Makes me happy.
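The ordering idea, taking the median within each disease status group and then the maximum of those medians per taxon, can be sketched as follows. This is a simplified illustration with made-up numbers and hypothetical taxon names, and it leaves out the pooling into Other that the real pipeline does:

```r
library(dplyr)
library(forcats)

# Made-up relative abundances (percent) for three hypothetical taxa
toy <- tibble(
  taxon        = rep(c("A", "B", "C"), each = 4),
  disease_stat = rep(rep(c("healthy", "case"), each = 2), times = 3),
  rel_abund    = c(1, 3, 10, 12,   # A: group medians 2 and 11
                   5, 7, 4, 6,     # B: group medians 6 and 5
                   20, 30, 0, 2)   # C: group medians 25 and 1
)

# Median per group and taxon, then the largest of those medians per taxon
taxon_order <- toy %>%
  group_by(disease_stat, taxon) %>%
  summarize(median = median(rel_abund), .groups = "drop") %>%
  group_by(taxon) %>%
  summarize(max_median = max(median), .groups = "drop")

# Reorder the factor levels by that value, smallest first
ordered <- toy %>%
  inner_join(taxon_order, by = "taxon") %>%
  mutate(taxon = fct_reorder(factor(taxon), max_median, .desc = FALSE))

print(levels(ordered$taxon))
```

With the smallest value first, the first level ends up at the bottom of a y-axis, which is where a low-valued category would land.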
One thing I'm not totally a fan of is that some of these names get rather long. What I might like to do is put a break between the genus name and the OTU label. To do that, if we come back up to where we had our regular expression, right here in my glue statement I could put in a <br>. The br in angle brackets, less than and greater than, tells ggtext to break the line. Down below here in our theme we had axis.text.y equals element_markdown, and that will go ahead and impose markdown or HTML formatting on our text. So it's really slick to know just a little bit of HTML so that you can get the right look for your figure. That <br> will put a line break between the taxonomic name and the OTU, and that looks pretty good.

One thing I'm not totally a fan of is that this Unclassified Ruminococcaceae gets really long. I'm of multiple minds about this. I'm not totally sold that I need to say Unclassified Ruminococcaceae. I think if you're talking to microbiologists that study the gut microbiome, they know that Ruminococcaceae is not a genus name, that it's a family name, and so the Unclassified isn't totally necessary. But at the same time, I also appreciate that I know a lot, right? Maybe not everybody knows that. So maybe we'll leave it. But so that I don't have such a long label for that one, I'll go ahead and put in another line break between Unclassified and Ruminococcaceae. Again, that was back up here where we were modifying the code, so in this replacement I can do Unclassified, a break, and then the name of the genus or the family or whatever the deepest classification was. And that looks a little tidier there on the left side with that y-axis label. I could see this strategy of having multiple lines per label becoming a little unwieldy if we had more taxa than what we have here.

So again, the challenge in this episode was how to make an attractive plot at the
OTU level. As we go to finer and finer taxonomic levels, the relative abundances get smaller and smaller; we saw that by going down to that half-percent relative abundance threshold. You could probably go even smaller if you wanted. But the challenge of looking at OTU data was that we have both OTU information, like OTU 1, as well as a taxonomic name. We need the taxonomic name because the OTU number doesn't really mean anything between studies. OTU 1 in my study is this bug I've never heard of before, and in your study it might be Bacillus. The other reason it matters to have both pieces of information is that, as we see here, OTUs 2 and 5 are both Bacteroides. If I talk about OTUs 2 and 5 across my study, they might behave differently. So it'd be nice to know that OTUs 2 and 5 are both Bacteroides, but that they're perhaps different entities, and their frequency, abundance, and distribution might vary across the study. I might want to talk about those OTUs separately as we go through the study. In this case, it does appear that OTUs 2 and 5 have about the same relationship to each other, which, I don't know, causes me to do a little bit of head scratching. But anyway, that's the value of being able to show both the taxonomy information and the OTU information. And if you're doing something like amplicon sequence variants, it'd be the same idea, except instead of OTU 1 you might have ASV 1, whatever you want to do there. The formatting and the idea of mixing different types of text together is the same as what we've done in this episode. So again, don't settle for underscores in your figure labels. Don't settle for upright text when it's supposed to be italicized. It's just not necessary, right?
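As a compact recap of the label formatting from this episode, the sketch below builds the multi-line markdown labels as plain strings. The genus names and OTU numbers are assumed examples, and no plot is drawn; as described earlier, rendering the stars and the <br> tags relies on the theme setting axis.text.y to element_markdown from ggtext:

```r
library(stringr)
library(glue)

# Assumed example inputs
genus      <- c("Bacteroides", "Ruminococcaceae_unclassified")
pretty_otu <- c("OTU 2", "OTU 7")

# Italicize the name, then move "Unclassified" onto its own line
genus_md <- str_replace(genus, "(.*)", "*\\1*")
genus_md <- str_replace(genus_md, "\\*(.*)_unclassified\\*", "Unclassified<br>*\\1*")

# A <br> between the name and the OTU label gives one more line break
taxon <- as.character(glue("{genus_md}<br>({pretty_otu})"))

print(taxon)
```

No underscores survive, only the taxonomic name sits inside the stars, and each piece of the label lands on its own line once the text is rendered as markdown.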
So hopefully what you get out of this episode is that when you make your next relative abundance plot, you don't feel the need to keep those underscores in there, or to keep things in, what is it, roman or upright typeface, rather than using italicization. Anyway, I really hope you dig into this. Try to apply it to your own code to make your own figures more attractive and more presentable, and give them just that little bit more polish. Let me know how you fare down below in the comments. Keep practicing, and be sure that you've subscribed, you've liked, and you've told everyone you know about Code Club. It's really been awesome to see the growth of interest in the channel, more subscriptions and views and everything, and I'm just over the moon and really happy with people's positive reception. So keep practicing, and we'll see you next time for another episode of Code Club.