Hey folks, in this episode of Code Club, I am going to introduce you to the concept of a regular expression and how we can implement these using base R as well as functions from the stringr package, which is part of the tidyverse. Now, what is a regular expression, I hear you asking? Well, if you've ever been working in Microsoft Word and had to do a find and replace, you know something about regular expressions. Odds are good, though, that when you did this in Microsoft Word, you had a very specific word that you wanted to find and something very specific you wanted to replace it with. Well, I'm here to tell you that regular expressions can be so much more complicated and powerful than that.

If you've been following along in recent episodes of Code Club, you know that I've been on a kick of reading in a distance matrix using exclusively functions from base R. In today's episode, I'm going to show you how we can use functions from base R, as well as from the tidyverse package stringr, to work with regular expressions.

So the sample names in this distance matrix include a variety of pieces of information. They include the sex of the animal, an identifier for the animal, as well as the number of days post weaning when the sample was taken from that animal. All that information is in one name. And this really underscores, I think, the value in giving your samples meaningful, easy-to-parse names. What that means is that it's easy to design regular expressions to pull information out of a name. If your names are kind of random or just, you know, garbage names, then it's not going to be easy to parse them to extract that meaningful information. So we're going to do that in today's episode as a way to gently work our way into learning more about regular expressions.

Here I am in RStudio. I've got my analysis.R script open, which is inside of my code directory of the overall distances project.
If that doesn't make sense to you, check out the link down below in the description to help you get caught up. I've also put a video up here in a link that shows you exactly how you can get caught up to where we are for today's episode.

So I have code to go ahead and read in this mice Bray-Curtis .dist file that was generated in the mothur software package. If I go ahead and run all this, it will read in that distance matrix using all that great tooling that we've been developing over recent episodes, and it will create a table. So now if I look at this tbl, I see that it is a 348 by 349 table. That extra column is this samples column, which again has the identifier for all of my samples. And you can pick a random sample here like F3D13. This is female three on day 13 post weaning. And if you looked further down, you'd see that there are some other samples that start with an M. And those are the male samples.

What I would like to do is to create a separate data frame that has a column for samples, as well as the sex, the animal ID, and the day post weaning. And that way, if I have a separate data frame, then I can always do a join back to this distance matrix to bring in that metadata. Okay, so we'll get started by taking tbl. And we'll then pipe that to a select, where we will select on samples. And so if we look at this, we again see that we get a one-column data frame.

Now, it's not so easy to see how I might pull out the F or the three or the day of this animal, right? Regular expressions are something that many people have rarely seen before. And so it's really useful to use the cheat sheet that RStudio provides for working with the stringr package. To get to that cheat sheet, the best way is to go up to Help, Cheat Sheets, and then Browse Cheat Sheets. RStudio has all sorts of just wonderful cheat sheets that I don't think enough people know about.
And if you go down a little bit, you'll see a variety of different cheat sheets here for a lot of great packages. And the one we're most interested in is the one for the stringr package. So if you go ahead and download that, it's a two-page cheat sheet, where the first page tells you about the different functions that come with the stringr package. And the second page I find to be really useful, because it helps me think through how to design a pattern for a regular expression. And so today, we're going to talk about matching characters explicitly, we're going to talk about matching meta characters, and we're going to talk a little bit about quantifiers. Again, there's a lot in this one page of the cheat sheet. And so we'll come back to this a few times over subsequent episodes.

So I'm going to create a new column. And to create a new column, we can always do mutate. And let's go ahead and have a column that we'll call test. I want to play around with the different functions from stringr and base R, and with the regular expressions, to help us learn how to use regular expressions before we go in and create the columns that we actually want as part of this lookup table.

So let's go ahead and we'll do mutate to create test. And we can then do sub. So sub comes to us from base R. And it allows us to substitute one thing for another. And so you can see in this pop-up window, you give it the pattern, the replacement, and then the string that you want to do that substitution on. So I could do a substitution on F. So I'm going to look for an F. And then I will replace that with, let's say, female. Right. And so that's the replacement. And then our x will be samples, right. And so again, I could be explicit in putting in the argument names. So I could say pattern equals "F", replacement equals "female", and then x equals samples. So we've run that. And what you'll see is that everywhere that an F showed up, we've replaced that with female.
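As a minimal sketch of the sub() call just described, here it is run on a small hand-typed vector. The vector is only a stand-in for the samples column of the distance table (an assumption for illustration; in the episode the call sits inside mutate()).

```r
# A few sample names typed in by hand as a stand-in for tbl$samples
samples <- c("F3D1", "F3D13", "F3D141", "M6D5", "M6D65")

# Base R: pattern, then replacement, then the strings to search (x)
test <- sub(pattern = "F", replacement = "female", x = samples)

print(test)
# The names starting with F now start with "female";
# the names starting with M contain no F and come back unchanged.
```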
If I go ahead and look at the end of this by using tail, where I think the male samples are, we see that we have Ms that have not been changed to female, right? Because, well, we didn't find an F. And so we couldn't replace that F with female.

Okay, so I'm going to comment this out. And I'm going to show you the stringr way to do this. And that would be, again, to do mutate, and we could do test equals str_replace. And again, the syntax is going to be very similar. Right. And so we can go ahead and take the pattern, the replacement, the x. And let me go ahead, and we don't need that tail. So it's complaining because there's a slightly different argument for str_replace versus sub. And that is that x isn't the argument name; the argument name is actually string. So now we run that, and that all works.

So the other difference to note about sub versus str_replace is that the order of the arguments is a little bit different. If I look at the help page for str_replace, I see it goes string, pattern, replacement, right? And here I've got string at the end, which is the position where sub takes x. So it's a subtle difference between these two functions. But that's really the main difference between using sub and str_replace. So again, to show you what this would look like without the argument names, I could remove pattern and replacement. And I could put samples up front. To me, it seems more intuitive to put the string you're modifying first, then the pattern, then the replacement. Perhaps I've just used str_replace for so long that it just seems natural to me, right? So again, that works without those argument names. And this is usually how I might write this function.

So what's the difference between str_replace and str_replace_all, or sub versus gsub? So sub and str_replace will find a pattern once and replace it once.
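To make the argument-order difference between sub() and str_replace() concrete, here is a small sketch. It assumes the stringr package is installed, and the two-element vector is a hand-typed stand-in for the samples column.

```r
library(stringr)

# Hand-typed stand-in for the samples column
samples <- c("F3D13", "M6D5")

# Base R: the string to search is the third argument and is named x
base_way <- sub(pattern = "F", replacement = "female", x = samples)

# stringr: the same argument is named string and comes first
stringr_named <- str_replace(string = samples, pattern = "F", replacement = "female")

# Without argument names: string, then pattern, then replacement
stringr_positional <- str_replace(samples, "F", "female")

print(base_way)
print(stringr_positional)
```

All three calls should give the same result; only the argument name and position differ.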
If that pattern is repeated two or more times, it's only going to replace the first time that it finds it. In contrast, gsub or str_replace_all will replace all instances of it. Again, the g in gsub stands for global substitution. I feel like str_replace_all is just basically a better name, right? So that's, again, one of the reasons that I prefer to use the stringr version of the function rather than the base R one. I realize that's perhaps a superficial difference. But again, when you're trying to remember the names of these functions, it helps to have function names that are a bit more intuitive and logical in the way the text looks as you're typing them out.

So to illustrate this, if I were to perhaps look for "1" and replace that with "one", spelled out, I now see that I have F3Done, F3Done41. So it didn't get that second one. Of course, this is silly. But we could do str_replace_all. And so now what we find is that in the 141, both of those ones are replaced with spelled-out ones, right? So again, that's the difference between str_replace_all and str_replace: str_replace only replaces the first instance, str_replace_all replaces every instance. And the same is true for sub versus gsub.

Again, this shows us how we can match a specific character or set of characters in the string that we're looking at. If I go back to str_replace, let's do F3. I'm going to replace that with female three. Now I've replaced a longer chunk of the samples value, F3, with female three, right? So I can give as my pattern a specific string, and it will then match it. The problem, of course, as I showed earlier, is that if I look at the male samples, or if I look at another female mouse, it's not going to make that replacement. But I might want it to say male six. And I might want R to be able to figure out what it should be without me having to repeat this line of code a dozen or more times.
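The first-match versus every-match behavior described above can be sketched as follows, assuming stringr is installed. The two sample names are typed in by hand for illustration.

```r
library(stringr)

# Two hand-typed names: one with a single "1", one with two
samples <- c("F3D1", "F3D141")

# str_replace() and sub() change only the first match in each string
first_only <- str_replace(samples, "1", "one")
first_only_base <- sub("1", "one", samples)

# str_replace_all() and gsub() change every match in each string
every_match <- str_replace_all(samples, "1", "one")
every_match_base <- gsub("1", "one", samples)

print(first_only)   # "F3Done" "F3Done41"
print(every_match)  # "F3Done" "F3Done4one"
```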
So again, this is where the power of regular expressions really comes in handy, because I don't have to be so specific to get my regular expressions to work. To generalize the regular expression a little bit more, I can use a set of what are called meta characters. And so a meta character is a character that stands for another character. Very meta. So what we could do is, instead of matching, say, F, I could go ahead and do backslash backslash w, written "\\w". So "\\w" will match a word character. When you see that w, think word character. So we can then replace that with, let's go ahead and put in an underscore for now. So again, if we run these lines, we now see that, because we're using str_replace, it replaces the first character with an underscore. And if I had used str_replace_all and rerun this code, I now see that it replaces all of the characters, because it's counting the 3, the D, and the 13 as word characters too. So I'll go ahead and remove this all.

And another meta character that I will frequently use is the d, written "\\d". And so the d is short for digit. So if you see w, think word; d, think digit. And so now what you see is that the first digit (I'm looking at the end of the data frame now, because I've got that tail), that six, is being replaced with an underscore. And again, if I did str_replace_all, it would replace all of the numbers with that underscore. So the w and the d are both really powerful. Another powerful meta character is the lowercase s, written "\\s", which is short for space. So think word, digit, space. We don't have any spaces in our names here, so I'm not going to be able to illustrate that.

Those are all using the lowercase character to create the meta character, right, the w, d, and s. If I make it an uppercase character, it matches the opposite. So if I do uppercase D, that's going to match anything that's not a digit. If I use an uppercase W, that's going to match anything that's not a word character, right? So let's go ahead and run this.
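Here is a small sketch of the meta characters just described, assuming stringr is installed. The sample names are hand-typed, and the name containing a space is made up purely to show "\\s", since the real names have no spaces.

```r
library(stringr)

samples <- c("F3D13", "M6D65")

# "\\w" matches a word character (a letter, a digit, or an underscore)
word_first <- str_replace(samples, "\\w", "_")
word_all <- str_replace_all(samples, "\\w", "_")

# "\\d" matches a digit
digit_first <- str_replace(samples, "\\d", "_")
digit_all <- str_replace_all(samples, "\\d", "_")

# Uppercase "\\D" matches anything that is NOT a digit
nondigit_first <- str_replace(samples, "\\D", "_")
nondigit_all <- str_replace_all(samples, "\\D", "_")

# "\\s" matches whitespace; "F3 D13" is a made-up name with a space in it
space_demo <- str_replace("F3 D13", "\\s", "_")

print(word_first)    # "_3D13" "_6D65"
print(digit_all)     # "F_D__" "M_D__"
print(nondigit_all)  # "_3_13" "_6_65"
```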
And we find that the M, which is the first non-digit character, gets turned into an underscore. If I did str_replace_all, then it should replace the M and the D with underscores. Again, these examples are somewhat silly, to help illustrate the different meta characters that are available to us within R when we're using str_replace and str_replace_all.

So we could make a more complicated pattern where we could use D, as in the M6D5, right, that D for the day, not the meta character. And then I could add on the meta character, right? So I could do "D\\d". And I could then say, let's replace that with an underscore. Before we run this, make a prediction in your head: what is going to happen, right? So you should be thinking it's going to match a D and a digit. And it's going to replace both of those with a single underscore. So sure enough, it replaced the D and the five with an underscore. But what you'll notice is that D65 gets replaced with underscore five, right? So how do we get that five? Well, if I go ahead and put in another "\\d", I took care of that D65. But now D5 isn't matched anymore, right? And there are others in here that are perhaps from, like, day 150 that it wouldn't fully match. So we could, again, repeat this line three times to get, you know, single-, double-, and triple-digit numbers. But that's not what we want to do.

So what we can do instead is use something called a quantifier. Now, there's a variety of different quantifiers that you'll find on that cheat sheet. But the one I want to introduce to you is the star. So a star will match the preceding character, and again, that could be a meta character or an actual literal character, zero or more times. So "D\\d*" should match the character D followed by any number of digits. We now see that we've gotten rid of that D5, D6, D65. And if we looked for it, we would also see that for D150. So again, that's really powerful.
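The progression from a fixed number of digits to the star quantifier can be sketched like this, assuming stringr is installed. The names are typed in by hand, and M6D150 is a hypothetical three-digit day added only to show the triple-digit case.

```r
library(stringr)

# Hand-typed names with one-, two-, and three-digit days (M6D150 is hypothetical)
samples <- c("M6D5", "M6D65", "M6D150")

# D plus exactly one digit: extra digits are left behind
one_digit <- str_replace(samples, "D\\d", "_")

# D plus exactly two digits: the single-digit day no longer matches at all
two_digits <- str_replace(samples, "D\\d\\d", "_")

# D plus zero or more digits: every case is handled by one pattern
any_digits <- str_replace(samples, "D\\d*", "_")

print(one_digit)   # "M6_"  "M6_5" "M6_50"
print(two_digits)  # "M6D5" "M6_"  "M6_0"
print(any_digits)  # "M6_"  "M6_"  "M6_"
```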
Now, let's go ahead and remove that underscore and give str_replace, as the replacement, an empty set of double quotes. What do you think is going to happen here? That's right. Instead of an underscore after it, we're going to just have M6. And so we might think of this as the animal identifier, right. And so very quickly, I hope you can see and appreciate that we can use these regular expressions to extract information from our sample names. And again, if I remove the tail, so we look back up at the female samples, we again see that we have our animal identifier, right, F3. So I'm going to go ahead and call this animal.

I also want to get other information out of this. I want to know the animal's sex, so we could do sex. And let's do str_replace. And again, we'll give it samples. And we're going to need to come up with a pattern and a replacement. Something that we'll want to think about is, how do we now design a pattern to only keep that first character of the sample's name? Now, with all regular expressions, there's a variety of ways to do it. And the most important way to do it is the way that works. So the approach that I will show you will be very similar to what we did to define the animal. If you want, go ahead and pause the video and see if you can come up with the pattern yourself, as well as the replacement value.

All right, hopefully you gave that a shot. And so what I would do would be to do "\\d*". So I want to be able to match that three in F3. And then I want to match the capital D, right, which is the character itself. And then I'll match "\\d*" again, right? So what "\\d*D\\d*" should do is match the three, the D, and then any number of digits after that D. And it's going to replace all that with nothing. So I should get an F or an M, depending on whether the mouse was female or male. And sure enough, I now get that sex column as F for the females. And if I look at the end, I will then see that the males have the M as well.
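The two patterns described above, one for the animal identifier and one for the sex, can be sketched on a hand-typed vector like this, assuming stringr is installed. In the episode these calls sit inside mutate(); a plain vector is used here only to keep the example short.

```r
library(stringr)

# Hand-typed stand-in for the samples column
samples <- c("F3D13", "M6D65")

# Drop the D and the day that follows it, leaving the animal identifier
animal <- str_replace(samples, "D\\d*", "")

# Drop the animal number, the D, and the day, leaving only the first letter
sex <- str_replace(samples, "\\d*D\\d*", "")

print(animal)  # "F3" "M6"
print(sex)     # "F"  "M"
```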
And so we're making progress, right? I could come back later and perhaps use a recode function to change the F to female and the M to male. But for now, I'm pretty happy with the way this looks.

So the last bit of information I want to extract from the samples column is the days post weaning. So again, I'll go ahead and add another column and say day equals str_replace and give it samples. And again, we're going to need a pattern and a replacement. Spoiler alert: we'll leave the replacement empty. Again, I would encourage you to pause the video and see if you can write the regular expression to extract the day from the samples column.

Okay, so hopefully you had a chance to do that. And so now what we could do is "\\w\\d". So the "\\w" will match a word character, like the M or the F. The "\\d" will match the animal number. I actually don't remember how many mice I had that were males or females, so I'll go ahead and put a star in there. And then I'll go ahead and put in a D. So "\\w\\d*D" would match, in this case, say, the M6D and should leave behind the 65. So sure enough, what happened was that it matched, in this case, the F3 and the D, replaced the F3 and D with nothing, and returned a 125 to indicate 125 days post weaning.

Again, there's a variety of other ways that you could build these regular expressions to extract these different pieces of information. If you know those, awesome; go ahead and try them in place of the code that I've inserted here. I'm going to go ahead and clean this up to remove these two lines. The other thing that I'll point out is that my day column is of type character. Again, that's really nice and easy to see because we're doing this as a tibble. And so what I could do is convert this to a numeric. And so I could wrap as.numeric as a function around my str_replace to get the days.
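Putting the three columns together, a sketch of the whole lookup table might look like the following. It assumes dplyr and stringr are installed, and it builds a tiny hand-typed tibble as a stand-in for the real distance table tbl, which is not reproduced here.

```r
library(dplyr)
library(stringr)

# Tiny stand-in for the distance table; only the samples column matters here
tbl <- tibble(samples = c("F3D13", "F3D125", "M6D5", "M6D65"))

sample_lookup <- tbl %>%
  select(samples) %>%
  mutate(
    # animal identifier: remove the D and the day
    animal = str_replace(samples, "D\\d*", ""),
    # sex: remove the animal number, the D, and the day
    sex = str_replace(samples, "\\d*D\\d*", ""),
    # day: remove the leading letter, the animal number, and the D,
    # then convert the remaining text to a number
    day = as.numeric(str_replace(samples, "\\w\\d*D", ""))
  )

print(sample_lookup)
```

Wrapping str_replace() in as.numeric() is what changes the day column from character to double.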
So if I go ahead and rerun that, I now see that my day is of type double, and we're in good shape. So I'm going to now call this sample_lookup, and we'll save that. In the next episode, we'll come back and do a little bit more work on this code as we march through looking at different ways of analyzing and working with distance matrices using base R as well as the tidyverse. Keep practicing with this, and we'll see you next time for another episode of Code Club.