Hey folks, in the last episode I gently introduced the concept of regular expressions in R and showed how we could use functions like sub, gsub, str_replace, or str_replace_all to identify a pattern and replace it with another string. Today we're going to go a little bit deeper. The regular expressions I made in the last episode had the goal of pulling different pieces of information out of a column of strings that contained an animal's sex, its identifier number, and the day post weaning on which the sample was obtained. Make sense? Anyway, there was a lot packed into that sample identifier.
If we look back at the code that we generated in that episode, you'll notice that we talked about matching specific characters, like the D that we wanted to match in the sample name, and about metacharacters like \\d, \\w, or \\s to match digit, word, and whitespace characters, respectively. We also introduced the concept of a quantifier, where the star character means to match the preceding character zero or more times. Now, one of the challenges with this code is that it's not very specific. Don't get me wrong, it works just fine for the samples column we're working with. But by making the patterns more specific, we lessen the risk of accidentally matching something that we didn't really mean to match.
Let's go ahead and rerun our code. If we look at this table, we see the first column is samples, followed by the rest of the distance matrix. What I'm trying to do is create a sample lookup table that has the samples identifier as well as columns for the animal's sex, its identifier number, and the days post weaning, so that later on, when I've done some analysis with the distance matrix, I can join these little pieces of metadata back in.
So we started this by looking at select(samples). I'm going to copy this down below so that I can show you how we can make this pattern search a little bit more specific. Let's bring down this first pattern match. Say that instead of giving my animals numbers, I'd given them letters, so I had FA, FB, FC, FD. Can you see a problem we might run into here? In a sample from animal FD, something like FDD11, the first D would match the pattern that we currently have, because that D followed by a digit repeated zero times is still a match. So we can remove that star and insert a plus instead. That plus will match the preceding digit one or more times. Alternatively, a question mark would match the preceding character zero or one times. That's clearly not what we want here, so we'll use the plus sign. Again, this is on the second page of the stringr cheat sheet, down in the lower right corner, where it shows the different quantifiers. I find the plus sign really valuable because often I want to match something but I'm just not sure how many times, and the plus quantifier lets me require at least one match. And again, this makes our search a little bit more specific.
Another thing we could do to make our search more specific is to use anchors. An anchor allows us to say where in the string we want the pattern to match. We can use a dollar sign to make sure that our pattern matches at the end of the string. Nothing is going to change here, because it's matching the D and then one or more digits at the end of the string, like the D11 in F3D11. So that works really well for us. Let's see if we can come up with a more specific search pattern to extract the sex of our animals. If we run this the way we had it before, we see that our sex is F, as we expected.
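Here is a rough sketch of how the star, plus, question mark, and dollar sign behave differently. It assumes the stringr package is installed. The sample names are made up for illustration, and FDD11 in particular is a hypothetical letter-named animal, not something from the real data.

```r
library(stringr)

# Hypothetical sample names: two in the F3D11 style, one with a letter ID
samples <- c("F3D11", "M12D141", "FDD11")

# Star allows zero digits, so the first D in "FDD11" already satisfies the pattern
star_result <- str_replace(samples, "D\\d*", "")
print(star_result)      # "F3"  "M12" "FD11"

# Plus requires at least one digit after the D
plus_result <- str_replace(samples, "D\\d+", "")
print(plus_result)      # "F3"  "M12" "FD"

# Question mark allows at most one digit, so part of the day is left behind
question_result <- str_replace(samples, "D\\d?", "")
print(question_result)  # "F31"   "M1241" "FD11"

# Dollar sign also ties the match to the end of the string
anchored_result <- str_replace(samples, "D\\d+$", "")
print(anchored_result)  # "F3"  "M12" "FD"
```

Only the plus versions give the animal name for all three strings, which is the point of making the quantifier more specific.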
But perhaps this again isn't as specific as we might like. For now, I'm going to remove the search pattern. The dollar sign matches the end of the string, but to get our sex we want to match the front of the string, because that first character is the sex. To match at the beginning, I can put a caret. A caret anchors the pattern to the beginning of the string. We could then do \\w to match a word character, which would match the F or the M. And then we could do another \\w*, or \\w+, to match every other character in that string.
Now the problem is that I don't yet have a way of capturing the character that's matched by that first \\w. What we can do is wrap that \\w in parentheses. What the parentheses do is say, hey, save this bit of information. And as the replacement pattern, we can then do \\1. So at the very beginning of the string, because we've got that caret, it will match a character. Then it's going to match all the other characters in the samples column, and it will replace the entire value with the first character. Let's give this a shot and make sure it works. Sure enough, it did. So again, we're matching that first character and saving it by putting it in those parentheses, we're matching everything else without saving it, and we're outputting the saved character with that \\1. An alternative to that \\w might be a period. A period is another metacharacter that will match any character. Again, we run that, and that works really well.
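A minimal sketch of that capture-and-replace idea, assuming stringr is installed and using a few made-up sample names in the same style as F3D11:

```r
library(stringr)

# Made-up sample names for illustration
samples <- c("F3D0", "F3D11", "M6D141")

# ^ anchors at the start, (\\w) saves the first character,
# \\w+ consumes the rest without saving it, and \\1 outputs what was saved
sex_word <- str_replace(samples, "^(\\w)\\w+", "\\1")
print(sex_word)  # "F" "F" "M"

# Same idea with a period, which matches any character
sex_dot <- str_replace(samples, "^(.)\\w+", "\\1")
print(sex_dot)   # "F" "F" "M"
```

Because the pattern covers the whole string, the replacement swaps the entire value for just the saved first character.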
So let's say that for some reason I had a sample in here that didn't start with an F or an M. It came from some other source and didn't have the sex information embedded in the sample name. Well, this would still give me a value in the sex column, even though it might not make sense. So I might want to be more specific about the character that it's matching at the very front of that string. To do that, I can make my own metacharacter using square brackets. Inside those square brackets, you give the characters that you would like it to match. What this is saying is that the first character needs to match an F or an M. It's not matching FM, it's matching F or M, and you get the F or M by putting them in those square brackets. And so again, we see that we get that sex. Just for good measure, if we look at the tail, we also see that we get M as the sex for those male mice.
As an example of what else we can do with these parentheses, I might wrap the second part of the search string in parentheses as well. I now have two things being saved: the first character and the rest of the string. So what I could do is \\2-\\1 as the replacement. Now I see that I have basically flipped things around. I could even get rid of that hyphen and totally jumble up the string that I had in my samples column. Of course, that's not what I want to do for the sex.
All right, let's now look at the day and see if we can't make that a bit more specific as well. We'll go ahead and grab that. Looking at this, we saw that we could extract the day. Remember, we used str_replace to remove the front part. Well, let's now try to match the full string.
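To make the square-bracket idea concrete, here is a small sketch, again assuming stringr is installed. The sample names are made up, and X9D2 is a hypothetical sample that does not start with F or M.

```r
library(stringr)

# Made-up sample names; "X9D2" stands in for a sample without sex information
samples <- c("F3D11", "M6D141", "X9D2")

# \\w accepts any word character, so the odd sample gets a "sex" of X
loose_sex <- str_replace(samples, "^(\\w)\\w+", "\\1")
print(loose_sex)   # "F" "M" "X"

# [FM] accepts only F or M; a string that does not match is returned unchanged
strict_sex <- str_replace(samples, "^([FM])\\w+", "\\1")
print(strict_sex)  # "F"    "M"    "X9D2"

# Two saved pieces can be output in any order
flipped <- str_replace(samples, "^([FM])(\\w+)", "\\2-\\1")
print(flipped)     # "3D11-F"  "6D141-M" "X9D2"
```

With the stricter pattern, the unexpected sample stands out as an unchanged string instead of quietly producing a value that looks like a sex.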
So we'll make sure that our pattern starts at the first character of the string, that we then have the digits of the mouse number, and then the D. Then in parentheses I'll do \\d+, and that will give me the day. I can then output \\1 to get the day out. Again, I'm being a bit more specific. For good measure, I should also put that dollar sign in to make sure I'm matching all the way to the end. A quantifier like the star or the plus is what's called greedy, so it should match all of the digits that follow. We shouldn't be worried that it might match only two of three digits, say. But that dollar sign makes sure that we're matching our pattern at the end of the string, and the caret at the beginning of the string. And we can of course see that this works.
The final thing that I'd like to do with you is to revise this code so that we can use a special function called separate. separate is a handy function that takes a column of strings and splits it into different columns, provided that column has a delimiter in it. So what I want to do is create a new column that has a delimiter separating the sex from the animal from the day, and then use separate to split it into separate columns. Let me show you what I mean. I'll copy this on down. Again, we've got our first column with our samples, and we'll then do mutate. I'll call the new column delimited, and I'll run str_replace on samples. For my pattern, in quotes, I want to match at the beginning of the string an F or M, then a digit one or more times, followed by a D for the day, followed by \\d+ to match the day post weaning.
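As a quick sketch of the full-string match for the day, assuming stringr is installed: the exact pattern typed in the episode isn't shown in the transcript, so the one below is my guess at an equivalent, and the sample names are made up.

```r
library(stringr)

# Made-up sample names for illustration
samples <- c("F3D0", "F3D11", "M6D141")

# Match the whole string from ^ to $, saving only the digits after the D
day <- str_replace(samples, "^\\w\\d+D(\\d+)$", "\\1")
print(day)         # "0"   "11"  "141"

# Even without the $, the greedy \\d+ grabs every digit it can
greedy_day <- str_match("M6D141", "^\\w\\d+D(\\d+)")[, 2]
print(greedy_day)  # "141"
```

The values are still character strings at this point, not numbers.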
Then I'm going to put in the dollar sign to match at the end, and what I need next is the replacement value. What I'm going to do might just blow your mind, but it builds off of using these parentheses. You can use multiple sets of parentheses in your pattern to assign parts of the search pattern to different memory slots. The slots are numbered from left to right, in the order of their opening parentheses. And what we'll see is that you can actually nest these memory slots within each other. So what information do I want out of here? Well, I want the sex, so I'll wrap that in parentheses. I also want the sex together with the animal number, so I'll wrap that in parentheses. And I also want the day post weaning, so I'll wrap that in parentheses. Again, these memory slots are assigned from left to right. The outer set around the sex and animal number opens first, so it's slot one, and the sex nested inside it is slot two. So if I want the sex first, I need to do \\2. I'll then do a hyphen, because I'm using a delimiter so I can separate those three columns apart. Then \\1 to get the sex plus the animal number as a unique animal identifier, a hyphen, and then \\3 for the days post weaning. Running all this, I now see I've got a delimited column where I have the sex, the animal identifier, and the days post weaning.
Now we can use the separate function. The column I want to separate is delimited. Then for into I'll say sex, animal, and day. And my separator is going to be the hyphen. Now I see that I've got my sex, animal, and day. I think before I maybe had animal, sex, day. Whatever. I also see that my day is still of type character. I can give separate a special argument, convert = TRUE. What convert = TRUE does is that if it can convert a column into something else, it will. And sure enough, it now converts day into an integer.
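Putting the nested parentheses and separate together, a sketch of the whole step might look like this. It assumes dplyr, stringr, and tidyr are installed. The three sample names are made up, and the name sample_lookup is my assumption for the result. A plain data frame stands in for the real distance matrix table.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up stand-in for the samples column of the distance matrix
distances <- data.frame(samples = c("F3D0", "F3D11", "M6D141"),
                        stringsAsFactors = FALSE)

sample_lookup <- distances %>%
  # Slot 1 = sex plus animal number (outer), slot 2 = sex (nested), slot 3 = day
  mutate(delimited = str_replace(samples,
                                 "^(([FM])\\d+)D(\\d+)$",
                                 "\\2-\\1-\\3")) %>%
  # Split on the hyphen; convert = TRUE turns day into an integer
  separate(col = delimited, into = c("sex", "animal", "day"),
           sep = "-", convert = TRUE)

print(sample_lookup)
```

For F3D11 the intermediate delimited value is F-F3-11, which separate then splits into sex F, animal F3, and day 11.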
We have now seen yet another way that we can take that samples column and split it into three columns for the animal identifier, the sex, and the days post weaning. In this strategy, we leveraged what we learned today about making our searches more specific and about storing parts of the search pattern in memory to use in a replacement pattern. By putting hyphens between the three sections to make a delimited string, we were then able to take the delimited column and split it into the three columns that we want. And we can use this as our alternative for the sample lookup. These different approaches to generating the sample lookup are all just as good for the data we have. I'm a little bit more proud of this version because it lets us show off a few more of our skills. But in the end, it doesn't really matter. I'm going to delete the other two approaches and use the version that we made in today's episode moving forward.
Again, I encourage you to keep playing around with these different features of regular expressions. This is a great cheat sheet. One thing you can always do is define your own strings and then see if you can design a pattern to match whatever part of the string you want. So keep exploring the second page of the cheat sheet and see if you can't improve your skills in developing regular expressions. I'm sure you will with practice, and we'll see you next time for another episode of Code Club.