I believe you can answer your own data analysis questions. Do you? You should. Stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you grow in confidence to ask and answer questions about the world around us using data.
Since starting this project, we have yet to really leave the command line interface. You've learned a lot of bash syntax: mkdir, cd, ls, pwd, touch, rm, rmdir, nano, git, make, sed, the pipe, if, mv, and probably a few others. If you've got all those down, then you're in great shape and are likely seeing the value of using these bash commands to automate a reproducible workflow. But if these still seem a bit challenging to you, don't fret. We're going to spend a few more episodes in bash to strengthen our familiarity with these commands and with my general workflow. In today's episode, we'll see many of those same commands and some new ones to help solve a problem I found in our analysis.
In the last two episodes, we used special patterns called regular expressions with sed to extract information from our file names and paths. If you did the exercises in the sed episode, I showed you how you can run sed on the contents of a file rather than its name. But sed isn't the only place that we can use regular expressions in bash. There's another, probably more popular, tool called grep where we can use regular expressions. Heck, the name grep is short for globally search for a regular expression and print the matching lines.
After the last episode, I was looking back through our files and noticed that mothur had changed our sequence names because the names had spaces in them. Have I mentioned how horrible spaces are for bioinformatics work? I also noticed that although most of our sequences start and end at the coordinates that we trimmed them to, there are a few for each region that don't. In those cases, mothur starts the sequence with a series of periods to indicate the missing data.
Later on, we might decide to toss those sequences because they're weird. But for now, I prefer to have those periods be hyphens to represent gap characters. Instead of opening these files in a text editor and replacing all the spaces with underscores or replacing the periods with hyphens, we can fix the information using sed. Along the way, we'll learn a few extra commands to keep things interesting. These are the commands that I often use to diagnose problems or do simple analyses of data in my files. Even if you're only watching this video to learn more about bash commands and don't know what a 16S rRNA gene sequence is, I'm sure you'll get a lot out of today's video.
Please take the time to follow along on your own computer and attempt the exercises. Don't worry if you aren't sure how to solve the exercises. At the end of the video, I'll provide solutions. If you haven't been following along but would like to, welcome. Be sure to subscribe to the channel and click the bell icon so you know when the next episode is released. Feel free to leave a comment, even if it's just to say hello. Please check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's video is below in the notes.
Well, here I am in the root directory for our project. Again, if I do ls, I see the directories we're used to, and if I double check on our status for version control with git, we see we're on the master branch, everything's up to date, and we're good to go. So in my introductory remarks, I commented on two problems that we're going to address in this episode of Code Club, but I want to show you how I got there and how I found the problem to start out with. So my first question was, how many sequences do we have in our data set? Now, you might think, well, that seems like a very simple question. How would I even go about figuring that out, right?
Perhaps you'd think, well, we could load it into R and count the number of sequences that way, and if you think about it, that's actually not that trivial to do. It turns out that there are a few very simple ways within bash, simple once you know the commands and the syntax, to figure out how many sequences there are in your data set. And that involves using a command called wc or using grep.
So wc is short for, at least I like to think of it as being short for, word count. So we could say wc, then data/raw/ and the rrnDB 16S FASTA file. What this output tells us is the number of lines in the file, the number of words in the file, and the number of characters in the file, a word being any text that's separated by a space. So this would tell us, well, there's 1.625 million sequences in here. I don't think that's right. Rather, it's 1.62 million lines. And so what occurs to me then is that a FASTA file can be formatted in one of two ways. In the first way, the one I typically deal with, each sequence is represented by two lines. The first line is the header, which is denoted with a greater than sign. That tells you the name of the sequence and perhaps other information about the sequence. And then the second line is the actual sequence data. But other people have the same header and represent the sequence as multiple lines, with each line being maybe 80 or 100 characters wide. So I bet this one is the latter format.
So how can we figure that out? Well, one option is to use the head command. So we can say head and then the rrnDB 5.6 16S FASTA file in data/raw. And we see, sure enough, you can kind of see it here, how my header line scrolls off beyond 80 characters, and here each row is 80 characters wide. And so this tells me that each sequence is represented by many lines. Okay, so wc alone isn't going to work, but don't forget about wc. We'll see it here again.
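To make the wc and head steps concrete, here is a minimal sketch on a tiny stand-in file. The file name toy.fasta and its contents are made up for illustration and are not the rrnDB data; the point is only that wc counts lines, not sequences, when each sequence is wrapped over several lines.

```shell
# Build a tiny, hypothetical FASTA file where each sequence wraps over several lines
tmp_dir=$(mktemp -d)
printf '>seq1 first example\nACGTACGT\nACGT\n>seq2 second example\nGGCC\nTTAA\nAC\n' > "$tmp_dir/toy.fasta"

# wc reports lines, words, and characters; -l keeps only the line count
# (tr strips the padding that some versions of wc add)
n_lines=$(wc -l < "$tmp_dir/toy.fasta" | tr -d ' ')
echo "lines: $n_lines"

# head shows the first rows, enough to see one header followed by several sequence lines
head -n 3 "$tmp_dir/toy.fasta"
```

Here the file holds two sequences but seven lines, which is the same mismatch seen with the real file.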
Another option is to use grep, which again is short for something about globally searching for regular expressions and then printing the lines. I think of it as get regular expression, although I know that's not the actual meaning of the name. That's what I think of it as. So what we could do would be to do grep, and then in quotes we put our pattern, and then we give it the name of the file that we want to search. So I'll give it the rrnDB 16S FASTA file in data/raw. And what we want for the pattern is something that only occurs once for each sequence. If you look at the first sequence in the dataset, you'll note that the first character is a greater than sign. As I mentioned, that is the definition of the header for FASTA. And so I will put grep, the greater than sign in quotes, and then the name of the file that I want to search. What grep is going to do is return every line that matches that greater than sign. And watch out, here comes a tidal wave of data. We can't possibly extract all that information without some extra tools.
One of those tools I just talked about. If I use that pipe character that we've learned in previous episodes, I can pipe this output that was just kind of vomited out to the screen to wc. And if I leave it with just wc, this tells me that there were 77,530 lines, 374,000 words, and 7.6 million characters. So what I really want is only this 77,530. What we can do to get that is to add the -l argument. And this tells us that there were 77,530 sequences in our dataset.
Now, another way we could do that is without the wc -l, but giving grep another argument. Another argument we can give grep is -c. The -c tells grep to count the number of lines that match this character rather than return the lines. And we see we get the same output using -c versus using wc -l. In my own use, I go back and forth between these two options.
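The two counting approaches just described can be sketched side by side. The file below is a hypothetical three-sequence FASTA file, not the rrnDB file, so the counts are small enough to check by eye.

```shell
# Hypothetical three-sequence FASTA file for illustration
tmp_dir=$(mktemp -d)
printf '>seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\nTT\n' > "$tmp_dir/toy.fasta"

# Option 1: print the header lines, then count them with wc -l
count_wc=$(grep '>' "$tmp_dir/toy.fasta" | wc -l | tr -d ' ')

# Option 2: let grep count the matching lines itself with -c
count_c=$(grep -c '>' "$tmp_dir/toy.fasta")

echo "wc -l: $count_wc, grep -c: $count_c"
```

Both report three headers. The pipe to wc -l is easier to peel off the end of a growing pipeline, while grep -c saves a process.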
I'll use the wc -l more often because I'll frequently be building out a pipeline where I'm stitching together different commands using the pipe. If I want to know how many lines match a pattern, then sometimes it's easier to only remove the wc -l and continue on with the pipeline, if that makes sense, versus if I had -c, then I have to go all the way back over and delete the -c to have output text that I can take on to the next step of my pipeline. Again, both of them will give us the same result of 77,530 sequences.
That works for something like our FASTA file where there's a defined structure. Again, every sequence has a header and every header starts with a greater than sign. That's very nice. Well, when we looked at our data/v19 files, we recall that we had this rrnDB.bad.accnos file. And if I give it ls -lth, it tells me that this is a 1.4 kb file. I might want to know how many bad sequences were in there. Well, I could do head, and I have to get the path, right? data/v19/rrnDB.bad.accnos. And it's a series of names of organisms. It isn't going to be very easy to search for a common element of all those lines in my bad.accnos file. Instead, what I can do, hopefully you're remembering this, is use wc -l. So I could say wc -l data/v19/rrnDB.bad.accnos. And I see there are 103 sequences that were bad sequences. These were sequences, again, that did not span the full coordinate range of the V1 through V9 variable regions of the 16S gene. So again, you can see that wc -l is useful in other contexts outside of using grep.
Looking at these names, I noticed, as I was doing this, a problem. And the problem is that these are the genus names of my sequences that didn't span the full length. I can use nano on that v19 file.
And I can see that if I scan down through here by hitting Ctrl-V, these are all genus-level names. But I'm pretty sure that for many of these I have species names that should be in those files as well. So this got me thinking that there might be a problem with how the data are being stored in these FASTA files. Perhaps you're thinking, well, let's use nano and open up our FASTA file. So I could do nano data/v19/rrnDB.align. And it immediately dawns on me that this is probably a really big file, and nano is really struggling to open up this file in memory. So it kind of lags here for a bit. What you see is that greater than sign, the genus name, and then the aligned sequences that follow. And it tells us that there are 153,148 lines. So this tells me that there's a problem. Again, I think that the genus name is not unique in those headers.
So let's try using head instead. We could do head on the file in data/raw. We'll go back and look at the raw data that we read in to see if that has any more header information in our FASTA file. And we see that, sure enough, as we looked at earlier, the sequence header has a lot more information. And actually, we need all this information so we can link it back to other information that was in a metadata TSV file that we also downloaded from the rrnDB. What's happening is that there's a space in the organism name here. We also see a variety of other spaces here with chromosome, anonymous, and then over here before the plus or negative strand.
To see how common a problem this is, we could also grep for the header line of each of our sequences. I'll go back to the raw FASTA file. And again, it spits out all this, which we've already seen. I could add head to only see the first five examples. And you'll see that's there. I guess it's not five; 10 rows is the default output from head. And we see all of these actually have a space in their names.
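As a rough sketch of that check, the snippet below pipes the header lines to head. The headers are invented for illustration and only loosely modeled on the style described above. Plain head prints 10 rows, and -n sets a different number.

```shell
# Hypothetical FASTA file with 12 sequences whose headers contain a space
tmp_dir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
  printf '>Genus species_%s|ACC%s\nACGT\n' "$i" "$i"
done > "$tmp_dir/toy.fasta"

# head defaults to the first 10 rows of whatever is piped into it
shown_default=$(grep '>' "$tmp_dir/toy.fasta" | head | wc -l | tr -d ' ')

# -n 5 limits the output to five rows
shown_five=$(grep '>' "$tmp_dir/toy.fasta" | head -n 5 | wc -l | tr -d ' ')

# Count how many headers contain at least one space
with_space=$(grep '>' "$tmp_dir/toy.fasta" | grep -c ' ')

echo "default: $shown_default, -n 5: $shown_five, headers with a space: $with_space"
```

Counting the headers that contain a space, rather than eyeballing the first ten, is one way to see how widespread the problem is.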
So what I'd like to do is replace that space with an underscore so that everything stays together. The problem is that when this comes into mothur and gets aligned, mothur is stripping out everything after the space. And I believe mothur's doing the right thing there. Sure enough, if we look at the align file, yeah, it's removing that. The name is not supposed to have spaces in it in the FASTA header. At least it shouldn't; it causes problems. So what we'd like to do is remove that space in the organism name. In fact, we want to remove all the spaces in the FASTA header for the sequences. And I'd like to go ahead and turn all those spaces into underscores so that we keep this header all the way through our analysis.
We need to go ahead and file an issue on this to change our spaces to underscores in our FASTA file as we're aligning it. So with a new issue, we'll say replace spaces with underscores in FASTA headers prior to initial alignment. So it's that first alignment. We're aligning our rrnDB FASTA files against the SILVA reference alignment, where mothur is removing everything after that first space. I'd like to double check that I've got the right script here. It's going to be alignsequences.sh. So: modify code, alignsequences.sh, to change the sequence header to remove spaces. My general plan of attack, I think, is going to be to use sed to generate a temp.fasta file that has the spaces removed, then align *temp.fasta, then rename *temp.align to *.align, and then remove the *temp.fasta.
So again, we've got our rrnDB FASTA file. We're going to use sed to change those spaces to underscores in the headers. The output we're going to send to a file ending in temp.fasta. The temp indicates to us that it's a temporary file. That's the file we're going to align. I don't want to write back over the initial FASTA file because, again, I want to keep my raw data raw.
That raw data had come to us from the rrnDB. We will then align that, like I mentioned, and then we can rename the alignment output that will end in temp.align to align, and we can then get rid of our temporary files. I'll go ahead and submit this issue. This will be issue number 15. Over here in my command line, I'll go ahead and create that branch. So we'll do git branch issue 15, git checkout issue 15. We're on branch issue 15 and we're in good shape.
So what I'm going to do over here is use sed, and we'll give it the s and then the three forward slashes, where I'm going to look for a space and replace it with an underscore. That'll be good. For the input to this, what we could do is the less than sign and the name of the file, and then the greater than sign pointing to what the new file is going to be, temp.fasta. This is one way of doing it that you'll often see, but it turns out we don't actually need that less than sign. What we can do is say run sed with this find and replace pattern on this file, and the output will go to this new file. I'm going to go ahead and copy this for now to test it from my bash script. Run that.
Now I want to go ahead and do a grep on the sequence header to make sure it did what I had hoped it would do. Because I'm going to get a ton of output, I'm going to use head to double check the output. And what I see is that, sure enough, I've got my underscore between my genus and species names, but you know what, I didn't get all the spaces. That reminds me that sed will only replace the first case of the space, and if I want to replace everything in that line, I need to put a g for global at the end of the search pattern before the closing quote. And now if I copy that and run it at my command line, and I again do grep with head on that temp.fasta file, I see, sure enough, all my spaces have now turned into underscores and we're in good shape.
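The difference the g makes can be sketched on a single made-up header. The file names toy.fasta and toy.temp.fasta are stand-ins, not the project's files.

```shell
# Hypothetical FASTA file with several spaces in its header
tmp_dir=$(mktemp -d)
printf '>Genus species strain 1|ACC1\nACGT\n' > "$tmp_dir/toy.fasta"

# Without g, sed replaces only the first space on each line
first_only=$(sed 's/ /_/' "$tmp_dir/toy.fasta" | head -n 1)

# With g, every space on the line is replaced; the result goes to a separate
# temporary file so the original file is left untouched
sed 's/ /_/g' "$tmp_dir/toy.fasta" > "$tmp_dir/toy.temp.fasta"
all_spaces=$(head -n 1 "$tmp_dir/toy.temp.fasta")

echo "$first_only"
echo "$all_spaces"
```

Reading the file with a less than sign, as in sed 's/ /_/g' < file, gives the same output as passing the file name as an argument.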
So this line works, and this is the file that I want to align. We will also want to move that to something ending in align without the temp, and we'll go ahead and remove our temp.fasta file. That should be temp.align; the outputs will be temp.align, and we want to remove temp.fasta. And again, we could do what we did last week, which was: if $? equals zero, then we want to do all this; else, I will say echo, FAIL, mothur failed to align sequences, and I'll do exit 1. Save that.
I'll go ahead and do git status on that. That's been modified. I'll git add, git commit -m, and I will say modify sequence header to remove spaces, closes number 15. I'll git checkout master and I'll git merge issue 15. That looks good. Now again, I'm on the master branch. I'm going to go ahead and run that to generate my align file, so I can do make with the rrnDB 16S align file in data/raw. I'll make this. Stay tuned, hopefully there won't be any problems, and when we come back after I zoom through this, we'll go ahead and make sure that we closed out the issue.
And that worked pretty well on the v4 data. I'm going to run a summary.seqs command from mothur on these output files, so summary.seqs, parenthesis, fasta equals data/v19/rrnDB.align. This will output the distribution of starting positions and ending positions in the alignment space, and then the number of bases per sequence and ambiguous bases. We see that we have some sequences in here that have 70 ambiguous base calls. That's probably something we're going to want to remove down the road, but for now I'll leave it. This looks pretty good. All of our sequences start at position 1 and end at 4136. You'll notice that this is longer than 1,500 nucleotides. That's because these are aligned sequences, and remember that they have gap characters in there to get the different evolutionary positions in the sequence to line up. I'll rerun this, but instead of v19 let me do v4, and what I see here is that not all the sequences start at
position 1. There's at least one that starts at position 10. And not all of them end at 645; some of them end at 640. Now, mothur outputs data from an alignment where the gaps are represented by hyphens and missing data is represented by a period. Those periods for missing data come at the beginning and end of our sequences. I'd like for those periods to actually be hyphens for some downstream analyses that we're going to do as part of this project.
Which are the sequences that don't start at position 1? Again, what we saw earlier is we could do grep with the greater than sign to get the sequence names from the v4 align file, and again I'll do head because I want to control the output that we see. We get those sequence names. But what if we don't want it to match that greater than sign? What if we want it to match our sequences? If I searched for something like ATGC, well, there are a lot of A's, T's, G's, and C's in my header row. Well, there's a special option we can use with grep, which is hyphen lowercase v, and that means don't match the pattern. So return the lines that don't match that pattern. Again running that through head, what we get back now are our sequences, which is pretty slick.
We can add another grep to this. I'll remove that head and do grep, and then I do backslash period, because a period by itself will match any character. The backslash is needed to indicate that we want to match the actual period character. We see that we get back some sequences here that start with a series of periods, and it looks like we've got three of them. But we've already seen that we can count these, right? So we do have three sequences that have periods at the beginning of them. It doesn't seem that we have any at the end, which is interesting.
This gets a bit convoluted, having these two greps together in series, and while that certainly works, there's an easier way. Again, if we're looking for sequences that start with a period, then we don't need to remove the headers, because all of our
headers start with a greater than sign. So we could do grep, quote, backslash period, quote, and then data/v4/rrnDB.align. And what happens is I got back the headers, right? That's because the headers themselves contain periods. So it does exactly what I asked it to do. I can focus it to say give me those sequences that start with a period. If I use the caret character, which is the character above the 6 on the keyboard, that will anchor the search to the beginning of the line. So this is saying find those lines that have a period at the beginning of the line. And sure enough, what we get back now are these same three sequences that all start with periods.
So what we've seen in these last couple of examples is that we can use grep to return lines that don't match a pattern, we can define patterns that are anchored to the beginning of a line, and we can also chain grep commands together in series like we did in this example. It doesn't matter how you do it as long as you get the right answer. That being said, the simpler way is oftentimes easier to maintain and understand. But sometimes you start with a more complicated case and then figure out how you can simplify the search pattern.
So let's do grep, backslash period, dollar sign. The dollar sign anchors the search to the end of the line. Then we'll give it data/v4/rrnDB.align, and we get nothing back, which is surprising to me, because when we ran summary.seqs there were a few sequences that looked like they ended earlier than the others. So I wonder if instead of a period it actually ends in a gap character. If I put in hyphen dollar sign, I actually get an error. I think the problem is that grep thinks that hyphen dollar sign is an option that I'm trying to give it, kind of like we gave it -c. So I can use the backslash escape character to say no, I want you to actually match the hyphen. And again, what we get here is all the sequences that have that
hyphen at the end of their header line, because that's indicating the negative strand. So what we could do is again use grep with -v to remove those lines that start with the greater than sign. And we see that we've got several sequences here that end with either one gap or, I believe this one ends with, five gaps. We can see how many there are: there are ten sequences that end in a gap. So that's great.
Now what I'd like to do again is create an issue that we're going to resolve that turns the periods in our sequences into hyphens. I'm going to create that new issue: convert the leading and lagging periods in alignments into hyphens to represent gaps. Output from mothur's commands is starting sequences with periods when the sequence doesn't start at position 1. Would like to use a hyphen to represent a gap instead. Again, looking back at my scripts, where do I want to do this? Well, I think I want to do it in my extract region script. And here in extract region, I'll say that after filtering, and this file is extract region, and I'll put all the nice markdown in here, we will use sed to convert period to hyphen in sequences that are outputted from mothur commands.
I will do git branch issue 16, git checkout issue 16. We are on the branch and everything is clean. We will go ahead and go to our extract region shell script. Looking at the code that we have here, we see that the output of all this ends at filter.seqs, and the file that we have got here ends in filter.fasta. That is the file that I'm going to operate on with sed. So I will do sed with my find and replace, and I will operate on that FASTA file. Like we did earlier, I'll go ahead and output this to test.fasta, and hopefully that will all work. Then I will change the move so that test.fasta goes to target, and we will also want to get rid of that file because we don't want to keep around our temp files, so
test.fasta, actually, is going to be moved to target, so we will get rid of the filter.fasta. Now I want to double check that this works. I've got to put in the pattern and the replacement. What we are going to do is look for those that start with a period and then repeat the period zero or more times, and we are going to replace that with as many hyphens as needed to replace the string. If we save that, now we should test this for sure, right? So we will go ahead and run sed on that, and we will use the file that we have already generated, rrnDB.align, and let's output that to test.fasta. We can test that it worked by doing grep -v to get rid of the header lines on the test.fasta, and pipe that to another grep and look for anything that contains a period. Nothing matched. It worked. Very good.
Let's go ahead and run make data/v4/rrnDB.align and we will see if this works. I will be back with you in a second. So it ran all the way through. Let's go ahead and look at our data/v4 directory, and we see that we have got those files. I want to double check by doing my grep to see if we can find any lines that start with periods in data/v4/rrnDB.align. Nothing matched. We are in great shape.
Again, we modified that script. I am going to remove my test.fasta file. That's good. I will go ahead and do git add on the extract region script in code, then git commit with replace leading periods with hyphens, closes number 16, since we are on issue 16. We will do git checkout issue 16, oh sorry, I am already on there. git checkout master, git merge, and finally I need to do a git push. I will check my issue out, and it has been closed. Now that I have closed the issue and pushed the changes back up to GitHub, I need to go ahead and build out those other rrnDB files. I will go ahead and do make with the align file in data/v4, and what I can do is put them all on a single line: data/v45, data/v34, data/v19. It will now run through all of these. I will go ahead now, while that is
running, and describe to you all the exercises that I would like you to work on here during the break.
For the first exercise, what I would like you to figure out is how many sequences in our rrnDB align file, the full length version, have ambiguous bases in them.
The second is to determine how many of the full length sequences in that same file contain the standard forward primer to amplify the v3 region, shown here, or the v4 region, also shown here. As a bonus, see if you can figure out how to modify your regular expression to represent the degenerate bases. Remember that you can use the star to repeat the previous character zero or more times, and that you might need to use the hyphen to represent and match a gap character.
For the final exercise: the FASTA sequences that we looked at had a header that contained four fields separated by pipe characters. What I'd like you to do is generate a file that contains the four fields separated by commas, so this would then be a CSV, or comma separated values, file. Be sure to remove the greater than sign from the header row. To stretch yourself a little bit, figure out if you can give the four fields names, or column headings, without using a text editor.
Go ahead and work through those exercises. Pause the video, and once you've gotten through them, go ahead and press the play button and I'll come back and show you how I work through them.
As always, I hope you found those exercises engaging and that they helped you to stretch your brain muscles a little bit with the new material that I've covered in today's episode. I've written up the three exercises here and we'll work through these together. So again, how many of the sequences in our rrnDB align file have ambiguous bases in them? Well, we're going to use grep. So we'll do grep, and I'll do that negative search to remove the header lines, and I'll remember to give it data/v19/rrnDB.align, so we'll only be looking at the actual sequences. We will then pipe that to another grep and I'll do a
-c. Another way of writing this would be to grep for N and then pipe to wc -l. Let's run both of these and see what we get. So that gives us 174 sequences, and this other one also gives us 174 sequences. Very good. Again, remember what we're doing is the anti-search, looking for lines that don't match this character in our rrnDB align file. We're then running another grep to count the number of lines that match the N, either using the -c within grep to count or with wc -l to count the number of lines that match.
In this next exercise, we ask how many of the full length sequences in data/v19/rrnDB.align, that same file, contain the standard forward primer to amplify the V3 and V4 regions. Very good. Well, it's going to be very similar. So we'll do grep, and I'll go ahead and copy this sequence down in, and the file we're going to search is data/v19/rrnDB.align. And that's not going to match, right? Nothing is going to come back if I search for this. It's a little bit slow because the file is big, but it's not going to match because that file is aligned. So what we need to do is insert the gap characters. Now, I don't know where the gap characters occur in our primer, so what I can do is put the hyphen star, and that means match the hyphen character zero or more times after a C. I will copy that and then add wc -l to count the number of times that primer is found. What we found is that it shows up 74,966 times in our sequences. To remind us how many total sequences we had, data/v19 was about 75,000.
What did I do wrong? It's a surprise that nothing matches, so I want to double check what's going on here. If I do head on my align file, there's a lot of output here, and what I'm noticing is that my greater than sign has a hyphen in front of it. That's not good. So I think what it's doing is matching this zero or more times and replacing it with a hyphen, and
perhaps what I'd rather have it do is match that one or more times. If I want to match one or more times, then I'm going to use the plus sign. Now let me rerun that. So let me redo make on data/v19. Actually, before I do that, let me go ahead and git checkout issue 16, and then we do make on data/v19. We'll let this run and see if that solves the problem. So that seems to have gone through. Let me go ahead and do a head on my data/v19/rrnDB.align, and sure enough, that minus sign is gone now. So we're in good shape.
While I'm in the middle of this exercise, let me do git status, git add on the extract region script in code, git commit, and I'll say remove hyphen from before greater than sign in header. git checkout master, git merge issue 16. Actually, I'm not ready to merge it. I need to go back and modify that commit to indicate that this goes with issue 16. I showed you how to do this before: git checkout issue 16, git commit --amend, and I will then say closes number 16. It's okay if we have two commits that close 16. Again, I'll do git checkout master, git merge issue 16. So all the typos are real, all the bugs are real. I didn't make this up. I git merge issue 16, and that's great. You'll see I added another commit that referenced this issue. Now both of those commits are tied to that issue.
All right, where were we? Back here with our example. We wanted to grep this and run it. This was our primer sequence that we were looking for. This should work now, and sure enough, we see that we've got the 74,966. I think what we were trying to do was grep -c to count the number of sequences in the rrnDB file, and I'm going to go ahead and put in that caret to anchor it to the very beginning, because that's how we found the bug in the first place. And we see that we've got 76,574. So that's a pretty good representation of the sequences that match that v3 forward primer. Again, we use the coordinates because we know that primers aren't universal, and that's one of the reasons we use those numerical
coordinates. Now, the next primer sequence is going to be fairly similar to that, except we see that it has an M in it, and an M is a degenerate base. I'll go ahead and build out this padded primer, padded for the alignment, and instead of an M we can use a period, as I put in the clue. If we run that, we'll see whether or not we get more matches. And we do: we get 75,700 matches to the full length sequences using that v4 primer.
Now, that was an M that we replaced with a period. What does the M actually represent? Well, I always Google for IUPAC code, and you'll see the link is purple because I've already been here. This tells you the codes for the different degeneracies, and an M represents an A or a C. That's good to know. So that period can represent A or C. We don't want to match an A and a C; we want to match an A or a C. What we can do is wrap that AC in square brackets, and what that says is: in that position, match either an A or a C. We'll again run this, and we see that we matched one fewer sequence than we did using a period. Again, we didn't use primers to locate these regions; we used coordinates, and that works pretty well.
The final exercise indicates that the FASTA sequence headers contain four fields separated by pipe characters. Can we generate a file that contains the four fields separated by commas? Be sure to remove the greater than sign. To stretch yourself, figure out how to give the four fields names without using a text editor. All right, so what I'll do is grep with the greater than sign to get the header rows from data/v19/rrnDB.align. We could use any of these files; this again is for exercise. That's going to give us all of our headers. I then want to replace those vertical bars with commas, so I'm going to pipe this into sed, and we will do s, forward slash, vertical line, forward slash. So I'm going to replace that pipe. I'm pausing because I don't remember if that pipe is a special character or not, but we'll replace
that with a comma and then the closing forward slash, and I want to replace all of them, so I'm going to go ahead and use a g. I'll do head to test things out and see if we get ourselves into trouble, and we see that, sure enough, that replaced them with commas. But I still have that greater than sign. I'll go ahead and do sed s and replace that greater than sign with nothing, and do head again. Copy and paste that in, and you'll see that we no longer have that greater than sign.
The stretch was to figure out how we could put in a header for the column names. What I will do is use the echo command. So we'll say echo, and then in quotes I'll put organism name, with an underscore, not a space, and then we will put in, I think this is like the GenBank accession, comma, no spaces, and then the next one, I think, maybe this is the genome accession. It doesn't really matter; I'm just trying to demonstrate how to add the column names. Then we'll say the sequence accession, and then the fourth, well, I guess there's five, so we'll say location, and then coordinates, say genome coordinates, say chromosome here, right? Whatever we put in there doesn't really matter. It turns out that there's five fields; I'll probably go ahead and update that in the notes. What we can do is output this, like we've been doing, to our file, and so I'll call this header_table.csv. If I run echo on that and then I do nano header_table.csv, I see I've got those values, which is great. I'd like to output all of the headers to header_table.csv as well. Now if I run this and I do head header_table.csv, I'm sorry, well, yeah, we can see it here, right? We no longer have our column names, and that's because we used a single greater than sign. If we instead put in two, the two greater than signs mean append. So one means take the stuff on the left of the greater than sign and put it into the file on the right; two means to append it. If you only use one, it's going to write over
everything that's in there. If we run these two lines and then do nano header_table.csv, we see, sure enough, we've got our column headings as well as the different columns, again separated by commas. If you figured that out, you must have done a little bit of googling or had some prior knowledge. Good job. The key to that exercise, again, though, was to figure out how to generate the five fields separated by commas.
Thanks once again for joining me for this week's episode of Code Club. Be sure you spend time going through the exercises on your own to help reinforce your new skills using regular expressions. You'll find regular expressions are common in every programming language, and the syntax is pretty much the same. Once you get the hang of how to use them in one language, they're pretty easy to master in the others. It would be great if you could take the ideas we've worked with today and think about how they relate to your current projects. I'd love to see how you're adapting what I've covered in this and other Code Club episodes. Also, feel free to ask any questions you have in the comments below, and I'll do my best to answer them in a future Code Club. Be sure to tell your friends about Code Club, and to like this video, subscribe to the Riffomonas channel, and click on the bell so you know when the next Code Club video drops. Keep practicing, and we'll see you next time for another episode of Code Club.
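
For anyone working through the last two exercises afterwards, here is a minimal sketch of them as a single bash script. It is an illustration under assumptions, not the exact commands typed in the episode: the three records, the short primer-like pattern, the column names, and the file names example.align and header_table.csv in a temporary directory are all made up so the script runs on its own without the project's alignment files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for an aligned FASTA file: three made-up records whose
# headers have five pipe-separated fields (field order and values are invented).
workdir=$(mktemp -d)
align="$workdir/example.align"
printf '%s\n' \
  '>Org_one|GCF_1|NC_1|100..200|chromosome' \
  'ACGTAGG' \
  '>Org_two|GCF_2|NC_2|300..400|chromosome' \
  'ACGTCGG' \
  '>Org_three|GCF_3|NC_3|500..600|plasmid' \
  'ACGTTGG' > "$align"

# A period matches any single character at that position, and the caret
# anchors the pattern to the start of the line.
any_count=$(grep -c "^ACGT.GG" "$align")

# [AC] matches only an A or a C, which is what the IUPAC code M stands for.
m_count=$(grep -c "^ACGT[AC]GG" "$align")
echo "period: $any_count   [AC]: $m_count"

# Write the column names first; a single > creates or overwrites the file.
# These column names are placeholders, not the real field definitions.
table="$workdir/header_table.csv"
echo "organism_name,genome_accession,sequence_accession,coordinates,location" > "$table"

# Pull out the header lines, turn every pipe into a comma, drop the leading
# greater than sign, and append with >> so the column names are kept.
grep ">" "$align" | sed "s/|/,/g" | sed "s/^>//" >> "$table"

cat "$table"
```

To try this on real data, point align at one of the aligned files and swap in the padded primer pattern and the column names you actually want. The pipe needs no escaping in sed's default (basic) regular expressions, which is why s/|/,/g works as written.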