We'll continue with developing a readFASTA function. There's one thing that is important whenever you develop code, and that's the data you develop it with. Right now you have a single FASTA file on your computer, and it contains 100,000 nucleotides. If you ever need to look at your data and see whether what you did with it is correct, that'll be difficult. It's just too large. So it's always good to start your initial development with synthetic data, test data, data for which you know the results. And that's especially important if later on you go and work with any kind of statistical analysis or machine learning. Before you work on actual data, you must work on synthetic data for which you know the correct results. It's absolutely pointless to throw an intelligent algorithm at real-world data and then just blindly trust and copy the numbers. You have to know that your algorithm is actually able to produce the correct results on your system, in the scenario that you're working with, with all the limitations that you have. And for that, synthetic data is absolutely paramount. Those of you who are going to stay on for the two days of the EDA workshop are going to hear a lot more about that.

Now what does a synthetic data set that's useful for us look like? Well, it ought to be in FASTA format. It ought to be small. It should have some sequence characters. Since we're going to be using it to look at dinucleotide frequencies, it would be nice if we can count the dinucleotide frequencies by hand and see whether what we're doing is correct. So let's write up something. We open File, New File, and in this case a text file. So what should we write? First the header character, then something: test. Then we write a little bit of sequence. You tell me what I should write. Dinucleotides. Dinucleotides? Well no, I mean the second line of my FASTA file, so I should put in some sequence. So sequence.
I wouldn't use an actual gene sequence here. I'd use something for which I know what the result ought to be. Okay, that sounds reasonable, so let's try it. And we might anticipate a special case and actually do this in lower case. Or actually in mixed case. Okay, fair enough. Now what do we do? Save it. Save it. File, Save As. What's a good file name? test.fasta. Very good. It's always useful to have your file name tell you what the file contents are.

On the Mac I use the space bar in file directories a lot. I don't know if you know that function. In the file browser, if I select a file and press the space bar, it shows me what the file is. If it's a text file, or a file type it recognizes, I can actually look into the contents. I don't know if there's an equivalent function in Windows, but on the Mac I use that a lot. So I can just quickly scan through the actual contents of a directory. Other than that, make your file names meaningful. File one, file two, definitely-last-file: that's not a good way to do it. Usually I write dates for sorting. And usually, if it's anything that I maintain, especially if it's documents that I exchange with my collaborators, I always put a version into the file name.

Okay, but now this is test.fasta. So we can develop with this. Later the file name will come in as a parameter, but to develop the code we can simply first define fn to be test.fasta. Now we implement our code step by step. If I'm in a comment section and I press Enter, I continue creating comments. If I press Shift-Enter, I start a normal code line. So let's read the file and assign it. We need to assign the results to a variable name. Now when I read files of any kind, I usually don't use them right away as they are. Usually there needs to be a little bit of processing: removing records that I don't actually want, skipping lines in the input, things like that. So I generally just call the result of the raw read command tmp, one of my ways to write a throwaway variable.
So I know I'm not going to actually be using this for anything. That's just how I capture my first set of results. And the command we mentioned we'd be using is readLines. And the argument is the file, which is fn. Now remember we've already defined fn to be test.fasta. In the function, fn would be passed as a parameter from the argument list of the function signature. So we can execute that. We have tmp. tmp is a character vector. It has one, two, three, four, five elements. And the five elements correspond to what we want to work with. Happy? I'm almost happy. I'm almost happy because, remember, when we read in the FASTA file from Ensembl there were two extra blank lines at the end. I want to be sure that whatever my function does, it doesn't add characters or anything if it encounters blank lines. So when you can already anticipate problems with your data, it's good to put them into your test dataset. So I'll change that, test.fasta. I put in an extra blank line at the end. It doesn't really matter. Let's save it. Close it. readLines again. See what the result is. Right. So now we have a header line, four A's, four C's, mixed-case G's, some T's, and an empty line at the end. Two adjacent quotation marks are the empty string. So it's a string, it's a character thing, but it contains nothing.

Next step. Discard the header. What do we know about the header according to the specifications? Hang on. We need to troubleshoot some problems. tmp. You didn't actually reread the file, you just looked into what tmp was. See what the issue is? You read the original file and assigned it to the variable tmp. Then you looked at tmp. You changed the file. Then you looked at tmp again. Then you changed the file again. Then you looked at tmp again. It's always the same original tmp. You never reread the file. If I change the file, I have to reread it.
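The read-and-reread behaviour described above can be sketched as follows. This is only an illustration under assumptions: the header text, the sequence lines, and the use of a temporary directory are placeholders, not the exact file typed during the session.

```r
# Hypothetical test data: header and sequence lines are placeholders.
fn <- file.path(tempdir(), "test.fasta")
writeLines(c(">test", "AAAA", "CCCC", "GgGg", "TTTT"), fn)

tmp <- readLines(fn)   # raw read into a throwaway variable
length(tmp)            # 5 elements: one header plus four sequence lines

# Change the file on disk: append a blank line at the end.
cat("\n", file = fn, append = TRUE)

length(tmp)            # still 5: tmp is the old copy held in memory
tmp <- readLines(fn)   # reread to pick up the change
length(tmp)            # 6
tmp[length(tmp)]       # "" (the empty string)
```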
R doesn't automatically maintain a link between the object in memory and the file on disk. Okay, so I can move it up there, too. That's our development data.

Now we need to discard the header. What do we know about the header? It's the first line. And it's one line only. Now this is a point where we could put in a lot of sanity testing. We could check whether the first line actually starts as required. We could check whether there's more than one header line, and in fact this is a multi-FASTA file. We could check whether it has a header at all. It might not. It might be a plain text sequence file. Should we accept that or should we not? That's a bit of an interesting philosophical question. There's a saying which says: be most permissive with what you read and be most restrictive with what you write. So when you read something, you might write your code to gracefully adapt to all kinds of variant formats: the existence of a header line, the header being more than one line, odd characters in your sequence, and so on. When you write, you have to be sure that you write absolutely meticulous FASTA, a single header line, and so on and so on.

Even though that's a good general rule, I don't think that rule is very good for bioinformatics. I think you should also be restrictive in what you accept and what you read. So for example, if I have a function called readFASTA, I think it would be a mistake to allow it to read files that don't have a header line. If it encounters a file that does not have a header line and you want to be diligent about handling errors, the function should fail and say: warning, this file does not have a header line. If this happens frequently in your workflow, you might have an explicit flag with which you can switch this off, or make it explicit that you will accept something else.
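A restrictive check of this kind might look like the sketch below. The function name checkFastaHeader and its messages are hypothetical, introduced only for illustration; the session itself leaves the format check as a to-do. The only format fact relied on is that a FASTA header line begins with ">".

```r
# Hypothetical helper, not part of the function developed in the session.
checkFastaHeader <- function(lines) {
  # Fail if there is no input or the first line is not a header.
  if (length(lines) == 0 || !startsWith(lines[1], ">")) {
    stop("This file does not have a header line.")
  }
  # Fail if more than one line looks like a header (multi-FASTA).
  if (sum(startsWith(lines, ">")) > 1) {
    stop("More than one header line: this looks like a multi-FASTA file.")
  }
  invisible(TRUE)
}

checkFastaHeader(c(">test", "AAAA", "CCCC"))   # passes silently

# A plain sequence file is rejected instead of silently losing its first line.
msg <- tryCatch(checkFastaHeader(c("AAAA", "CCCC")),
                error = function(e) conditionMessage(e))
msg
```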
But in principle, if your file does not look the way you think it should look, there's really no guarantee that you can work with the file in the way that you thought would be possible. So I think it's much better to be restrictive, and if the file does not match your specifications, don't work on it. For now, we might put this into our to-do list: check for format. I'm sure you can do something like that at home. For now, we just discard the header.

How do we do that? How do we discard the header line, the first line? If we have a vector and we pass negative indices into the square brackets, it means everything from that vector except for these elements. So tmp[-1] gives us these strings. If we don't just want to print that, we need to reassign it to tmp. After we do that, we have a new tmp.

Just as an aside, if we can guarantee that we read FASTA files and we can guarantee our format, because this is part of an internal workflow that we write, we could also have attached this directly to the function call. The function call is evaluated as an expression. So the result is in fact a vector, and we can attach the [-1] to it. But in that case, the header line never makes it into your code. You can't use it later on to interpret it. You can't store it somewhere else in case you want to work with it. And most importantly, you can't use it to validate your file. If it's a raw text sequence file, you will be losing your first line of nucleotides.

So we discard the header. Next we need to collate this into a single string. How do we put elements from a character vector together into a string? I don't know. I'm afraid you'll have to Google that. Figure out what you think you would ask Google, then do your Google search, and then tell me if you think you found something that is useful. Well, that is a different kind of function that would read all the characters, and it would include the line breaks. Concatenate, the c function.
What does concatenate do? It combines things into a vector, but it doesn't join them together. There's a function called cbind. cbind takes a two-dimensional object and adds columns to it. cbind is a column-bind function. It has a cognate, rbind, for row binding. No, that's not what we need, either. Well, paste. We don't need to do this. Do we need to do this? At the point at which we're at, I think we do need to do this. Especially from the perspective, I've erased it now, that maybe sometimes we'd like to keep things together in a single string, so not in separate lines. So yes, let's do this. Can we? I don't know that it does that, does it?

Okay, so my top Google hit here is how to collapse a list of characters into a single string in R. That kind of fits the bill. Top answer: you can do it with the paste function. paste. There's an example here. Let's try that. In this example, I'm passing the paste function a vector of three letters, and then I'm using the collapse parameter with something, a comma and a blank. And what I get is three letters separated by comma and blank. What happens if I don't use that? Then I just get the three characters back. So collapse seems to do something. collapse seems to take elements from a vector and put them together into a single string. What happens if I don't pass a vector but just three separate elements? Oh, now I get a single string, but it's separated by a blank. And apparently collapse did nothing, because I don't see any commas here. So that's kind of confusing. It's different if we give it a list of separate elements than if we give it a single vector, and collapse seems to do something to a vector, but something else happens with that list. Okay. Well, paste looks useful. So let's see what R's help tells me about paste. Ah, it tells me that there are actually two possible parameters here. One is sep and one is collapse. And these have default values.
The default value for sep is a blank. The default value for collapse is NULL. So what does that mean? sep is a character string to separate the terms. Now if we give paste a list of terms like here, then the default sep will apply to join them together. So for example, if I define sep to be the pipe character, I get this. However, collapse is an optional character string to separate the results. So what does it mean to separate the results? Well, if paste doesn't get a list of terms but a vector, things look slightly different. If I have a single vector of a, b, c, there is only one term, so sep has nothing to separate. So I have three different results here, which I can collapse into a single string using the collapse parameter. And we do have a vector from our FASTA file. tmp is a vector. So we need something like this. However, we don't want a comma and a blank. We don't want pipe characters. We just want nothing. So how do we need to define collapse? With two quotation marks, the empty string. So that looks good. And we can just overwrite our tmp.

I have a question. Yeah. How exactly do you know whether you should use single quotes, double quotes, or no quotes? If you want a character string, you use some quotes, single or double. If you have no quotes, R assumes this to be a variable name. So tmp in quotation marks is the character string tmp. tmp without quotation marks is the object, or the variable, that we've previously defined to be called tmp. Single quotes and double quotes are interchangeable. I believe in all cases you can use either one or the other. Is that right? I can't think of counterexamples right now. Of course they have to be paired. So if you open with a single quotation mark, then you close with a single quotation mark.

Okay. So we're one step further. Now we have to break this into single characters. How do we do that? How do we even go about solving that problem?
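The steps so far, dropping the header with a negative index and joining the remaining lines with paste, can be sketched like this. The contents of tmp are a hypothetical stand-in for the lines read from the test file.

```r
# Hypothetical stand-in for the result of readLines() on the test file.
tmp <- c(">test", "AAAA", "CCCC", "GgGg", "TTTT", "")

tmp <- tmp[-1]   # negative index: everything except element 1 (the header)

# sep joins separate arguments (terms); its default is a single blank.
paste("a", "b", "c")                       # "a b c"
paste("a", "b", "c", sep = "|")            # "a|b|c"

# A single vector is one term, so sep has nothing to join:
paste(c("a", "b", "c"))                    # "a" "b" "c"

# collapse joins the elements of the result into one string.
paste(c("a", "b", "c"), collapse = ", ")   # "a, b, c"

tmp <- paste(tmp, collapse = "")           # empty string: no separator
tmp                                        # "AAAACCCCGgGgTTTT"

# Single and double quotes produce the same character string.
identical('tmp', "tmp")                    # TRUE
```

Note that the trailing empty string contributes nothing to the collapsed result, which is the behaviour the blank line in the test data was meant to confirm.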
The answer by now: we do the same thing again. We Google for how to split or break a string into single characters. How do we go about getting a vector of single characters? What does Google suggest? If I Google for R, break a single string into characters, I get many suggestions. This one says how to split strings in R, for dummies. That would be me. How to split strings in R. So apparently there's a function called strsplit. String split. strsplit. Okay. Let's explore that. So strsplit: split the elements of a character vector x into substrings according to the matches to substring split within them. Okay. Let's experiment with that. What happens if I just say that? Okay. It complains. The argument split is missing. So we have to supply something to split on. What could we be splitting on? For example, we could split on g. Hang on. I never updated this. We're now at a state where tmp has to be a single string like this. strsplit. Okay.

Now we have lots of brackets here. We need to figure out what that means. First of all, we split on g. Apparently that works. It takes our string and breaks it apart into parts that are delimited by a lowercase g. The lowercase g itself does not appear in the results. That's important to know. Whatever you specify as a split pattern does not appear in the results. And that can be useful, because the pattern is actually a regular expression. You can make sophisticated patterns that match things in your text or in your input that you don't want. They get consumed by the algorithm and then you get only the results. So that's good. However, there's the thing with this double-bracketed index one, [[1]]. What does that tell us? If you think way, way, way back to the tutorial on how we define positions in lists, you might have come across these double square brackets. So that seems to tell us the result is a list. And the first element of that list is a vector.
That vector has three elements, which is the result of our split. Now why would it do that? Why do we take a perfectly good vector and then put it into a list before we return it? Well, the secret here is that strsplit is actually quite versatile, and you can apply it to vectors of elements. So for example, instead of this one element of tmp, I give strsplit a vector of tmp and then a shortened version of tmp. So now two, oh, sorry. That won't work. Let's think of something else. So the thing that should be split is now a vector in which one element is our tmp string and one is an explicit string, which is G, G, G, T, T, T, T. And the split is on lowercase g. Now strsplit splits each of them in turn. It operates on each of the character elements in turn. It puts the result for the first one into list element one, and that result is a vector of three strings. And it puts the result for the second one into list element two, and that result is a vector of two strings. And that's why we need a list. We can't return such results as a matrix or a data frame, because the lengths of the results can be different. And the only way to store vectors of results that have different types or different lengths is to put them into lists.
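The list behaviour of strsplit can be sketched as follows. Both input strings are hypothetical examples chosen so that the element counts match the description above; they are not necessarily the exact strings from the session. The last lines show the step the walkthrough is heading toward: an empty split pattern, which strsplit documents as splitting into single characters.

```r
s <- "AAAACCCCGgGgTTTT"   # hypothetical collapsed test sequence

# One input string: a list of length 1 holding a character vector.
strsplit(s, "g")          # the pattern "g" is consumed, not returned
strsplit(s, "g")[[1]]     # "AAAACCCCG" "G" "TTTT"

# Two input strings: one list element per input, lengths may differ.
res <- strsplit(c(s, "GGgTTTT"), "g")
res[[2]]                  # "GG" "TTTT"
lengths(res)              # 3 2

# An empty split pattern breaks the string into single characters.
chars <- strsplit(s, "")[[1]]
length(chars)             # 16
```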