 Welcome back. Let's see how far we can get until the next fire alarm. I hope it's not us causing this, magically. So where were we at? We were discussing string split. And we were specifically discussing why string split needs to put its results into lists. And the reason is that string split operates on potentially a large number of individual strings to split. And the result of every single split can be a vector of widely varying lengths and the data structure to put elements with different lengths in R, these are lists. But normally we don't want a list, especially in this case we don't want a vector inside a list. But we specified our function to return a single vector. So we need to take the vector out of the list when we assign it. Something like that, there's two different options we have. One option is like this. To simply, as I've done in the vector example before, extract from the result for which we know that it's the first list element, which is a vector. We might do this more explicitly, but I'll get back to the version that I've just written. But I'd just like to demonstrate to you what's going on here. So at first, we have a list of one element, which contains a vector with three elements which have been split up. So if we then take this first element, we get a plane vector. So this is the same thing as I've done before with an intermediate assignment to the variable temp, TMP. But as I said, we don't actually need that intermediate assignment. We can just do this. So string split, and then attach the subsetting operator of the two square brackets, and then assign the result. The double brackets? Yeah. The double brackets? Yeah. Well, double brackets subset a single element from some object that has more than one element. So we typically use these double brackets to get single elements out of lists. If we use just a single bracket, then we can extract ranges or multiple dimensions. But the double brackets don't work in that way. Extract the line, one line. Right, so temp contains one line. If I use string split and operate on temp and split on the lowercase g, the result is a list. I want the first element of that list, so I attach that operator to this expression. An alternative thing that I could have done is the following, and use the unlist function. So in this case, the unlist function does the same thing. But if my element would now contain more than one element to split, all of the resulting elements would go into the same vector. Now, depending on the situation, what you expected, what you don't expect, either one or the other is the better way to do this. If you want to be defensive in programming, after you generate a list, you can test that it indeed only contains one element as expected. So if we look at length of this result here, it is one, not three. It's one vector that contains three elements, and the list is the top container. But if we unlist this, then everything just gets thrown together. Now, we could test for the length and throw an error if the length is not the expected length of one. For our code here, it really doesn't matter. We can write this in either way. I think in this case, I would do it this way. Right, but we didn't want to split on g, right? We wanted to split this up into single characters. So how do we do that? Potential marks. Right. So the idiom here is, if we split on the empty string, the result is single characters. So string split temp on empty strings is a list again. And if we unlist it or extract from the list, then we finally get our single characters. And we assign our single characters to a variable v, and we return the result. And we're done. Equal, just put the quotation. Save the same result. Yes, this. If we don't name the parameter, we can get the same result. OK. That's a good comment. What's this with naming characters and not naming characters? For this, we need to consider how a function actually knows what value is named that is passed as a function argument. And that is information we find in the function signature. So if we hover over, say, string split, we get this little box here, which is the signature, i.e. it's the name, the parentheses, and the names of the arguments that we pass into the function. Now, we can pass parameters into the function in two different ways, either by default position or by name or mixed. By default position means if x is the pattern to split, split is the pattern to split with. Fixed is whether we use fixed variable bytes. Perl is the dialect of the regular expressions to use and so on. We can put all the expected parameters simply in that order and it will work. But we can also change the order around if we name the parameters. So for example, I can say string split empty string and I will get this result. But I can also say split equals empty string x equals 10 and turn it around. So if things are not in the right order, then you must name them. If things are in the default order, you can omit the name. Typically what I end up doing is to omit the name for the first argument if that argument is the input for a function because it's just x or something usually, internally. But I use names for the other parameters and the reason is simply to make the code more explicit. It's easier to read if I know that I'm not just passing an empty string and how do I remember the function signature? I'd need to look it up and that just wastes time. I can make it explicit and simply say split equals the empty string. So string split what and then pass the parameter. I think this is relatively explicit. And then we return the result. So let's see if it works. In order to make the function known, we just execute this piece of code. Or in this case, since the file is laid out or constructed so that nothing untoward happens as if we execute the script, we can just source the whole thing. So I click on source. This sources the file. And now in my functions, I have a function called readFastA. And the signature is function of filing. Now I had defined fn as test.fastA to make sure that this global variable does not interfere with the local variable I deleted again. So that's one way probably the way I write most of my bugs is when I develop a function locally, define some variables with the names that are used in the functions. And then I run the function and I expect something to happen, but something else happens instead because the variable exists in two places, internally and externally. So it's a good idea to remove such variables that you've used for development when you don't need them anymore. So we remove file name. And now it should work. Let's try it. So readFastA and file name was test.fastA. There you go. One vector with single characters. OK. So as a result of this, everybody should have a local function readFastA. If you don't, put up some red posters and we'll see how we can help them. What's happening instead? I'm not having a specific question. I think we just wrote a question here. It was supposed to be the same one, right? And when I took it this way, I didn't do it. I'm not a parent. I was just trying to figure out how to do it. So why do you turn it? What do you do? So, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so, so I have a very nice example here. I've been waiting for this. Tammy needs to debug her code. Okay. So what Tammy did is execute this line and then say, hmm, nothing happens, it doesn't work. So at that point, she needs to go and debug the code. So how do you debug code? Well, there's varying methodologies. One methodology I like is called rubber duck debugging. Have you ever heard of rubber duck debugging, Lauren? You haven't heard of rubber duck debugging, Craig? Have you ever heard of rubber duck debugging? I'm not familiar. No, okay. Anne, do you know rubber duck debugging? But you have something that doesn't work. Okay, so here's my little rubber duck. And I tell the rubber duck, so in this line, I'm assigning the result of string split to V and I don't see that I'm getting anything. And as I explain this to my little rubber duck, I might notice what, hang on, I'm assigning it. I'm not printing the results. I'm assigning it to V and indeed, if I look at the state of V, show this to my little rubber duck, I see that it actually worked. Not just V, I do get the appropriate response, but no function to return from. Okay, but you would have explained this correctly to your little rubber duck. So here's a rubber duck. Now, since I expect we'll be doing more debugging, I brought rubber ducks for everyone. And since maybe not everybody likes a rubber duck, there's also other squishy things. So what I'm going to do is, I'm going to put rubber ducks on this chair in the middle of them. There's enough for, oops, ready, lively, lively, then nice, right? I also have some for our second workshop, boy, we're going to need many rubber ducks. Pick and choose, have a squeaky rubber duck or a squishy thingy, and use it for debugging. It's great, it actually works. If you don't have a rubber duck for debugging, you might have a significant other, it also helps to explain the code, but if all you do is talking about thugs and your code, I'm not sure how good that will be for the relationship. So rubber ducks. Okay, is everybody at that stage? Everybody has a function. No, okay, let's debug some more. Yes, oh, sorry. You're good, all right. Okay, I just wanted to ask this question. Okay, so I have, this is the other issue I wanted to ask. So you're just trying to have a lot of them here at your class? No, I can't, all of that. Yes, so I can see that. Yeah, I can see that. This is all, try that with the two eggs. That's the one that's the one that's the one. So that makes sense, doesn't it? I'm just wondering if you have any questions? I just want to know if you have any questions. Yeah, I'm sorry. So are you ready to get up here? Ready? Yeah, I got the question. So we're saying everybody's going to have to work. So let's just say that's the pair. So what do we mean? Yeah, so read fast that this function involves every operation in the process. So this is not needed, right? If you want that, you can have it at the company. Yeah, so that would be fine. Are you calling in my phone? Yeah. Read fast that. Read fast. Yeah, so you're defining it as my function and you want to define it as read fast. Try it. There might be other options. So now she would have to work in the lab. Read fast that. Or you could just put read and see what that means. Like it's that fast. We want to try it. Yeah, so we remove the test fast that it should. No, no, no. There's just a few extra things in here that you want to get rid of. Yeah, because it's automatic. Awesome. Okay. It's going to go like, okay. Read fast. Read fast. Read fast. Read fast. Alright. So here above this is where you define the function where you define the function Yeah exactly. No, so you just have some extra step lobbying I didn't want to just only have this in line, so, yeah. Okay, for there, see if you can add me when I need to do this. I need to do this. I need to do something twice. I'm going to try to do this. So then, like I said, I'm going to follow Henry as fast as I can. Oh my goodness. I don't want it to be out one definition. It's going to return the last thing that was done. Okay. I need to build it. It should be fixed in this place. So if you use it all three times, if you get it two times, it's going to be just a little bit more. And then I just pull it. And then I just pull it. And just import it. And do that. And do it like that. Great. Yeah, okay. So you have it all written out, right? Okay. The function that you wrap it up in, then you can test it. Okay. And so when you're naming it, say you're going to do this. So I'm going to do this. I'm going to do our action function. Yeah. Right there. So that's the function. And then this. And you name the vector. Can you give me that? Okay. And then that was... I left it all just stored in that vector. Okay. So this is the function you're interested in. So 20.R copy. So this is the step. Yep. So we're going to do that. And then you can... I don't know. Yeah. The function I'm going to read. The function is for that one. Okay. So I'm going to do that. Okay. So I'm going to do this. Okay. So the function you're interested in is 20.R copy. So the file you have is kind of like 20.R. copy. Yeah. 20.R copy. Okay. So I'm adding this file. Yeah. Yeah. So once you're done. Yeah, so here's your test file, right. Yeah. So we'll just get this. And then the function's like... Yeah. You can actually... It's still like. Okay, so this is a list of the first elements of the list of the structure. So let's see if you can go and get the right element of the character. Then you can answer, like, keep this file. Okay, so it's not super key. It's not super key. It's not super key. It's not super key. So this is saying, okay, you can go in to one, and search the element. Then I go in, and search the element. Then it's going to be a C. So by doing that, I'm going to change the tool name, you're just getting this little extra step. So this is the list. This is the first, this whole thing. This whole thing is a different place. Exactly, the first thing. So this is just saying, like, pull it out of that list. We don't have more than one object. So yeah, so we're just going to pull it out and make it automatically accessible to the character. Is it this name thing? Yeah. Okay. It's on that, so once it, like, so here. So it's my understanding. Try using this one. What do you think? I expect to see what you have, and to see what's there because it's in the file that you're showing it to me. So you want that guy. I see it. Right? This is what I'm going to do. What do you think this is? This is what I'm going to do. Okay, so we just have to read it. Now let's make it a C. C1, right? C1. Yeah, C1. Oh, I see what you're telling me. Okay. So we just have to name this, too. RM. That means remove. No. So it's removing this FN object that has been defined earlier. Okay. Okay. So, great. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. Okay. She does. Does that mean that she's 10 to 12? I said yes, yeah exactly. So 10 to, I just took it out. I just really get to 10. So I got pretty bad. Because what we want is practical test. Okay, now we're going to take off this. Other than it stays 10, now 10. This one is passed into this. It's collapsed altogether. It's one string. I think it's re-signed into this test. It's a little bit. And now it's only going to split that same object. It's making a new object called B where it's cut up. Okay, so for me, that is a lot of your objects today. Yeah, so yeah, because the way the function is written, that's our memory. So I can see all of it in something else, FN2 personally. So I'm going to call it FN2. Now I have to rename this into FN2. So here, which is this. So here, right. So I'm going to call it FN2. So then if we do this. So here, I'm going to tab. So I'm going to press FN2. And FN2. So I'm going to type FN2. So it's just a name of a very good one. So I'm going to type FN2. So I'm going to type FN2. So I'm going to type FN2. So it's here. Okay. Now you can proceed. Oh, sorry. No problem. I thought you had to come in. I know. That's automatic. That's actually what was called. I think the sort where it marker in here. I mean, it knows, but it doesn't write. It knows right. It remembers the one last minute comment. I was talking to the person who just answered that. It knows. I've been looking at your stickers. Okay, so that's all good. Now, to finish this up, I'd like to have a usage example and a test here. Now, the usage example can be very simple. So all you possibly would want to do is this here. But it wouldn't actually work because there's no guarantee that at the moment that you try the function possibly under very different circumstances that the file test.fastA really still exists. So what we can do is we can create a little file and use that instead. Now, when you create a file in the usage example or in a test, there's a problem. Because if you inadvertently choose a file name of something very important and very dear to your heart that you have in your local memory, and then you run your test with a file name that accidentally has the same file name, R will do the stubborn thing and overwrite it. And that may not be good. So to guard for that, R has a mechanism to generate files that exist only once. So called temporary files through the function tempfile. So if I run the function tempfile without parameters, it gives me a long string of gibberish where it would create a file that is unique and exists only once and in only this place. So what I can do is create myself a finding. And now I can save to that file my little use case. So do something like write lines, test. The write lines arguments require the text to write and a text connection. So my file gets a file name. I write to that file name. And then I do read fast a my file. In this way, I can be sure if I try things and I experiment things, I don't accidentally overwrite precious files. How long does that file exist? It exists until you next close your R session. So it's temporary. The system cleans up when you close your R session. Temporary files and directories are removed. Write lines is exactly the opposite of read lines. So write lines takes a vector of strings and writes that to a file line by line, each element in a line. Read lines takes a file of lines and reads in a vector of strings. So read lines and write lines are exact opposites. Yeah, just like the entirety of it, like what the purpose is. Why I'm putting this in here? This is to remind myself of how to actually use the function. So what is the false indicator at the beginning? The false indicator is so that this block does not get executed when the entire script is sourced. So the way that we use this function, or the way that we load this function later on is simply to source the entire file. And then this is a utility, which is a function which then exists in our workspace. However, we can have some housekeeping functions here, like examples or tests, to make sure that the function actually works as expected. And we wouldn't want to execute them every time we load a function. And this is why we put this in a block, which kind of hides this from execution. Oh, that's a good point. Yeah. It's not working. You want to run it from up there? Yeah. Sure. Or you can just run it the same way and just do, yeah, control enter. Great. So it can't open the connection, yeah, because my file doesn't exist. So then if you run this one, though, it's fine. So this one is just like a stand-in. It's like my file is your real file. Yeah. That's why I said it can't open the connection. It can't find anything called my file. No, so that one's perfect. So you're good. Right. Yeah. Ah. So it's saving it in an object called my file. So you don't want to put it in quotes. You want to just have it. Because it's an object, a temp file object. Sorry. OK. So this is so useful. This is good to have. But now how do we make sure we actually have it every time we run an R session, or at least an R session in our project? Well, there's several ways to do this. Often I put code which I need locally in a project in a file that I call utilities or tools. And just source that whenever I start my code. But in this case, what I've done is I've constructed this little init.r function, or init.r script, so that local functions are sourced. So what I'm doing here is I'm listing all the files in the r directory. And for each script that I find in that r directory where the file name ends with .r, I source the script and I write a little bit of output that that script has been sourced. So if I take that function now and move it into the r directory, read fasta.r. I select it and I go on the more menu and move. And I choose the r folder and click on open. First of all, rstudio complains that the file that I had open for editing no longer exists because I've moved it elsewhere. So I want to close it. That's okay. Now if I look into my r directory, read fasta.r is now there. And now if I type init, which I would do whenever I open this project or I return to this project, it tells me initializing sourcing local functions from the .r directory. One of the local functions is .r. Read fasta.r. So once you have that, every time you open this project and you type init, the function is defined and is known and you can just use read fasta. Well, that's a major milestone. We've written the first function, not just ad hoc but in a quite principled way. We've debugged it. We've talked about debugging. I'm going to move ahead a little bit and actually not talk about testing right now. I'd like to get a little bit of actual work done. And we might talk about testing if we ever run into problems with our function here. So, back to our sequence analysis tasks. The task 2.4 is actually read this file and assign it to MySeq and confirm that this has worked. So, you want to define a variable MySeq and assign the result of your read fasta output of this 100,000 nucleotide file and assign this to MySeq. All right. Help me out here. What do I need to type? What would you type? Luciana. Well, you know, first of all, I want to assign something to MySeq. So, I type MySeq and assign. And then my fasta, right? It's a function, read fasta. And the function wants a file name and the file name is this here. No, read fasta is a function. Right. So, we need to say run this function on that file and assign the results to MySeq. Well, that did not generate an error. Now, how could we confirm that this has worked? Length. So, the first question could be length. So, let's see. Length. MySeq. 100,000. Is that what we expect? Is it? Not just one long string. 100,000 single characters. Exactly. That's what we wanted. Right. Okay. Are these all the characters that we were looking at? So, let's look at the first few. How do we get the first few characters? Head and tail. Yes. Head. A-G-T, A-G-A. Is that correct? Is A-G-T, A-G-A the way our file ought to start? How do we know? You don't remember. So, we'll have to look. How do we look? We just open the file. A-G-T, A-G-A. Oh, I like that. That looks very good. So, the next thing was using tail, right? Tail, MySeq. A-A-A-A-A-G-G. Is that correct? Well, more of the same here. A-A-A-A-A-A-G-G. Okay. Also, correct. Now, this actually checking your object and making sure that what you have in memory corresponds exactly to what you think you should have is good and sound practice. You should always make that a habit. In this business, data is holy. You really need to respect your data. Don't cut corners. Look at it. Confirm it. Validate it at every single opportunity. And as we go along and a little and try to work with it, you'll appreciate why this is kind of important. Wonderful. Now, we can calculate the GC contents. We can actually do some analysis with it. What GC contents would we expect? 50-50? More or less? What's human average GC? Molecular Biology 102. Well, we can calculate it. I can tell you. It's about 40%. Okay. But how do we calculate the GC contents? So we have a 100,000 character vector. How do we calculate GC contents? What would you do? Count them and divide them again. You would count them. How do we count them? Table. Hmm? Table. Table. You could match them against some kind of prototype like ATGs and Cs and see how many you get in each function where it goes through the whole sequence. Okay. We can do that. Any other way? Just separate the data with them and then count. Count? In a loop. We could do that too. Different strengths and weaknesses. Oh, let's just try them. I'll just write the code and you don't need to write the code too, but you can see what's happening. So let's first run a counting loop. You remember how a loop works in principle? So a typical for loop has four, some variable that we're iterating over and then doing something. The typical way we write an iterator like that is something like 4i in the range of 1 to length of mySeq. So length mySeq is 100,000. So this expands to 4i in 1 to 100,000. So we cycle through this loop. At the first iteration, i is 1. Second iteration, i is 2 and so on. Let's initialize this by a variable that we call na, number of a's we find in the file, and set that to zero. Then our count would work in this way that we say if mySeq i equals a, then na becomes na plus 1. So that's really the most low-level pedestrian way to do this. Now we could also have an nc, g and t and we could have a number of if and else statements here and count this all together. But just running this like that gives us 26,219 a's, right? So that's possible. Sometimes the condition that we're looking for is very involved. Of course, just comparing one character against another character is not very difficult. We can have simpler and more vectorized and more efficient ways to do this. But for example, if you want to find out whether that a has two more a's within 10 nucleotides upstream or downstream, or whether that a is in a list of positions that are highly conserved, or whatever other conditions than any of the other methods might fail and you have to revert to this very pedestrian way of just looping over a vector and counting individual occurrences. So this is possible. Now if you ever ask on Stack Overflow about this kind of code, everybody would just tear their hair and say, no, no, no, no, that's not how you write it in R. Usually it isn't, but there are special cases when that's exactly the right kind of idiom to pursue. But let's look at matching against a pattern and assessing the results. And that's kind of interesting. So let's work with a little smaller subset here, not with 100,000. I'm going to define a vector of, say, the first 20 nucleotides from MySeq. Right, so the first 20 nucleotides. And this is a vector and we can compare the vector with values. So you can say X equals A. What's the result of evaluating this expression? Well, this is a binary operator, the equals operator. It's in the same class as less than, greater than, not equal to, and so on. And the result of this binary operator of a vector and something else is always a vector of Boolean values, true or false. So 1 equals 1 is true and 1 equals 2 is false. So that's for scalars. But for vectors, the way this works is the comparison is done for every element in the vector. So true, false, false, true, false, true, false, and so on. So this is the result of that. This is a vector of Boolean values, true's and false's. And now there's kind of a trick in our idiom that we can use to count how many truths we have in there. And that depends on casting data types from logical values to numeric values. If we cast true to numeric, we get 1. If we cast false to numeric, we get 0. So by treating this string as a numeric string instead of a Boolean string, it gets converted into a sequence of 1s and 0s. And some functions in R do such kind of conversions implicitly and one of these functions is sum. So sum wants to sum over numbers. So it takes whatever it gets and tries to convert that into numbers. If we give it Booleans, it converts the Booleans silently to numeric values and then sums over 1s and 0s. The end effect of this is that the result of sum of x equals a, or that Boolean vector, is the number of truths in that sequence of logicals. So once again, x is equal to a, had 1, 2, 3, 4, 5 truths. And if we sum over these 5 truths in a longer vector, the result is 5. Now, this is one of the rare cases where something is quite idiomatic. I don't think you could write this in the same way explicitly in Python or in some other language. In R, you can do this. But it's so concise and so useful and used so often that this is one of the things you just learn to recognize and understand. If you see sum over sum evaluation of a conditional, you know what you're actually doing is you're counting the number of truths, right? This in principle gives us the GC contents. In this case, it just happens to be 0.5. That's just how it turns out to be. So could we do the same thing for MySeq? Sure. So the result of the GC contents here in this case is 0.47681. Now having to write this separately for C and G is a little bit awkward. So I'm going to use the same approach, but with a different function, which is called grep, which is one of the most useful functions in your R toolkit. Grep, I think it's an portmanteau for group representation. It asks for a pattern and some X it works on. So if I grep for Cs in my X, I get two numbers, and that's the position 18 and 20. So this means that element 18 and element 20 in my vector are C. Now the cool thing about the pattern is this can actually be a regular expression. So we can put fancy things in there. So for example, a regular expression for saying give me either G or C is written in this way. This is called a character class expression. Everything that is within these square brackets is matched. So what this says is show me the positions of something that is either C or G in my vector X. And this is this. Now we could either use length of this vector for counting, or we can use a variant of grep, which does the same thing, but returns booleans, which is grep L. Grep logical. Grep logical gives me again a vector of trues and falses, and I can sum over that to give me the expected result. Right? Applying this to mySeq grep logical Cg in mySeq divided by the length of mySeq gives me 0.47681 as above. Okay. So we have a strategy of using counting by explicit loops. We have a strategy of comparing in a vectorized expression with either grep or with equal sign, and summing over the results. But someone mentioned... What I think is probably the preferred way to work with this and that's the function of table. Now table goes through a vector, finds everything that's unique in that vector, and then tells us how often this unique thing occurred in the vector. And that's something that is supremely useful. So let's have a quick look. Table X. So table X, we'll analyze this in a little more detail in a moment, essentially tells us we have five As, two Cs, eight Gs, and five Ts. Now that's kind of nice to look at it, but of course we want to work with it. So we need to understand a little better what this table object actually is that gets returned. So let's assign the result to a variable and let's look at the structure. So in principle, this is a named vector. It's an object of class table, which is internally recognized to be organized in a particular way. The values are integers, and each of the integers, i.e. elements in the vector, have a name attached to it. This is why it's a named vector and the names are A, C, G, and T in that order. So since these are integers, I can do things like summing over them, which is 20, telling me what the most frequently observed character is, the maximum, which is eight, or the minimum, which is two. So I can do computations with this, but I can also pull out the individual values that I'm looking for. And this is because these are names of elements, and that is one of the fundamental ways in which we can access our objects. Remember, we can access them with indices. A positive index will return us an element of one thing. A range will return us many elements. A negative element will remove that from the output and so on. We can access vectors with logicals for every position of true that vector is returned, or we can access vectors by their names or row names or column names if it's a more dimensional object. So my tab A returns the element from the table output that is named A, and that is five. My tab of a vector of names, which are A and T, gives us two results here. A and T are both five. My tab G and C gives us eight and two. Note that this is not the same order in which they existed in the table, like I'm getting G first and then C. Whatever the row names are, if I specify them, I'm getting them back here. So it doesn't have to be unique. This is one of the ways of sub-setting a vector with row names or names. So to write an expression to calculate GC values with a table, I could do something like tableMySeq, then subset that result to the elements named G and C, then sum over the values, then divide them by the length. So tableMySeq is this. Subsetting the table is only these two values. Summing over these two values is... Let me just mention this because it just appeared. I didn't select the closing bracket when I executed this, and now my console has this plus sign. That's probably happened to you from time to time. So whenever I hit return, I get only a plus sign, and it's waiting for me to do something. In this case, I know that I need to close a round bracket, but sometimes I'm executing a complicated function. I don't know if it's a round bracket that's missing or a square bracket or a curly brace or whatever. So to get out of this, I just hit escape. So the escape character puts me back to the command line. So let's select it properly. 47,681, which of course divided by the length is the same result. There are different strategies of calculating GC coordinates. One explicitly, slowly, in a very pedestrian way, but also in a way that allows us to be the most flexible in selecting the situation that we actually want to count by iterating over loops and evaluating some condition within the loop. Number two is comparing against the pattern, and number three is using the table function. The table function would be the one that might be most useful in validating our assumptions over the file. For example, if we simply count how often we have a G and C uppercase, we might not notice if our sequence file contains lowercase characters, and we might erroneously not count them. But if we use the table function, we might immediately see that we don't get four categories, but we get eight categories. In this case, since we've done table and we've seen A, C, G, and T, and these are the only ones, this means there is no other character in our file. Incidentally, that's a decent way to validate that your input file only has the expected characters. 0.47, is this value expected? What's the human average? I didn't know that. I googled that. When I designed this unit, what do you think? What does Google tell you? What's the expected GC context in human genome? What do we know about GC context? Come here. What do we know about GC context in genomes? It's... I don't know what you mean by accuracy, it's your primary C. In terms of average amount, I think maybe 48, 49 percent. I think it's a little bit less than... What does C mean, C? We're not sure. We're not sure. Because most of the sequencing data hides the... to American-Semitic regions where GC content is high. Oh, we have regions where GC content is high. So it's not variable throughout. We have tracks where it's high and tracks where it's low. So it's not randomly distributed. This is why it's actually not so easy to find an answer, because, you know, if you Google for it the answer is it kind of depends on where you're looking. I vaguely remember after spending more time than that. What, too? But in an interesting way, I came up with an idea that it should be, on average, which is less than what we have here. But... that there are regions where GC content is much higher and regions where GC content is much lower. So that's kind of interesting. Can we... What's the situation here? Can we find out what our local GC content looks like? We should, right? We have the data here so we can analyze it. Let's look at our local GC contents. So GC contents as a function of position in our nucleotides. All right. That requires strategy. The task is find and display GC contents as a function of position in these 100,000 nucleotides. How do we define this step by step? You want to start by breaking the entire block. Okay. Let's write that down here. So break sequence into blocks. And then we go and have a beer. You know, coding R, defining coding R as a drinking game would actually be a lot of fun. Because you have to have a lot of steps. I once witnessed people using a drinking game for the Bachelorette. Has anybody ever seen the Bachelorette? So they had to drink whenever one of them said love and it was like insane. Like, it's never meant, but they used that word like millions of times. The worst was when they then decided they go with like. So breaking sequence into blocks. We don't get to drink then. We need to do something more. Break the sequence into blocks. Then, and then. Right now, we've only calculated GC. We also need to store the result. And then. Okay. Histogram. What does the histogram show you? For each block. So you would do 100 or whatever the resolution is. A thousand histograms. Oh no, I guess I meant like if you spread it across each block would represent like one value. So you would then do a histogram and that would give you the distribution of GC values in the blocks. Well that's, you know. No, no, no. We'll have a look at that. But it wouldn't be quite what we were looking for. But it's also an interesting way to analyze the data. So we'll definitely do that. No, what I would do is block the value by block position. So for the first block, the second block, the third block and so on. So let's just remember the histograms. So can you do this? Of course you can. You have a little duck to help you. Okay. Now, we can actually break the sequence into blocks by subsetting. You know, 1 to 100, 101 to 200, 201 to 300 and so on and subset it that way. We don't need to do that because it's all just copying the same sequence data into different positions. So we can just keep the sequence in place and take the histogram for a range from, you know, one position to another position to another position. The advantage for keeping the histogram in place and just adjusting our positions is in principle we could have overlapping regions rather than, you know, just splitting it up. We could say a window of 2000 values centered on the middle and doing that in, you know, 500 base pair increments. Something like that. That would be a kind of a smooth average. But in order to do any of that I would start out by defining myself a vector of indices. And for that something that's really useful is a function called sequence. Now in our world it's a bit unfortunate that our so useful variable name sequence is already occupied by an r function which might lead to complications. But just remember sequence in this context means give me a sequence of values and it's very versatile. Sequence is also one of the things you need again and again and again. So there's two, well, several principle ways of using sequence. So we could have a sequence of 1 to 10 in increments. Oh well, let's go slow. Sequence of 1 to... What? Sorry. What am I doing? So of 1 to 10 just gives me 10 values. Well, you'd say, well, you know, what's good about that? We can just use the colon operator and get the same thing. Sequence is more versatile though. So I can say seek 1 to 10 by equals 3. And that increments by 3 every single time. 1 plus 3 is 4 4 plus 3 is 7 plus 3 is 10. Or if by equals 4 we have 159. So that guarantees that the last value is always still contained in the sequence. So the last value here is not 13 but 9. The maximum value is 10. But we can also use smaller values. So in this case I will get floating point values of whatever. A second very very useful argument to this is length out. And that tells me if we have a sequence of equally spaced numbers what should the output length be? So for example if we say length out is 7 we get 1, 2, 3, 4, 5, 6, 7 numbers where the minimum is 1 the maximum is 10 and the values in between are equally spaced by 1.5. Or if it's 13 we want we get some different odd values somewhere in between. So if we want a vector of indices that we want to compute with we could write a sequence well something like this index to start with now how many blocks would we want? What's useful? 100? 100 sounds good 200? 300? 1000? We might want to play around with it, right? So let's you know to do this well is let's write this in a way where we don't really need to commit ourselves we just assign a value of n and then we don't hard code numbers but we make everything so that it's relative to our n. Okay, so the first index is 1 the last index is the length of my seek equals n. So we get blocks of n plus n plus 100 and 100 again and hang on, no you said length out is, sorry, okay so this gives us 100 blocks of that size. Now there's a thing about that which you can probably appreciate and that is if our length out is such that what sequence returns are not integers, but floating point values, we'll get an error when we try to use them as indices into our vector. So we need to do some kind of rounding. There are three common rounding functions in R, one is called round and that rounds in the way that you would expect. One is called floor which simply takes the integer portion and drops everything after the decimal and one is called ceiling which does the same thing but adds one to the result. So when we do this kind of thing we want to make sure that we don't actually exceed our our blocks so we'll take the floor rounding in this case. So with 100 it wouldn't apply because we do get integers, but if we have 311 that might be different. Alright now we have 100 blocks as indexes so yeah, let's write a loop so I need my I to return each of these values in sequence. So since idx is a vector I can simply use that vector here. I don't need to iterate over the length of it and then pull out and subset it one by one. I can simply place that and make sure that that's what I want. But there's a problem here because I can immediately appreciate that I can write something that goes from the first to the second, from the second to the third and so on. But how do we know that we've left over at the end? So one way to do this is to actually not iterate over idx but iterate only over the first values of it. Yeah, let's do this explicitly. Let's use these in principle but in a somewhat different way. Let's define two index vectors call one I from and say this is idx but not the last element. The second one is I2 and this is idx but not the first element, right? And then I can simply define my block as going from I from to I2 and they're all contained in my sequence. I think that's a very nice explicit way to do it. So now we go for I in I from because we need to apply it to both. My sequence is my seek I So this is from 1 to 99 for I in every element of this my sequence is this range here. So let's see if this is kind of correct. Let's say I is whatever 7 then the range of I from I is a thousand characters here and subsetting my sequence is a large number of characters. Good. Now we need to calculate the GC of S and I'm using one of the variants here but I need to store the result. So how do I store the result? Well, I initialize a vector of results is a vector of numbers of this length. So when we run counting loops or analysis loops and store the results somewhere we should try to be really, really careful not to grow our results list dynamically. So what I could do is I can just initialize my results as a vector and then I can say position one, assign this to position two assign this to position three and the vector grows dynamically as I assign to positions that the vector doesn't even have in the beginning. So a small vector goes into a large vector until all my results are done. But this is one of the computationally most expensive things that you can do in R because it requires a lot of memory management. So when R generates a variable it allocates a little bit of the computer memory for that variable. When that variable then dynamically grows it has to allocate enough for the larger by negotiating with the operating system to allocate space that we can fit the larger object into. Then we have to copy the old results to the new space then we have to add our new result and then we have to free for the old result because we don't need that space anymore. So there's a lot of copying and negotiation going on and this makes the process convenient but supremely slow. So whenever you can avoid that code gets to be like at least 10 times faster. And this thing of dynamically growing results is the primary reason of why sometimes code is too slow to be working with especially when you're doing genome scale analysis where you want to look at something 100 times. Don't grow the results. If at all possible figure out before you run the analysis how large the result is going to be and then pre-allocate the space. So our result is going to be as long as the vector I from because that's how we define our loop. We do something this many number of times and for each of the times we need one spot in our results vector. So that length we need numeric values of that length and we just call them GC profile. So this is now a vector of 99 elements and each of the elements is 0. And as we now calculate for I in one length I from and so on we store the result and that is simply GC profile is this number here. So I remove the number here and save it. Okay, once again for each starting block we define the sequence we sum over the GC contents we divide by the length and we store this in GC profile. So that's reasonably fast 99 values here so let's plot them plot GC profile. Does that look correct? No. Right? That doesn't look correct. I think I need a rubber duck. Okay. Well the indices look right. These values look right. Is there a histogram command? But that doesn't look right. Why do they end with 100,000? Let me redo this just to be sure. So IDX is a vector 100 elements of here's my mistake. Length X, I don't know what X was but it's not what I want. I need length IDX. I want to remove the last element. That makes a difference. And I2 then and I can immediately see what happens. So X had the value of 20 so it started working until here and then I removed the value and that made our I from and I2 to be the same value so that returns just a single nucleotide which is either GC which then gives us for that single nucleotide a value of 1 or which is not GC which gives us for that single nucleotide a value of 0. So this is why it was iterating. Okay, so here was my mistake. So we need IDX minus length IDX of course and IDX minus 1 GC profile run it again plot the GC profile. There we go. So note what this means. This means that we have regions in there where our GC contents approaches 1 over like a thousand nucleotides. Is that actually true? But let's check that. So how do we find the coordinates that this maximum value here corresponds to? What would you do? How do you find the coordinates for this here? Go back to the vector and actually pull out the nucleotides. So where is this? So GC profile the maximum value is 0.778 in GC profile almost 80%. Now one thing that we often need is we need to look into a vector and identify where is that largest value. So in principle we look at a logical vector GC profile equals to the maximum which gives us a vector of false's. There's an expression, there's a function here which given a vector of true's and false's tells us where in the vector they are. And that's which. So this is an idiom we use a lot. Which GC profile equals max GC profile or which GC profile is min GC profile or something like that. So which identifies the position and the vector of logicals. And that is position 76. So all we need to do is we need to retrieve the nucleotides from position 76 to position 76 that vector that will give us 1000 nucleotides to make things a little bit more visible I'll use the paste collapse thing where we go. So note that we don't actually have a lot of CGs we have stretches of Cs and stretches of Gs so the GC content is very high but in terms of dinucleotides that's going to be interesting. Okay so what we've done is we have prepared the 100,000 nucleotides that we've downloaded from ensemble in a way that we can start analyzing things and we've run a few simple analysis to start looking and quantitatively looking at features of our data here. One of the features was looking at GC contents and we've calculated that globally as a value over our nucleotides and we've also looked at how do we iterate and then get positional information and we've seen that the positional information of GC contents is not random but structured. There are stretches of DNA sequence that have significantly higher GC contents than the surrounding regions. These as you know are often regulatory regions regions that are important for regulatory regions. Okay good time for a coffee break after that we'll look at the dinucleotide frequencies. So I think I'm going to take this little code block and upload it on the project so you can download it after the coffee break for reference.