I hope you've been able to wrap your mind around how this subsetting works. I've seen some people doing exactly what I told them to do, googling for subsetting, and being led astray as a result. There is a subset() function in R, but it works slightly differently, and the help file for subset() actually tells you explicitly that it's a convenience function that's probably not good to use when you're doing actual programming work. To use subset() correctly, you need to understand how subsetting works in principle anyway, and if you understand that, you might as well just write it out the way we're doing it here. So I would not recommend using the subset() function. Similarly, there's a filter() function, but that's something completely different that applies to time series. When I talk about subsetting (and I sometimes use the word informally) I usually mean taking a defined slice out of a data frame, like the first 10 rows or the last two columns. When I talk about filtering, I usually mean taking rows, columns, or values out of a data frame or other object depending on a particular condition: values that are greater than 10,000, say, or columns which have missing data, whatever the condition may be, and these conditions can be formulated very flexibly. Using subsetting and filtering ideas is one of the things we really need to do every single day. One more thing: R has thousands of functions, and to remain productive it's important to restrict ourselves to a core set of functions and not dilute our knowledge across too many different paradigms. Try to stick with one paradigm and one principle. This is why the code we write here certainly seems a bit pedestrian at times, but I'm deliberately repeating things more often rather than introducing new functions all the time.
Trust me, I could introduce a new function every two minutes and not run out. So, as I said, the canonical solution here is to use filtering, or subsetting, by a logical comparison operator. We take the transcript type column, which, being a column of a data frame, is a vector, in this case a vector of strings that describe what the transcript types are. We compare that element by element to the string "protein_coding", which we determined above is part of the contents of that column and is what we need. Back we get a logical vector, which I usually call sel, with as many elements as the data frame has rows, and so I can use it to pull out only the rows that I want, simply by saying: GNAS transcripts, use the rows that are indicated by this logical vector, and use all the columns. The expression sum(sel) is not strictly necessary, but it's a way for me to confirm along the way how many TRUE values I have and whether that's what I expect. Since this is so important, we'll do a little more of the same and analyze the data a little further. How many transcripts do we have? What are the transcript IDs? Then let's restrict the rows to contain only transcripts with Ensembl IDs, and how many transcripts are left? How many transcripts? Regarding the selection: mine has 1,420 elements and yours has 866. The difference is that I did my selection on the original data frame as I read it from the file. You must at some point have already restricted yours to the 866 rows, and that's why your data frame only has 866 rows left. I assume that if you go up and look into your GNAS transcripts data frame, you will see that it has 866 observations, not 1,420, and that's what you get when you apply the selection again. My selection has 1,420 elements because it was derived from the original, full data frame.
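The idiom just described can be sketched in a few lines. The object and column names here (GNAStranscripts, transcript_biotype) are assumptions, and the three rows are mock data standing in for the real BioMart download; the pattern of sel, sum(sel), and the bracket subsetting is the point.

```r
# A sketch of the subsetting idiom. Mock data frame; names are assumptions.
GNAStranscripts <- data.frame(
  transcript_id      = c("ENST00000371085", "NM_000516", "ENST00000371100"),
  transcript_biotype = c("protein_coding", "protein_coding", "retained_intron"),
  stringsAsFactors   = FALSE)

# Element-wise comparison gives a logical vector: one element per row.
sel <- GNAStranscripts$transcript_biotype == "protein_coding"

sum(sel)    # quick sanity check: counts the TRUE values (here, 2)

# The logical vector pulls out exactly the rows we want; all columns kept.
GNAStranscripts <- GNAStranscripts[sel, ]
nrow(GNAStranscripts)    # 2
```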
Now, if I do the same thing again, my vector has 866 elements, and of course they're all TRUE. So how many transcripts do we have? How many transcripts are still in our table? 866. Every single row corresponds to one transcript. What are the transcript IDs? Where do we find them? The transcript ID column contains them. Now the question is: how many different transcripts do we have anyway? Are all of them Ensembl transcripts, or are some of them other transcripts? So how do we work with that? What are the transcript IDs? I heard a word here which I'd like to hear again. Say that again. Unique, right. We've used the unique() function before, so we can use it on the vector of transcript IDs. This will give us all the IDs we have, and the length of the result will tell us how many different ones there are. Okay, so you can write that; no big mystery. We'll assign this to some discardable variable so we can look at it in more detail. But then we use the function unique(), and what goes in here? What's the argument that unique() expects? Hmm? It would work on a file name, but it wouldn't give you the result that you want. The name of the data frame? I wouldn't even know what it does with that, because the data frame has many columns. Okay, any suggestions? What do we write instead of our question marks here? GNAStranscripts$transcript_id. Okay, so the 866 transcripts which are annotated in the table map to 99 distinct transcript IDs. So what are these things? This here, ENST-something, is an Ensembl transcript, but we also see transcripts of this other type here. What's this, NM_something or XM_something? Controls? Controls? No. These are specific types of accession numbers. Very recognizable. NCBI. NCBI accession numbers, exactly. More particularly, these are RefSeq accession numbers. So this is from the RefSeq database, the preferred database for working with NCBI sequence data.
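The unique() step just described, sketched on a mock ID vector; in the session this would be the transcript ID column of the transcripts data frame.

```r
# Mock transcript IDs: one duplicate, one RefSeq-style ID.
ids <- c("ENST00000371085", "NM_000516", "ENST00000371085", "XM_005260673")

x <- unique(ids)   # a discardable variable, so we can inspect the result
x                  # the distinct IDs
length(x)          # 3 distinct IDs among 4 elements
```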
So NM means mRNA, which is in the nucleotide database; that's what the N stands for. XM are also mRNA data, but inferred from computed gene models. NM entries are supposed to be actually observed; XM entries are inferred. XMs are often called hypothetical genes, XP, XM. There's really nothing hypothetical about them in the sense that they still do exist. It kind of always freaks my students out to see that: can we even use that? That's just a hypothetical gene. Well, the hypothesis in this case is that it actually does exist, and the computational evidence is usually very good for that. So that's the difference. But of course this means these entries are likely to be redundant, because some of them have an NCBI accession number and some have an Ensembl accession number, but they might code for exactly the same transcript. So let's restrict ourselves to only the transcripts that begin with ENST. Yeah? What about the ENSE ones? That's an Ensembl EST, Expressed Sequence Tag, I believe. Okay. So we do another subsetting, or filtering, here, restricting ourselves to contain only Ensembl transcripts, i.e. strings that begin with ENST. Now we need to change our strategy, because we can't use the equality operator anymore. Exclamation mark equals means not equal, right? But here we actually need a slightly more sophisticated comparison with the string. Regular expression. Yeah, right. So we need a little tool that uses a regular expression, something that we've encountered before; I believe it's grep. And we need to figure out how to tell grep to find us patterns that begin with ENST. You can use either grep() or grepl(). We can look at the difference between the two, but both will work here. So without giving anything away, the approach is exactly the same: we define a selection and then we do the same thing as before. But of course the selection needs to be something smarter than the simple logical comparison; we need something smarter. How would that work? Wild cards? And unique? Unique.
Unique wild cards? I don't know, that sounds like something we would have done with grep, not wild cards. So what's a selection that will find only Ensembl transcripts? What would you do with ENST? The function that we've looked at, which allows you to do that, is grep. So to look at that, we can do something like the following: we make a very small mini example, a little vector which contains two of the IDs we want and one ID which we don't want. And now we can use grep. grep() wants a pattern, some vector x, and then other parameters which don't concern us at the moment. So the challenge now is to find a pattern, a regular expression, that will give us what we want. Any proposals for such a pattern? Well, I would propose that the pattern is just the four letters ENST, which is slightly different from what we used last time, when we had Gs and Cs. This is the actual substring ENST. If we put this in square brackets, it would instead mean either an E or an N or an S or a T, so all of these strings would be matched as well. The square brackets are options; we call that a character class. If any one of the four characters is found anywhere in the string, the regular expression says this is a match. But here we're actually looking for the explicit four-character substring ENST. No asterisks? No, no, regular expressions don't use asterisks as wild cards; that's a different syntax. So ENST actually means the literal string ENST; NM means the literal string NM, or NM underscore perhaps. Okay. So what does that tell us? The output of this function is now one and two. How does that relate to our question? Why does it return one and two? What do one and two mean? Those are the positions. Exactly, those are the positions in the vector. It means it found the pattern in the string that's in position one of the vector and in the string that's in position two of the vector.
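The mini example just described might look like this; the IDs are made up, but the shape is the one from the lecture: two IDs we want, one we don't.

```r
# Two Ensembl-style IDs we want, one RefSeq-style ID we don't.
x <- c("ENST00000371085", "ENST00000371100", "NM_000516")

# "ENST" means the literal four-character substring ...
grep("ENST", x)      # 1 2 : positions of the matching elements

# ... whereas "[ENST]" is a character class: any one of E, N, S or T,
# anywhere in the string, so it matches all three elements here.
grep("[ENST]", x)    # 1 2 3
```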
There's a variant of grep that, for our purposes of subsetting, we can use in exactly the same way, and that is grepl(), where the l stands for logical. If we do grepl() instead of grep(), the result is TRUE, TRUE, FALSE. Now, there's a refinement which we could use here, and I'll just mention it in passing. There are two special characters we can use within a regular expression. One is the caret, the little upward-pointing angle, which indicates the beginning of the string. So if I specify my pattern this way, it means the pattern must appear at the beginning of the string; if we have ABC followed by ENST, it will not be a match. The cognate for the end of the string is the dollar sign: ENST-dollar means the pattern must appear at the end of the string, and if it's not at the end, it's not a match. Now, the way our data is structured here, there's no reason to believe that the literal string ENST would appear anywhere but at the beginning of the string. But we don't have to trust our input data; we can just specify that we actually do want it at the beginning. And we could do more, if we want to validate that this is really an Ensembl ID: we could test that it is followed by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, exactly 11 digits. So I'm just going to briefly show you how we do that. We introduce a character class again; we can put a range into the character class, so [0-9] means the numerals 0 to 9. And after it we can put a repetition operator in braces, which takes one number or two numbers; if it's two numbers, it means minimally this many, maximally that many. So this says: a digit, repeated exactly 11 times, and nothing else in the string. This is a full regular expression that actually parses the specification of what an Ensembl transcript ID looks like. For our purposes this is overkill, but just to make sure that I'm correct: it does give us the same result of 1 and 2.
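The anchored patterns, sketched on a small mock vector; the "exactly 11 digits" rule is the lecture's assumption about Ensembl transcript IDs, so verify it against your own data.

```r
# Two well-formed IDs, one with a prefix, one RefSeq-style ID.
x <- c("ENST00000371085", "ENST00000371100", "ABCENST0000", "NM_000516")

grepl("ENST", x)               # TRUE TRUE TRUE FALSE : substring anywhere
grepl("^ENST", x)              # TRUE TRUE FALSE FALSE : anchored at the start
grepl("^ENST[0-9]{11}$", x)    # TRUE TRUE FALSE FALSE : start, 11 digits, end
```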
So to read this again: beginning of the string, the four letters capital E, N, S, T, followed by any digit, repeated 11 times, and nothing else in the string, i.e. the string ends at that point. If I need the repetition to apply to more than just a single character, for example if I wanted a T plus a digit repeated 11 times, I just put that part in parentheses; the repetition operator can attach to parentheses, which group things together. Well, I'll just leave the expression like this. It is overkill, but it is also correct, so why not? I hope I've explained what these somewhat arcane characters mean, so you don't feel frightened. Trust me, regular expressions can be difficult to read, but we'll try to keep things simple, and I think this one is still rather straightforward. Okay, so we've developed this. Now, of course, we want to apply it to GNAStranscripts$transcript_id and see what we get. We get 430 of the 866 rows that carry Ensembl IDs. That sounds about right. To verify the result, to make sure nothing that is not an Ensembl ID has slipped in, we can use our selection. By the way, this is now not a logical vector; it's an index vector: one, two, three, four, five, six, seven, and so on. We can apply the selection to the vector itself and pull out all the matches, then use unique(), and this verifies the result: we have 58 different Ensembl transcript IDs in that table. 58. Did we see that number before, about this gene? This gene has 58 transcripts. Oh my God, look at that. Okay. Well, then we still need to subset our data frame, and that's simply the expression: GNAStranscripts gets GNAStranscripts with only the rows that are identified in the vector sel. Okay. If we're all at the same point now, our GNAS transcript table has 430 observations of 18 variables.
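The verification step, sketched on mock data: grep() returns an index vector rather than a logical one, but it subsets just the same.

```r
# Mock IDs: three Ensembl-style entries (one duplicated) and one RefSeq ID.
ids <- c("ENST00000371085", "NM_000516", "ENST00000371100", "ENST00000371085")

sel <- grep("^ENST[0-9]{11}$", ids)
sel                        # 1 3 4 : an index vector, not a logical one
ids[sel]                   # only the Ensembl IDs
length(unique(ids[sel]))   # 2 distinct Ensembl transcript IDs in this mock
```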
And if you've confirmed this with unique(), you should have 58 different transcripts left. Now, we'll skip task 2.4, or I can just mention what you would do in principle: to calculate transcript lengths, you would take the vector of end coordinates, subtract the vector of start coordinates, and put the result somewhere. Actually, you know, we can just have a look at that; it's fun to look at data. With 10 seconds of coding and some very terse commands, I can get a histogram of transcript lengths. Most of the transcripts are quite short, but some are long, on the order of 2,600 base pairs. Now, let's find some alternative descriptions and integrate some data. There's a very well-done database; unfortunately, its data is from 2014, so it's now a little over four years old. And in this business of looking at mutation data from large cohorts, based on RNA-seq and sequencing experiments, four years is like forever. The people at Intogen have promised me that they're working on an update, so I hope something interesting is going to come out of this; it's apparently taking much longer than they thought. Some of the reasons for that may be technical: it's really not so easy to work with the current scale of genome sequencing and genome variation data. We are now working on literally tens of thousands of different disease-related genomes that are being analyzed. Funding is also a problem in the world of databases. There's a terrible issue there: you can usually get research funding relatively easily to build a database, but it's superbly hard to get infrastructure grants that allow you to maintain your database and keep it going into the future. So if you ever find yourself in the position of wishing to build a database, make sure that you build it in a way that makes it easy to update, ideally with a single keystroke.
If you have to have somebody curating it, working on it, and keeping it alive and maintained over years and years for it to remain relevant, don't bother; you're not going to get the funding. So many millions of research dollars have been lost because infrastructure has been built that hasn't been kept up. I could tell you stories about the BIND database in Toronto, but they're not very uplifting. So, Intogen. The data is still fresh enough to be useful, and I hope it's going to be updated. So have a look: find the GNAS page on the Intogen Cancer Driver Gene website and explore the page. What do we find there? Intogen Cancer Driver Genes. Intogen, the Mutational Cancer Driver Database. So have a look at what you find on that page, orient yourself a little bit, and find the GNAS gene. If I type GNAS, it gives me two choices. One is ENSG0087460; we might recognize this as the identifier that we've already used. The other one is called GNAS-AS1. What's an AS1? Antisense. Exactly. And if we click on the first, we get GNAS. It's annotated as a driver in one cancer type, colorectal adenocarcinoma. It also has certain mutation frequencies in other cancer types: it's observed in bladder carcinoma, small cell lung carcinoma, adenosquamous carcinoma, so these are epithelial carcinomas. And we get plots and tables that we can look at. This one is especially interesting: a lollipop plot of the mutation distribution. This is the mutation distribution along one particular Ensembl transcript, showing how many mutations of the different types we have. We have missense mutations in yellow, synonymous mutations in blue, truncating mutations in red, and splice site mutations in orange. Now, just from looking at that plot, do you think we're looking at a cancer driver gene here? This is only one transcript, yes; the plot is annotated to one transcript, but the actual data is annotated to more transcripts.
Do you think this looks like the mutation signature that you would expect in a cancer driver gene? Or is it the mutation signature that we could see simply from random mutations or, you know, rare population variants or whatever? How would you know? What kind of mutations would you expect for a cancer driver, or for anything you might be working with that is gene-associated or phenotype-associated, the kind of stuff that interests you in your daily life? What would you expect? How can we distinguish sense, nonsense, missense? Right. If the mutations we see there have any effect, we would expect them to be non-random in some sense; i.e. we have an observation bias, because mutations with effects would either appear preferentially in whatever disease state we're looking at, or be depleted from it. That means the number of missense mutations and the number of silent mutations need to be in a certain ratio. Now, if we simply pepper mutations randomly into a gene, how many silent mutations would you expect? What percentage of mutations would be silent? How do you even approach that question? I'd look at how many codons there are and what proportion of changes in each codon is synonymous. Exactly, right. So, you know, without doing the full calculation, and incidentally using a genetic code table, if you want to practice your coding skills, you can do this: randomly select codons, change them by one nucleotide, and find out whether that changes the amino acid. Do that 10,000 times and count how many silent and non-silent mutations you have. In my introductory bioinformatics course, this is actually one of the assignments: take the nucleotide sequence of a gene and tell me exactly what the expected background of silent mutations for that particular gene is.
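The little coding exercise just described can be sketched in base R. The genetic code table is built from the standard 64-letter amino acid string (codons in TCAG order, first base slowest-varying); everything else follows the recipe: pick a random sense codon, mutate one position, check whether the amino acid changes.

```r
set.seed(112358)

# Standard genetic code: codons in TCAG order, paired with the canonical
# one-letter amino acid string ("*" marks the stop codons).
bases  <- c("T", "C", "A", "G")
codons <- paste0(rep(bases, each = 16),
                 rep(rep(bases, each = 4), times = 4),
                 rep(bases, times = 16))
aa <- strsplit(paste0("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR",
                      "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"), "")[[1]]
geneticCode <- setNames(aa, codons)
sense <- codons[aa != "*"]          # start only from sense codons

N <- 10000
silent <- 0
for (i in seq_len(N)) {
  codon <- sample(sense, 1)                         # random sense codon
  pos   <- sample(1:3, 1)                           # random codon position
  new   <- sample(setdiff(bases, substr(codon, pos, pos)), 1)
  mutant <- codon
  substr(mutant, pos, pos) <- new                   # introduce the point mutation
  if (geneticCode[[mutant]] == geneticCode[[codon]]) {
    silent <- silent + 1                            # same amino acid: synonymous
  }
}
silent / N   # fraction of silent mutations, roughly in the 20-25% range
```

Note that this samples codons uniformly; for the per-gene assignment mentioned above you would sample the codons of that gene's actual sequence instead.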
So, as a ballpark estimate, we'll say something like: most silent mutations occur in the third codon position only. We have 64 codons that code for 20 amino acids. Anything that mutates in the first position will change the amino acid; anything in the second position will change it too. So two-thirds of random mutations are already certainly going to be missense mutations. And in the third position, we have a redundancy of, say, approximately a factor of three, so something like two-thirds of third-position mutations are going to be silent. That gives a silent fraction of about two-thirds of one-third, which is two-ninths, which is on the order of 20%. So that's my back-of-the-napkin calculation, done at a nice cozy table at Tim Hortons: about 20% silent mutations. I think the actual number is pretty close to that. So, how many do we have here? 8%. That is, we observe 92% non-silent mutations, but we would expect only about 80% non-silent. And if we want to ask whether this is statistically different, we would need to do our usual statistical analysis and test whether this is significant. But there's another factor that is pretty much a giveaway that something odd is happening here, and that is that some of the sites are mutated more than once. This is something I would expect only very rarely from random mutation: with a random mutation profile, by chance, I might hit something twice or three times. But in this case, I have nine missense mutations at amino acid position 186. And this is kind of a dead giveaway that something significant is happening there. There's a very high chance, if there is a mutation at position 186, that this mutation appears in one of the cancer genomes that were sequenced and that make up this database. So there's a strong selection effect here.
And that's actually quite interesting, because not only does this tell me that this is likely to be a driver gene, and this is indeed one of the ways to find driver genes from mutation information, it also tells me that this is probably a gain-of-function driver gene. There are so many ways to lose function that loss-of-function mutation effects are pretty much distributed over the gene; but there are only very few ways to gain function, and these gain-of-function mutations seem to be important here. So this is a hypothesis we could follow up, once we have this data, to find out what is happening there. Now, in case, like most of you, you're not working with human cancer driver genes at home but with something else: this is still a useful plot to have. And the target of our little project here is to build code to do something like this at home. I.e., we want to take a list of mutations, or a list of variants, or wherever that comes from, associate it with the protein coding sequence, which we can infer from the nucleotide data we pulled from Ensembl, and then plot it in this fashion as a lollipop plot. To do that, the first thing we want to do is download this table here. Now, you can download the table directly from Intogen, but you need to register. This is one way for Intogen to make sure they know how many people are using their database, and this is superbly important for them, because they take that number to the funding agencies and say: you know, 2,500 researchers from 730 distinct sites have used our data and downloaded various genome information. That may get them a grant, it may not, but at least it's important information for them. To the best of my knowledge, Intogen is in Barcelona; also to the best of my knowledge, they don't sell your email address to fundraisers for Catalan independence.
So you can register and download the data yourself, but you can also just use the data which I have downloaded, which is in the data folder. And it should be in the data folder. Why is it not in the data folder? Did I put it somewhere else? Hang on, where's my data? You think it's in there? Why don't I see it? So, this is not it. This is not it. Oh, yes: distribution data. This is what it is. All right, it's the distribution data; they call it mutation distributions. So this has samples and transcripts and chromosome IDs, the most severe effect, i.e. missense or synonymous or stop-gained variants, what strand it's on, what the mutation is, what the genome coordinate is, and so on and so on. This is what we'll be working with. So download the data, or use the data that I've downloaded, and we'll read it into R, into a data frame called GNASmutations. We've done things like that a few times now. Our data frame is going to be called GNASmutations, and how do we read it? What's the R command we use? read.table? Have we used read.table before? It's a tab-delimited file, so we can't use read.csv; read.table is slightly different. I would use, exactly, read.delim. Now, by default, read.delim uses sep = "\t", which is a tab. We could also use the sep argument to read commas, or spaces, or other separator characters. But here it's "\t". Again, there is a version read.delim and a version read.delim2, which differ in whether they treat the period or the comma as the decimal separator; other than that, they work the same way. So, what's the next thing I need to enter here? My file name. And then the usual thing: header = TRUE. Does it have a header? Do we have headers? Yes, we have a header, and the header looks okay, so we just leave header = TRUE. Anything else we need to add here?
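The call we're assembling can be sketched like this. To keep the sketch runnable anywhere, a tiny mock tab-delimited file stands in for the real Intogen download; with the actual file you would just supply its path instead, and the column names below are invented for the mock.

```r
# Write a two-line mock tab-delimited file, then read it back exactly as
# we would read the real download.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("sample\ttranscript\tref\talt",
             "S1\tENST00000371085\tC\tT"), tmp)

GNASmutations <- read.delim(tmp,
                            header = TRUE,
                            sep = "\t",            # the default, stated explicitly
                            stringsAsFactors = FALSE)
str(GNASmutations)    # the real file gives 703 obs. of 14 variables
```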
The separator: by default, this is "\t", but sometimes I like to type it out anyway, just to remind myself what I'm doing here. read.csv is clear, it's comma-separated values; read.delim allows many different options. So explicitly specifying the value here, even though it's the default, makes something explicit. You can do this; you don't have to. I would not consider it overkill to state this explicitly. And then, strings should not be factors: stringsAsFactors = FALSE. Thank you. All right, does it work? It didn't complain. Does it exist? 703 observations of 14 variables. And we have samples and transcripts, chromosomes, the most severe effect, protein positions, start positions, amino acid changes. Nice. What's the column ref? The reference sequence nucleotide. You can kind of see that it's a nucleotide, since we only have A's, G's, T's and C's. And we have hyphens in here; why might we have a hyphen in the reference? Right: if the change is an insertion. So the reference nucleotide is given here. What's the nucleotide of the mutation? Where do we find that, in what column? alt, right. So this is a nucleotide, and this is a mutation at protein position 186, where the existing arginine is mutated into a cysteine by changing the C of the codon into a T. So this characterizes the mutation. Do we know which codon position this is in? No, that's not specified. We know that it tells me this is protein position 186. But that's kind of fishy, right? Because we have the same mutation here, at position 57 million something, something 420: in protein position 830, in protein position 202, in protein position 187. It's always the same mutation, but it appears at different protein positions. So what's up with that? What does that mean? Exactly: this is due to the different transcript lengths and alternative splicing.
So this means we can't actually use the protein positions listed here directly; we need to consider which transcript they come from in order to map them back to our protein sequence. Remember, we wanted a lollipop plot where everything is plotted nicely on the protein sequence, but that doesn't work if we just take these protein positions as given. So we need to do it by hand: we need to calculate the correct position for the one particular transcript that we're using. This is, you know, one of the things that make data integration less straightforward than you might want it to be. Even though we have gene IDs and transcript IDs that can be transferred between different sites, once you go into the details, things are often slightly less obvious. But we'll get to that. Okay: how many observations do we have of each transcript? Some transcripts probably appear only once, some probably appear many times. How do we figure out which transcript appears how often? This is a job for table(). I always hear "table" from that direction. Okay, so: table of what? Something we find in GNASmutations. What was it called? I don't need to remember; I just type my dollar sign here and I get a list of the columns. And transcript, that seems to fit the bill: table(GNASmutations$transcript). So, we have quite a few transcripts. Some of them appear a lot, like this one, which appears 88 times with different mutations; some appear only once or twice. The transcript that, incidentally, is our canonical transcript, the one that was mapped in the lollipop plot, appears 32 times in the table. Although, when you look at that lollipop plot, I don't know why we see 40 mutations there while the transcript appears only 32 times in the table. Okay. Are there any transcripts in here that are not in our Ensembl table?
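The counting step just described, as a sketch; the vector is a mock, and in the session it would be the transcript column of the mutation data frame.

```r
# Mock transcript column: counts of each distinct value via table().
tr <- c("ENST00000371085", "ENST00000371085", "NM_000516",
        "ENST00000371100", "ENST00000371085")

counts <- table(tr)
counts                        # each distinct transcript with its count
counts[["ENST00000371085"]]   # 3 : this transcript carries three mutations
```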
So: we read the GNAS transcripts table, and that has a set of transcript IDs. Are all the transcripts for which we have observed mutations also reported in our transcript table? This is something you need to do very commonly. And I believe in the pre-work tutorial, did we use the %in% operator at some point? Percent-in-percent? Everybody shakes their head; we've never heard of that. No %in% there, okay. But the %in% operator is super useful, so let's get a brief idea of how %in% works. I'll write you an example. Let's make a little vector of letters with c(), say C, A, B, and another vector, E, A, D, C, G. Now, if my task is to find out which of the letters in the first vector appear in the second vector, I can use the %in% operator to do that. What this tells me is that C appears in the second vector, A appears in it, but B does not. Or conversely, if I turn this around and ask which of the five letters appear in the first set, this tells me that E is not in the set, A is in the set, D is not, C is in the set, and G is not. So it works similarly to the equality comparison, but rather than comparing against a single string or number, you're now comparing against a set of strings or numbers. The result of the %in% operator is a logical vector which has the length of the first operand. In the first case here, the length of the vector is three, so we get three logical values back; in the second case, the length of the vector is five, so we get five values back. Now I'd like you to write up a little example of your own. Let's take a few birds: crow, duck, hawk, and so on; these are birds. Only crows, ducks, and terns are local to Ontario. But in Ontario, if we look around Lake Ontario, we might also see beavers and raccoons. Okay.
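The letters example, reconstructed in code; the exact letters used in the session may have differed, but the behaviour is the point.

```r
x <- c("C", "A", "B")
y <- c("E", "A", "D", "C", "G")

x %in% y   # TRUE TRUE FALSE : C and A appear in y, B does not
y %in% x   # FALSE TRUE FALSE TRUE FALSE : one value per element of y
```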
Now write me an expression that says which of the animals in Ontario are birds, and which of the birds are in Ontario, right? You can imagine your list of birds could be a very, very long list of something, and the animals in Ontario are another list; some of them may be in the birds list, some may not. I'd like you to write two %in% expressions that answer these two questions. If you have them, blue post-it please; if you get stuck, red post-it. In principle, it's the same thing as in the examples in line 151 or line 154, except that instead of C, A, and B we're using ducks, terns, hens, crows, cardinals, grackles. I live on a boat in the Yacht Club on the Toronto Islands, and these days I always get woken up in the morning by the grackles; they hop on the deck and pick the flies out of the spiderwebs. That's their delicious morning meal. Swallows, we have lots of swallows; thank goodness they keep the bugs down. Remember, the result is always: which members of the first set are in the second set. Are we just looking to return a logical vector, then? It returns a logical vector, and the logical vector has the length of the first set. Okay, I think we're there. So if this vector here is our list of birds, and we want to ask which of the animals in Ontario are in our list of birds, we just say: this vector %in% that vector, which gives us TRUE, TRUE, TRUE, FALSE, FALSE. Now, someone just asked me: how do we get the actual names back from that? Well, we can do something like this: assign the result, and then use that selection vector to subset our vector, which tells us that crow, duck, and tern are in our list of birds. So it's always the same thing: we can use a vector of logicals as a subsetting expression for another vector. I could have assigned this vector to a variable, ontarioBirds, and then used it in that way. But remember, if this thing here is a vector, you can apply the subsetting operator directly to it.
You don't need an intermediate assignment. Now, which birds are in Ontario? That's exactly the same thing, just the other way around: here are our birds, %in% the list of animals from Ontario. And we see that there are two birds that are not on that list, the hawk and the puffin. And that's good. Once we reconvene from lunch, we're going to use this principle to find transcripts that are not in our Ensembl table. The strategy is just the same: we need the list of transcripts with mutations, and the list of IDs in the Ensembl table, and then we use the %in% operator to find whether they're all there, or not all there, or whatever the situation is. We'll repeat that often.
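The bird exercise and the transcript question have exactly the same shape. A sketch with made-up lists; the species on the course sheet and the real ID vectors will differ.

```r
# Made-up species lists for the exercise.
birds   <- c("crow", "duck", "hawk", "tern", "puffin")
ontario <- c("crow", "duck", "tern", "beaver", "raccoon")

sel <- ontario %in% birds        # which animals in Ontario are birds?
ontario[sel]                     # "crow" "duck" "tern"
birds[!(birds %in% ontario)]     # "hawk" "puffin" : birds not on the list

# The very same idiom for the transcripts (mock ID vectors):
mutTranscripts   <- c("ENST00000371085", "NM_000516")
tableTranscripts <- c("ENST00000371085", "ENST00000371100")
all(mutTranscripts %in% tableTranscripts)                 # FALSE: one is missing
mutTranscripts[!(mutTranscripts %in% tableTranscripts)]   # "NM_000516"
```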