 I hope you've all now digested what that in operator is all about and how to use it. So this brings us to that point of the task to ask, are there transcripts in our table of transcripts that are not in the ensemble table or in our table of mutations? So do we have transcript IDs in GNAS mutations that do not appear in GNAS transcripts? You've already written the code? Yes? Cool. Awesome. So, what should I write? In the same expression? Sorry. I'm excited about the expression. Well, we get a result. I was surprised. Right. I would be surprised too. So that's good. We need to figure out, is that the right thing to do or what did we do and maybe we'll need to explain it to our duck. So if we try to explain this to our duck, we would tell our duck, so hey, what we're doing here is we're creating two vectors and we're asking whether the result of the first vector is in the result of the, whether the first vector is in the second vector, whether it's contained in there. Okay. And then our duck might say, I don't understand. What's in that vector anyway? So we say, well, just have a look. Let's see what's in that vector anyway. So let's have a look what's in that vector. And then this appears and then the duck asks, well, what is that? And we say it's a table. So and the duck says, well, I'm, I'm seeing numbers and I'm seeing, I'm seeing IDs. So which one, what, what does, what does the in operator actually use? Does it use the IDs or does it use the numbers? And we say we don't know. We thought it would be using the IDs because that's like what we wanted to do, right? But we better check. So if we look at that table, that's what that looks like. So if this is using the names and the first entry here is false, this would mean that this name is not in that list. But that's not actually the case. I see 26, 5, 6, 20 here and I see 25, 6, 20 here. So that actually should be true. If it's using the numbers, that would mean 32 is not contained in this list here. And that actually seems to be true. Why is it using the numbers? Because a table is actually a named vector. So the contents of the vector are the counts. The names that we see are just attributes of the counts. So the actual contents, what the table is is counts. But the words, and we can use them, are things that are annotated to each count. And we can get them from the table with the names command. So we can use names for vectors, row names and call names for data frames and so on. So this is close, but as you've noted, there's something funny about it and actually trying to validate it, you will probably have noticed that this doesn't quite give you what you wanted. But it's close, exactly. So in this case here, the 26, 5, 6, 20 appears 32 times in the list of mutations that we downloaded from Intogen. It appears 12 times in the transcripts that are annotated by Ensemble. Any suggestion what we should be doing instead? We do what? Without table. So not tabling it, but using the vectors as is. That gives me a long, long, long, long list. And as far as I can see, they're all true. Get that in a moment. So right now they're all true. So what happened here? What were we doing? We were looking at a vector of 900 something or 703 mutations, 703 mutations, and we were checking whether they were in the transcript table. So there's a lot of redundancy here. However, the result actually tells us what we wanted to know. Regardless of redundancy, they're all there. If any of them would have been false, then the falses would have been spread all over the results, and that wouldn't have given us the correct result. Incidentally, if you have a long vector of truths and there's a single false in there, it's easy to miss if there's a false. There's two commands that we can use. The command all applies to a logical vector and asks whether all elements in that vector are true. So here the in command in operator returns 703 logical elements. If we wrap this whole thing into an all function, we get true. The result of this is yes, they're all true. Similar to that is any. This is even at least one of them, something. So that of course will also return true. Actually they're all of them. So all asks, are they all true? Any asks, is any of them true? How would we ask, is any of them false? With the exclamation mark, yes. And that's false because there's no false in there. So there's any false among these truths and so on. So any false is the same thing as all true. Right, but now in this case, this gives us the correct result. We know that they're all contained in there. But now how do we use unique? We can use it in two ways. First of all, table implicitly uniques something. So we could use this and wrap these table expression into names which returns the vectors, i.e. the top parts that you were looking at from the table. And that would give us the smaller unique list of truths. But we can also, instead of using table, realize that we don't actually need the cons. So doing table first and then extracting the names of doing that is unnecessarily redundant. So we can do unique instead, which gives us the same result. OK, well, that's satisfying. It seems to indicate that in principle, the intergen table and the ensemble transcript table seem to be talking about the same thing. Now we should think about which of these mutations we need to use. So how many of each mutation type do we have? That's the command that you're very familiar with now. How many of the type? Usually I hear a word from this side, which is table. And from that side, I more often hear a word which is unique. So table will count them. Unique will show us how many there are. So let's table them. Table what? GNS, what? Mutations? Mutations. GNS mutations. And the column is most severe. So we have 43 frameshift variants, 445 missense variants, 111 synonymous variants, 62 splice region variants, and so on. So that's good. Can we plot that? The relative proportions? Yeah. Can you give us what we want to do? Right. But what if a table true and false? Could be, yeah. OK. Yeah. So if you nest expressions like that, they're always evaluated from inside to out. OK. OK, so we have different things with different frequencies. Is there a common plot to look at different things with different frequencies? What do we usually do? Histograms, that's the distribution of values themselves, not quite what we're looking here. Here we're looking for the size of categories. So we use bar plots a lot. And in the Excel world, we use pie charts a lot. Usually bar plots are preferred to pie charts, but let's do a pie chart for once. So this is a pie chart of the result. Most of the variants we see immediately are splice variants. We have a small number of splice acceptor variants and splice region variants and stopgained and synonymous variants, frame shift in frame insertion. So splice, that's a bit of a problem. We don't know what splice region variants do. Are they silent? Are they non-silent? So that depends on the context. We don't have enough context to make that call. We just know that it's a mutation somewhere in the splice region, close to splice donor or acceptor. But we don't know anything else. So from interpreting them as missense or deleterious, it's a difficult call. So we can just exclude them by sub-setting. So let's exclude splice variants and also splice acceptor variants. So anything that contains the word splice, let's build a little selection here. How do I select rows that have most severe some things that contain splice? Using what? Later, but right now let me just define the selection. Define the indices or a logical vector that indicates these rows. What have we used previously to find patterns and streams? Grab. OK. So let's use grab. What are we looking for? What's the pattern we're looking for? Splice. So conveniently, this is listed here. By the way, the one thing where a pie chart is kind of more convenient than a bar chart is getting the category labels not to overlap or not to be truncated in a bar plot is often nuisance. And you have more plotting space for category labels in a pie chart. So we see splice acceptor and splice region. And it's lowercase splice. And we can just use that. So we look for splice. And where do we look for that? GNS mutations, dollar, most severe. 86 hits and a vector of numbers. So we want to exclude that. So how do we use row index numbers to exclude something with the minus sign? So GNS mutations is, what do I do now? Square brackets. Possibly something in the rows. Possibly something in the columns. How do I use my cell? Is this a selection of rows or columns? Columns. Rows. Rows. Let's go with rows. Well, the selection, however, is for those that we don't want. So we just put a minus sign here. And the columns, all of them. So before we do this, we have 703 observations. We already saw that cell has 86 items. So we expect 700 and 600 and 300. And 617 as a result. Yay. That seems to have worked in whatever way we wanted. OK. And the last validation we need is we need to check whether the reference nucleotides we have in our data are actually correct for the data we downloaded from GRCH38 for our 100,000 nucleotides. And that's a bit of a step, right? So we have a vector which goes from 1 to 100,000. And we have different numbers of the start positions, right? So we would expect that our vector from 1 to 100,000 somehow corresponds, if we map that to the genomic coordinate, 57 million, 4844, we would find a C there. If we map it to 4429, we would expect an A there. And so on. So now we need to map start coordinates to our vector coordinates. Simple arithmetic, but easy to get wrong. So what's the first nucleotide in our 100,000 nucleotides? So our vector might seek index 1. Where is that on the genome? The minimum value that we found. And where was that? I think that was somewhere in the sequence analysis script. Right, so we downloaded 58,815,001. So the index 1 in mySeq corresponds to 58,815,001. So in order to map mySeq coordinates to genomic coordinates, all we need to do is to add 58,00815,000 to the index. And then we get the correct genomic coordinate. Or we need to subtract 58 million, something, something 0,0,0,0 from the start position, which we find in the integer file. Then we will find the index in our mySeq file. Then we can extract the nucleotide at that index. And then we can compare, is this actually the one that integer tells us is in the right position? So integer minus offset equals, so can we do that? No, yes, no? Oh, that's very simple. It's just math. Offset is the difference between the genomic coordinate and our index. So genomic coordinates is 58 million, something. Our index in the mySeq list of nucleotides is 1. So in order to convert our mySeq index into a genomic coordinate, we need to add 58 million. Or in order to map a genomic coordinate into our mySeq file, we need to subtract 58 million, something. Right? So let's try this. We can just do this by hand. We look into the genus mutations and start position 57484420. And we just looked at mySeq analysis. And we saw our first nucleotide is 58 million, something. So nucleotide 1 is this position. So we add this to the nucleotide to get the genomic coordinate, or we subtract that from the genomic coordinate to get the nucleotide. And we get the index in the file. And that's, whoa, what's that? 1,330,580. Negative? Yeah? I just have a question. Our chromosomal coordinates, are they always in nucleotides? Clean units in nucleotides? Yeah. Right. These are just a number of nucleotides from the positive stand in the beginning of the short arm, the P arm. Chromosome. It's per chromosome. So the coordinate is somewhat incomplete, specified as a chromosomal coordinate. It would be 20 colon 20 feet. So this is just a number of nucleotides. The 20 is implicit. But we get a negative number. What? So do I need to add this instead? Or what's going on? Why is this a negative number? The wrong genome. It's the wrong genome. It's the wrong genome. Intogen is not using GRCH38. Intogen is using GRCH37. That's the kind of stuff that will make for a happy day when you add vodka. You realize this, and you've gotten this far, and we can't just take these coordinates. And you have to be super, super aware of that. If you just blindly trust that and you take the 57 million and then you take your genomic coordinates and you get the nucleotide there, it's wrong. It's just not in the right place. So molecular biology is messy. Bioinformatics adds a level of messy to that. And you have to be super, super careful at every single step. If you only remember one thing from this workshop, it's this experience. We do something which is computationally absolutely correct, but doesn't make biological sense because our data sources are somehow not compatible. Not just the build, but actually you also need to put in the patch level. The patch level. If you've looked closely, wherever it says ensemble something, it says GRCH38.p13. So there's also patches which are done alongside major releases. And strictly speaking, you have to know the release and the patch level to be able to correctly identify and map a genomic coordinate to the actual victim. So we need to recover from that. Fortunately, everything we've done is in the script, so we can recover very easily. The easiest way to recover is just to toss out MySeq and download MySeq37 instead. I need the nucleotides that are correct for GRCH37 and then validate them. So everything that we've now done to prepare for our Intogen table stays in place. And our transcript table is still kind of correct except that the coordinates in the transcript table wouldn't match anymore. But if we simply download the nucleotides, at least we can tell where the mutations map to. Note one thing, however. Remember that we checked whether the transcripts match. So the transcripts in the transcript table were from the ensemble GRCH38. The transcripts in Intogen come from ensemble GRCH37. So the transcript names are not changed. It's just the position of the transcripts on the chromosome that changes. If the new annotation, which has apparently added something like 1.3 million of nucleotides upstream of that position to chromosome 20, if that would affect the nucleotide sequence of one of the transcripts, then it would get a new ensemble ID. But since the sequence between 37 and 38 appears to be the same for these transcripts, they maintain the same transcript ID, even though that on the new reference genome they're located in slightly different coordinates. OK, so how do we fix that? Well, we need to make a new table. So we need to find a gene source for GRCH37. So we go to ensemble, if I would be. And on the main ensemble page, the location of the archival GRCH37 link is a little more prominent. But you can also find it on this page here. So here, there's a link on the US East mirror, still using GRCH37 question mark. And if you click on that, you find visit our GRCH37 dedicated site. And this has a different URL, grch37.ensemble.org. This means we need to look for slightly different coordinates. So we can probably do that by just typing humanGNAS. Something confuses it. I'm impatient. Maybe it finds everything. Let's just look for GNAS. Come again? But I get this as the top list. GNAS, humanGN87460 looks correct. Next, do you mean here, humanGNAS? OK, well, that's nice. That looks very similar. We can see these coordinates here. So this is around 47, 410,047, 500, 57, 500,000. So let's download the data for 20 colon 57, 410,001, 257,510, 0,0,0. Cation, we send exactly, if you look up, location. Right, but this is the exact location of the GNAS gene, but it's not 100,000 nucleotides. So I'd like to keep that the same. Just the different genome assembly, but we'll work with these 100,000 nucleotides. File yesterday, 37. OK, is there a way to convert that? Yes, there are ways to map GRCH37 to GRCH38. It's a bit involved, but it's possible. So basically what you need is a file of offsets of every single change between one and one. So you need to read this very large file, and then you need to apply all the offsets from one coordinate to the other coordinate to. So essentially it works similar to what we saw before. Take a GRCH38 coordinate, subtract 1.3 million, and that gives you a GRCH37 coordinate. Then these values of 1.3 million are, of course, dependent on where on the chromosome you are, and that can be a bit involved. So what I did here is I just clicked on the location, which gives me a map by location, and then I entered into the location field my string here. So then I get to chromosome 20, 57,410,000 to 57,510,000. Now how do we download the nucleotide data from that? We've already done that once. Export data, output as fast A sequence. The location is this, no flanking sequences, genomic unmasked. Click on Next, go to Text, File, Save As. And now we give it a file name of chromosome 20, 100 kilobase pairs dot 37 dot fast A. Don't append 10. And the file is here, and if we look at it, it confirms that this is correct. So GRCH37, chromosome 20, and these are the co-ninates we want. Now we need to read that fast A file with our read fast A function. My GRCH37 is read fast A of chromosome 20, 100 kbp37 dot fast A. Right. Now we're back in business. So to recover that was relatively quick, but it was super important to have noticed that. So being back in business means that now we should be able to apply an integer coordinate, track the offset, and get something from our MySeq index here. So let's verify that. Offset 37 should be 57,410,000, because that's from where we downloaded the data. And now let's see what we have here. So GNS mutations of this value here should be a C. MySeq 37 of this value here minus offset 37 is a C. Well, that's not bad. I'd be very annoyed if that weren't the case. Is this just being lucky? What's the chance of being correct by chance? One in four. One in four. So we don't trust it yet, right? Let's try another one. Here we have a G in this one here. It's a G. So what's the probability of that happening by chance? One in 16. So we're getting better. OK. So I'd like everyone to pick a random nucleotide from the table and check whether they get the right nucleotide back from MySeq 37. If anyone finds a wrong nucleotide, that would be important to know for two reasons. Either our mapping is wrong, we need to know that, or your data is wrong, and we need to know that, too. So pick a random start position on the integer mutation table. Apply the offset. Look for the index position in MySeq 37. Pull out whatever nucleotide is found there and confirm whether that's the same as the nucleotide in the reference column of the mutation table. Now, of course, I'm essentially crowdsourcing this. If you weren't crowdsourcing this in the class, I would write a loop. And for every unique start position in the table, I would pull out what the reference coordinate is, and then compare reference coordinate to the found coordinate, and tell it to complain loudly if there's a mismatch. So if you were able to check one, please let us see a blue post-it. OK. As I glance around the room, I see no reds and lots of blues. So assuming that you've not just all reproduced the first one I tried, but looked at several different ones randomly in our file, we've seen no contradictions, which is good, which means we now actually have data that we can integrate. Because our identifiers actually mean the same thing. I'm always really, really thrilled if something like that works. We can pick a coordinate from a site in Barcelona and take a 57 million number, and we can look into something. We download it from a site somewhere in Boston, probably. And we find that we get the same letter there. This is pretty cool. Anyway, so now messy. I'm not going to give you a long sermon about how to build data models and how to integrate in a more principled way. I have a little PDF file here about data models that I share with my bioinformatics students. This is an annotated file of essentially a presentation. And these are presentation slides with extensive annotations. So this actually should kind of make sense if you read through that when you have time. Very often we use relational data models when we build data models of the data that we want to work with. Relational data models are defined by entities. This is like a data frame in R. Entities have attributes. Each of the attributes would be like a column in a data frame in R. And there are many of these entities in our database. Each of the entities would be like a row in a data frame in R. But these relational models also have relationships to other things. So for example, if we have a protein, we can store the protein ID and the taxonomy ID. And the taxonomy ID might relate to a second data frame, which has taxonomy information, i.e. taxonomy ID and the species name. And they're related because they share an ID. So generally, we build relational models by having tables or data frames that somehow have shared information. But that's all I'll say about that. Read through that when you have a few quiet moments. Now in order to integrate the data for our plot, we need genomic coordinates because that's what the sequencing experiments and variant calling and whatever return. We need the coding sequence. We need the codon positions and the translation. And we need the mutations that are mapped to the sequence of interest. So for example, from the mutation table that Indogen has provided, we can't tell whether adjacent nucleotide positions map to the same or different codon. They could be the last position of one codon and the first position of another codon. Or they could both be in the same codon. So if we count our mutations and map them to amino acids on the amino acid sequence, we need to distinguish them. Now Indogen has a column of protein positions. But as we've already seen, that depends on the transcripts. So what we should be doing is actually taking the nucleotide sequence, figuring out the gene model, piecing together our sequence from the gene model and translating that, and then mapping our mutations into that. And for that, we need a gene model. So what's a gene model? Did you ever come across the term? Yeah, in principle. So exons and introns is part of the concept. The gene model actually contains the exact start site, which may not be the first nucleotide of the first exon, but somewhere in the exon. Usually there's some untranslated 5-prime UTR. And the last stop codon is not going to be the last nucleotide in the last exon. So we need to know these as well. But other than that, the exons and the introns of the internal exons and introns are all mapped to CDS. So coding sequence, CDS, identified by the ranges of genomic coordinates that refer to the coding sequence. So there are two major sites of integrating information that we use in bioinformatics. One is the NCBI entree system, which is quite useful to explore NCBI data. They've recently done a lot in actually cross-referencing information into the EBI world, which is also very useful. In general, though, working with things programmatically, the EBI site is more useful. And the big integration facility at the EBI is Biomart. So Biomart provides integrated data. Now, we need integrated data for GRCH37. So be careful. Don't just go to any Biomart, but we need GRCH37. And we already have GRCH37 loaded here, GRCH737.org. In the top menu, you find a link to Biomart. So follow that link. So what we're trying to do now is to download a gene model. If I were to tell you how long I took to find, where to find download of a gene model information, I would probably be slightly embarrassed. But it took far too long. If you look into the genome browser and you click on the translated exons, it's immediately available. Things are colored in the right way. But then if you try to download what coordinates this actually corresponds to, this is horrible. It's just not done with that in mind. But finally, Biomart to the rescue, Biomart has what we need. So let's download that from Biomart, GRCH37. So getting Biomart data involves four steps. But first, we need to choose the database. There's Ensembl Genes 91, Ensembl Variation 91. This is the database that contains all the SNPs. Ensembl Regulation 91 contains also expression information. So what we need is Ensembl Genes 91. I just noticed that in your script, I'm actually navigating to ensemble.org and downloading this. That's actually wrong. You should be going to GRCH37 ensemble, choosing genes 92 and choose human genes GRCH37. But we'll walk through that. So Ensembl Genes 91. If this were my project, I'd make a note in my journal now because this is not something I'm writing into the code and it's actually wrong in your code. The data set we use is human genes GRCH37 P13. And once we've selected that, we get to choose what aspects of that integrated data we want to see. We can filter and we can retrieve attributes. So to filter, we can define regions, genes, phenotypes, gene ontology, multi-species comparisons, protein domains and families, variants. There's a ton of information that's available there to restrict your search to some things. So in terms of gene ontology, you could look for genes that have particular go term accessions. In terms of phenotypes, you could choose from a long list of known phenotypes. And select gene information that is associated with these phenotypes and so on. So what we are looking for here is gene level information. And we are going to select with gene stable ID. So what was the gene stable ID that we were working with? Well, it's in here. The ensemble gene ID for GNAS is ENSG087460. So we copy that, paste that into the filter. Clicking outside of that text window anywhere will actually load that filter restriction and then apply it. So this is what we want. We want information about this gene. But what data do we get back? Well, that's contained in the attributes. So filters take a subset of the data. Attributes define what information from that subset we actually get. The attributes we want to get for our gene model is a structural attribute, i.e., information about where in the genome this is found. So we click on structure. And we can choose attributes at the gene and at the exon level. At the gene level, we want the gene ID and the transcript ID to make sure we have the right gene and we know which transcript this refers to. Of course, different transcripts have different gene models. In terms of exons, we need the exon rank in the transcript, just to be sure. So first, second, third, and so on. The genomic coding start, the genomic coding end. So this is in our gene model where in the nucleotide sequence this all begins and ends. And the exon stable ID, we could compare this. And I think that's all we need to define here. So definitely, if you find some time, explore this some more. Look at the data that you can get there. This is, even if you're not working human genes, quite likely in any other model organism, they also have similar data available. It's very well curated, very useful. So one of the things we need for integrating information all the time. Note that the Bioconductor project has a Biomart R package, which allows you to set these parameters in a programmatic fashion. But even though you can do that, it's still useful to basically manually try this out from time to time so you know what options are available and how this selection works. So selecting gene IDs and getting this list of attributes, we can then click on Results and see what we get. Gene stable ID is the correct one. We see exon ranks, which is what we need. Genomic coding starts, at least for some of them, exon stable ID, and so on. So that's all good. It allows us to export this into a file. As CSV, TSV, HTML, or XLS, let's just keep the default TSV and click on Go. And this downloads a file into my Downloads folder, which is called martexport.txt. And if I look into that, I see, indeed, we get lots of information, much of it, which is probably useful. So I'm taking this. I'm going to name it GNSGeneModel37.tsv and put it into the project folder. All right. So we have GNSGeneModel37.tsv. Then, of course, it's not yet useful as a file. We read it into a data frame, as we've done before. The read command is read what? CSV, read the limb. The file name was GNSGeneModels. Any parameters we need to set? Does it have a header? It has a header. Good. Anything else we need to set? They're tap separated. That's default. Well, we've said we can be OCD about that and specify that anyway. Anything else? Strings as factors. Thank you. Strings as factors is false. And go. So GNSModels, 435 observations, six variables. GeneStable ID, transcript stable ID, exon ranks, coding start, coding end, and exon stable ID. Yeah. Why which transcript? Hang on. Hang on. Hang on. Hang on. Hang on. Hang on. Hang on. Let's back up. OK. So remember, we specified the genes as attributes, the gene. Let's hang on. Let's step back. So the filter is GeneStable ID. And the GeneStable ID is ENSG87460. So that's an ENSG, not a transcript. It's a gene. So we're looking for data that are annotated to this gene. So there's lots of data annotated to that gene. But what we're specifically looking for is defined when we go to attributes. In this case, we're looking for structural information. And we need some information defined at the gene level and some information defined at the exon level. The information at the gene level is GeneStable ID. Well, we actually already know that because we've only had one, but it doesn't hurt. Basically, as a sanity check to see that we've got the right gene ID in our output data. And the transcript stable ID. At the exon level, we want the exon rank in the transcript, the genomic coding start, the genomic coding end, and the exon stable ID. That allows us to integrate things with the nucleotide data, which we've downloaded. Once you have this defined, the right filter on the right data set and asking for the right attributes, you can click on results. The results very kindly show you what the results table looks like, what's contained in it. So you can be sure, yes, I'm indeed seeing the right stable ID. Yes, I'm indeed seeing transcripts. I have data for what I'm actually asking for. And once we're happy with what we're doing here, we can use the export function, export all results to file, choose a TSV format, click on Go. That will put the data into your standard browser download folder, where all your downloads end up. Then you go into your download folder, you change your file name, and you move the file into your project folder. So if you're stuck with that, let's see Red post it. OK. No, that's a trouble. So that change, click on the minus sign. Exxon, we want it. Exxon, we want it to be square. We're going to get it in this browser. Exxon, we're going to get it in this browser. OK, now click on Results. That's the first one. Who else? Exxon, we want it to be square. I don't know what this is. Oh, save that. Save that. Save that. This is cool. Exxon results in your file, TZ, and just click on Go. Save. Save. So this is somewhere in your download folder. And rename it. And move it from your download to your project folder. Or save it directly to your project folder. Okay, are we all there? Excellent. So what I'll do next is there's a couple of talks of putting the data together, which is essentially more of the same. So what I think we'll do here is I'll just type out. You try to follow along as well as possible. I'll just type out what we're doing here to put the data together. Then once we're done and happy with the result, I'll take the code and I'll post it into my code snips and I'll upload it. And when you come back from coffee, you can download it, run the code, and then you should be all set and have the same data locally, right? So the first thing is we want a data frame which contains only our GinaS2 gene model according to the following specifications. So we choose data for ENST something 371095. This is a transcript that codes for a particular protein which has these identifiers. So it's an ice form of P36092, GinaS2. That data frame should be called GinaS2 model and store the column start and end for each CDS segment. And then we need to make sure the segments are in the correct order. So we do that. Our spec calls for a data frame called GinaS2 model is a data frame, and it needs to store column start and end from our GinaS models, genomic coding start, but not all of them, just a selection. I'll need to think about what that selection ought to be. And we need a second column here called end is similar, genomic coding end, and again a particular selection. Now these are numbers so technically I wouldn't need strings as factors, but it doesn't hurt. Now I need to define what that selection is going to be. Selection is going to be all rows in my GNAS models $transcript stable ID, which are this transcript. How many do we have? 12. Well that's kind of what we would expect for the 12 exons that the UniPro site told us that we would have. So let's build this little model here. 12 observations of two variables, starts and end. Okay, that kind of looks like approximately what we want. Now make sure the segments are in the correct order. Are they? How do we know? What's the correct order? So molecular biology is really, really messy, but I'm not aware of transcripts that are spliced out of linear order. So where an exon ends up before another exon due to splicing. I can't guarantee that this never happens. Some organisms make very, very strange RNA modifications, but I think generally it's safe to assume that the order of exons is determined by the order of the starting indices. So let's have a look into that, into our model gene as to model dollar starts, and we see the following. 12, 1, 11, 2, 3, 4, 5, 6, 7, 8, 9, 10, oops. That tells me that these start and end coordinates are not in the correct sequence. So the 12th one should actually be the first one. That's the smallest value here, 4, 7, 4, 6, 6, something, and indeed 4, 7, 4, 6, 6, something is smaller than 4, 7, 4, 7, this here. So if we wouldn't have checked this, we would now have started assembling our coding sequence in the wrong order. You can never be too careful. We did check it. We caught this, so the correct way to order this is to say GNS2 model is GNS2 model. The rows should come in the order which we've stored in our vector ORD, and we want all of the columns. Now looking back into here, this is smaller than this. This is smaller than that. This is smaller than that, and so on. Now they're in the correct order. The order we get is our original 12th, then our original 1st, then our original 11th, and the rest came through in sequence. Why it comes through from biomarked in an incorrect order or in a non-canonical order, I have no idea. It just shows that there's no guarantee about these things. You have to be careful. You have to look at your data all the time. Never blindly trust that you'll be getting the correct thing. An error like that is really hard to catch, because syntactically everything looks fine. You have a start ID, it's smaller than the end ID. You have a segment. You start putting things together, A, B, C, D, E. You get a nucleotide sequence to translate it. It makes no sense whatsoever. Because very likely you'll already start with a frame shift at the first one. All right, so that's done. Now we create a data frame for GNS2 protein annotations according to the following specs. Actually we'll do that after our coffee break, and I'll lift over the code and walk you through that, and then we'll be ready very quickly to plot all of this. So three o'clock, good time for a coffee break, half an hour until 3.30, and then we're going into the final lap of this grueling workshop. So drink lots of coffee.