Welcome to our second day of Introduction to R. I'd like you to pull the latest version of the project so that you have all the updates. You might need to be connected to the internet. Okay. In your files, you will find a file called codesnips.r. This is where I will place any significant piece of code that is not in one of your own "my..." files but that we develop on site. So what I've put in there is a function to make dinucleotides, the code that we wrote. But I've also put in a few notes on something we didn't do yesterday: improving the function. I said we were doing it in the simplest and most pedestrian way possible. We're just iterating over every position, capturing the two nucleotides that we need, pasting them together and putting that into a vector. The paste function itself gives us a smarter way to do this, because vectors that are pasted are joined element by element. So if I have two vectors and I paste them, each pair of elements is pasted together. The way this works: assume we have two vectors, A, B, C and D, E, F. If we paste them together with a colon character to separate them, what we get is the first element of the first vector with the first element of the second vector, the second element of the first vector with the second element of the second vector, the third element of both vectors, and so on, right? Or if we have three vectors, one that goes from one to three, one from four to six, one from seven to nine, we get one, four, seven; two, five, eight; and three, six, nine. So we can use that to do dinucleotides in one simple expression. As another example, I've shown how to use this to paste together all 64 codons. That's something that comes up from time to time: I need the codons, so I need to do that. So what I do here is I build a vector with 16 repetitions of A, then 16 repetitions each of C, G and T.
Then I make a second vector: four repetitions of A, of C, of G, of T, but the whole thing repeated four times. And then I make a third vector, just A, C, G and T, but repeated 16 times. So the first vector looks like this: 16 A's, C's, G's and T's. The second vector looks like this: A, C, G, T four times each, repeated four times over. And the third vector is just A, C, G, T, A, C, G, T all the way. Now if I take these three vectors as columns and paste them together, I get the 64 codons. So the first codon is the first character from here, the first character from here and the first character from here. The second codon is this A, this A and this C. The third codon is A, A, G, then A, A, T, then A, C, A and so on. So you can permute letters in this way and paste them together. This approach to building permuted versions of strings is useful in many ways. But now this means we can write a super simple expression for our dinucleotides. So I've changed the function dinucleotide vector. Remember, it was this longish for loop where we calculated froms and tos and then pasted things together. Now I just make an index, idx, from one to the length of my input vector minus one, so everything but the last nucleotide. Then I paste together the values at idx, from one to the length minus one, and the values at idx plus one, from two to the end. And I paste them together with a separator. The parameter sep introduces a character that separates the values that are being pasted together. So in this case here, where we pasted our A's, B's and C's, we used a colon as the separator. So the output strings are then A colon D, B colon E, C colon F. But if there's no separator, it just means put them letter against letter, character against character. By default, paste uses a single blank as a separator, so we need to explicitly specify no separator. And yeah, I've overwritten my sequence with something else. That's actually not nucleotides but amino acids, but it's the same thing. It's a sequence.
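The following is a minimal base R sketch of the paste-based approach described above. The function name dinucleotideVector and the example sequence are assumptions made for illustration; they are not necessarily what is in codesnips.r.

```r
# paste() joins vectors element by element; sep sets the separator.
pairs <- paste(c("A", "B", "C"), c("D", "E", "F"), sep = ":")

# All 64 codons: three vectors of length 64, pasted column-wise.
nuc    <- c("A", "C", "G", "T")
first  <- rep(nuc, each = 16)                   # 16 A's, 16 C's, 16 G's, 16 T's
second <- rep(rep(nuc, each = 4), times = 4)    # AAAA CCCC GGGG TTTT, four times over
third  <- rep(nuc, times = 16)                  # ACGT ACGT ...
codons <- paste(first, second, third, sep = "")

# Overlapping dinucleotides without a loop.
# s is a character vector with one letter per element (hypothetical input).
dinucleotideVector <- function(s) {
  idx <- seq_len(length(s) - 1)        # every position except the last
  paste(s[idx], s[idx + 1], sep = "")  # sep = "" overrides the default blank
}

mySeq <- c("G", "A", "T", "T", "A", "C", "A")   # made-up example sequence
dinuc <- dinucleotideVector(mySeq)
```

Here seq_len() is used instead of 1:(length(s) - 1) so that a one-letter input returns an empty vector rather than counting backwards; that is a design choice of this sketch, not something stated in the session.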
The last letter in each element has to be the same as the first letter of the next element, because we're building overlapping dinucleotides. And the rest in codesnips.r is just the things we've written out for our bar plots. So we'll do something else today. If you've downloaded the newest version of your code and you type init(), it re-initializes and creates copies of things that might not be there. One of the files here is my data integration. And we'll talk about data integration today. So open that file and use it for your own edits. I'm going to edit codesnips.r if necessary, and the journal, my notes, if necessary, but mostly this here. Now, as you know, our data is very high-dimensional. That didn't use to be the case. 15 years ago, if you were working in genomics, you were working with nucleotide sequences only and sometimes translating them to amino acid sequences. If you were working on protein structures, you were working with structure data only. If you were working in genetics, you were perhaps working with pedigree data only. But nowadays, things are immensely more complicated. For every single gene, we have tons of annotations. We have sequences and transcripts and exons and hundreds of variants and SNPs and structural variants. We have phenotype annotations. We have protein structures. We have long lists of homologous sequences, of conservation, of experimental results and so on and so on. And to make the most of our data, of course, we need to integrate all of that. We need to make informed decisions based on data that we get from a multiplicity of sources and with a multiplicity of meanings. This can be numerical data. This can be informal annotations. This can be textual data. Many, many different things. So what we'll do today is integrate data onto a gene that is in the region we looked at yesterday. And the goal is to get a plot that looks like this.
A plot that does exactly what we have here doesn't, to my knowledge, exist like that in R. So we will write our own code to make a customized plot. We're not just taking everything from packages. We're actually writing the code to do our work, i.e., we need to understand what we're going to be doing here. So what we're doing here is plotting cancer-related mutations, which we get from a cancer database, the IntOGen database, onto a gene that we find on chromosome 20, GNAS. We're going to sort them by the different types of mutations. We're going to plot them at the position where they appear in the coding sequence, in the protein sequence. And we're going to scale the circles that we draw as annotations here by the frequency with which that mutation has been observed in the database. So the task is first to understand: where is our gene? Where is it located in the nucleotides that we have here? What's the gene model? What's being transcribed? What kind of mutation information do we have? Where is that annotated to? And then to integrate the annotation of the mutation locations with the gene model that we find in the genome database. So our first task is just to open the Ensembl genome browser again. If you've taken notes, you have a bookmark to that or the link that you used last time. If not, it's easy to find. Then open the coordinates 58,815,001 plus 100,000 nucleotides on chromosome 20 in hg38, the most recent assembly in the Ensembl genome browser. And then let's see what gene is annotated to this region, or what we know about this region in the first place. What is this region? When you're done, put up a blue Post-it. If you're stuck, if you don't know what you're doing, if you have a hard time following me, put up a red Post-it and we'll get you back on track. Save what? We're not saving anything yet. We're just looking at the region. You already saved these nucleotides yesterday.
We'll discuss this page a little bit, but if you're already done while I wait for everybody to catch up, you can look at the next task, and that's the question: where do you even find transcript coordinates? How do you go about that? This is a genome browser page here. Where do we find the actual coordinates of the transcripts that are annotated in this window? This is where I'm reminded of what I said this morning. I'm a person who doesn't always read the instructions, but they're useful. You have examples here. There's a pattern of how things need to be set up. This is rat 5, colon, then a large number, a single hyphen, another number and so on. If we take this pattern and translate it into something like human 20, 58 million to 58,900,000, it'll just work. I see one red Post-it, but that's probably a remnant. It seems we're all there. Most of the genes that are annotated here are GNAS something. Why are there so many different versions of that? What does that even mean? For example, you have this red one here and this red one here and this one here. Are these all genes? Why are they different? What's the difference? Alternative splicing. Alternative splicing, yes. The bane of working with sequences. This is horrible. You're looking at a sequence and you think, well, that's the sequence that you have. Well, no, that's possibly one of the sequences. There are usually many sequences that are produced from a single gene, and they are spliced together in different ways by alternative splicing. GNAS seems to be one of those. I just randomly picked that. This is just my luck. It's one of the most complicated loci in the human genome. It's multiply imprinted, it's biallelically transcribed, it depends on what you inherit from your mother and your father, it's alternatively spliced, and it's all over the place. So let's look a little more at what we're doing here. The idea is that once we've mastered that, everything else will seem easy by comparison.
But the difficulty here, mind you, is not R. The difficulty is the molecular biology. Molecular biology is insanely messy and we just have to be aware of that. We can put things into our neat computational paradigms, but if we don't take into account the messy biology, we'll just be doing cargo cult bioinformatics. Okay. So how do we get more details on anything here? It's a web page, right? So what do we usually do with web pages? We click on things. So what happens if you click on one of these things, like this one here? Ooh, we get a nice pop-up. This tells us it's a gene. It's called the GNAS complex locus. It has a gene ID. We can also click on that. It tells us the location of the locus. It tells us that we've clicked on exon one of four of one particular transcript. There's a large number of transcripts and they have different exons. So we can find transcripts. There's a protein that is annotated to that, with protein variations and so on. So the first thing we want is a better idea of the transcripts. For that, we'll just open the gene page, where we will find more information on the gene itself. I usually have about 250 tabs open in my browser because I have a habit of opening new information in new tabs. I don't know. You can use the back function or you can open it in a new tab. You're probably better off doing this in a new tab. So either command-click or choose open link in new tab. And that's the gene information page. It tells us something about the complex locus. It has synonyms and so on. It tells us there are 58 transcripts, 76 orthologues that are known, one paralogue, and it's a member of protein families. It's associated with 115 different phenotypes. So it's an important gene, mutations of which have a large number of consequences. And there are proteins that correspond to one, two, three, four UniProtKB identifiers.
So let's open one of these, P63092, to see a little more of what UniProt tells us about the protein that is encoded here. So GNAS is a guanine nucleotide binding protein, alpha, isoform short. And this P63092 is version two of this. These G proteins function as transducers in numerous signaling pathways. So no wonder there are a large number of phenotypes associated with it, because it's a central part of the machinery in the cell that transduces signals from the outside to the inside and changes gene expression patterns. But going back to the gene, we said we wanted to have a better idea of the transcripts. So we click on show transcript table. This tells us the transcripts that are there. Some have proteins, some have no proteins, some have proteins of different lengths. Some are associated with one or more splice variants. For example, notice we've looked at P63092, but we already see three different versions of that in the transcript table: one that encodes only 380 amino acids, one that encodes 394 amino acids, one that encodes 395 amino acids. So in order to be able to take that transcript information and actually pull out the nucleotides from our nucleotide table, we need to download the transcript coordinates. And that's very simply done. We have the transcript table open. We click on export data. What we want as output is comma separated values, because we want the annotation data, not the nucleotides themselves. The strand is the feature strand, usually the plus strand in the genome. No flanking sequence. And as options for the comma separated values, we only want the gene information. If you select all and you get all variation features, you will get every single SNP that has ever been annotated to one of these transcripts. And trust me, that's a very large file. You don't want to do this on your mobile or anywhere on limited bandwidth. But we just want the gene information, something very simple.
So: comma separated values, feature strand, no flanking sequence. And as options for the comma separated values, only check the box for gene information. And let's look at this as text. So this is kind of large, not too large. It's kind of hard to read. So we'll download the file, we'll save it, we'll read it into R, and then we'll try to find information of interest to us. So save the results page under the following filename: ENSG, whatever, data.csv. This ENSG is the Ensembl gene. These identifiers have different semantics. ENSG is an Ensembl gene identifier. ENSP is an Ensembl protein identifier. ENST is an Ensembl transcript. ENSE is an Ensembl exon. And they have very long numbers. The long numbers are important for the database because these are stable identifiers. This means if there are new versions of transcripts or exons or whatever, they get new identifiers as they're being updated. So save that to your project directory. Save as text. Don't append any text or whatever to that. If your browser automatically tells you, no, no, no, this is a text file, it has to have the extension .txt, then roll with it, call it .txt, but remove that extension later. Once it arrives in your project folder, it is supposed to be called ENSG 0087460 data.csv. And the next task then is to read this file into an R data frame, which is called GNAS transcripts. So now we read a file. For the last file we read in R, we used readLines. That was unstructured data. We just wanted to get the text and then we pasted it together. That was nucleotide data. Now this is structured data. It's comma separated values. If you think of this as a spreadsheet, the values in each row, which correspond to the cells, are separated by commas. So if there are five values in a spreadsheet row, you need four commas separating the five elements. If some of these elements are missing, you might just have a series of commas there. Those are empty elements.
And you'll also notice that this file has a header, i.e. it has a list of names that become the column names once we read this in. So once again: it has a header that corresponds to column names. It's organized in rows. Each row has the same number of elements. All the elements are separated by commas. Now one thing we could do, following on from what we did last time, is just read the lines as text with readLines, go through every single line, and split the line apart on commas using strsplit. Then we would have all the elements, and we could use these elements and put them together into a table. But since this is a super frequently encountered format, there are specialized functions that read files of this type. Most often you will encounter something that is either a CSV file or a TSV file: comma separated values or tab separated values. This is the most frequent, vanilla, structured data interchange format. CSV or TSV. Mind you, at home in your lab you will often find yourself having data in Excel spreadsheets. When you want to read Excel spreadsheets into R, there are specialized functions to do that. I try to avoid that if at all possible. They often make assumptions about how the Excel format works, it's poorly documented, and this can go wrong in many ways. What's much better is to export a CSV from Excel itself and then read the CSV. As you export the CSV from Excel, you can have a good look at your data and see, well, does this even make sense? For example, what we often have in Excel worksheets is a section of data of one sort, and then somebody puts data of another sort on the same worksheet. Now if you read that as a CSV, in one column we might have proband identifiers, but further down in the same column we might have, I don't know, hemoglobin concentrations. So that's not good.
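To make the contrast between manual parsing and a specialized reader concrete, here is a small sketch in base R. The four lines of CSV text, including the identifiers tx1 to tx3 and the column names, are made up for illustration and are not taken from the Ensembl export.

```r
# A tiny made-up CSV: a header line plus three rows; row 2 has an empty field.
csvLines <- c("id,start,end,note",
              "tx1,100,250,first",
              "tx2,300,,second",
              "tx3,500,650,third")

# Manual route: split each line on commas, then bind the rows into a table.
fields <- strsplit(csvLines, ",")
header <- fields[[1]]
rows   <- do.call(rbind, fields[-1])   # character matrix, one row per line
colnames(rows) <- header

# Specialized route: read.csv() handles header, types and empty fields.
df <- read.csv(text = paste(csvLines, collapse = "\n"),
               stringsAsFactors = FALSE)
str(df)
```

In the manual version every value stays a character string and the empty field is an empty string, whereas read.csv() converts the numeric columns and reports the empty field as NA. Note also that strsplit() drops a trailing empty field at the end of a line, which is one more reason to prefer the specialized reader.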
So basically, take one set of data that belongs together, export that in CSV format, save it, and then read it into R. Well, but then how? How do we read this into R, into an object called GNAS transcripts? What do we need to do? I don't know, you tell me. If you don't know, Google. What would you Google for? If I need to Google for something like that, what I do is say R. I always start with R. I don't know if this just works very well, or if whatever personal machine learning algorithm is behind that somewhere in the big Google cloud has learned that I need results relating to the programming language R if I do that. But it seems to work very well. So I start with R. Incidentally, this one wouldn't know about my search preferences because I'm not signed into Google on this browser here. Okay, R, read CSV into object. Something like that. First thing that comes to mind. Read CSV in R. Take the top hit. And apparently there's a function read.csv with a file argument and a header argument and a separator argument and so on. So that looks useful. So we have GNAS transcripts is read.csv. Oh, there's a read.csv2. read.csv, read.csv2. Which one do I use? What's the difference? Can you spot the difference? Right. The separator is different. There's a second difference. The second difference is the parameter dec. What's the difference here? One has a comma, one has a period. Why would they do that? Some people write decimals with commas. Who? Europeans do that. In Europe, they rarely use thousands separators, but if they do, they use periods for that. And they use a decimal comma, not a decimal period. So be aware of that. If you get data from your collaborators at, I don't know, Roche or Sanofi in Switzerland, your data might have commas in numbers and not periods. Now what happens if you read that data with read.csv? read.csv thinks that comma is a field separator and it splits your values at the comma.
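A short sketch of the difference between the two readers, using a made-up two-row file written with European conventions (semicolons between fields, decimal commas). The sample names and concentrations are invented for illustration.

```r
# Hypothetical European-style file content: ";" separates fields, "," is the decimal mark.
euroText <- "sample;conc\ns1;1,5\ns2;2,25"

# read.csv2() defaults to sep = ";" and dec = ",", so the numbers survive.
right <- read.csv2(text = euroText, stringsAsFactors = FALSE)
str(right)

# read.csv() defaults to sep = "," and dec = ".", so each decimal comma is
# treated as a field separator and the values are split apart.
wrong <- read.csv(text = euroText, stringsAsFactors = FALSE)
str(wrong)
```

Comparing the two str() outputs shows why it pays to inspect a file before choosing the reader: read.csv() does not raise an error here, it just returns a table that no longer contains a usable conc column.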
So you have to be careful. Inspect your data and make sure that your field separator and your decimal separator are not the same character. If somebody's written a comma separated value file but put in numbers that have decimal commas, not decimal periods, you're in trouble. If it's a larger file, this is probably going to be unsalvageable. Yeah. There's an underscore CSV and a dot CSV. One is labeled readr and one is... you tell us. What is the column on the right telling us? So this column here, the one in the curly braces, tells us which package this comes from. I've got a couple of packages loaded. I should unload this one so as not to confuse us. But readr, okay, readr. I don't know when you've loaded readr. I don't know if it's in my functions, but it's a really useful package. Did you use readr yesterday in your integrated assignment? No? So anyway, readr, R-E-A-D-R. readr is part of... did you discuss the tidyverse? No. But you did ggplot, right? Yeah. Without the larger tidyverse. Okay. Okay, so let's talk a little bit about the tidyverse. Now is as good a time as any. You might have noticed that some of R's functions are very much designed to work in the mindset of a statistician who would have been working in the 80s and 90s and early 2000s. So for example, whenever you read in a file, like with read.csv or read.delim, any text in that file is transformed into a factor. Any string in that file is transformed into a factor. Since in 99% of all cases we don't want that, we always have to write stringsAsFactors = FALSE. Otherwise this will mess up our data. Now, if you were doing statistics at that time, you would usually have descriptive factors in your data sheets, things like male or female, or some label for the cohort of whatever you were looking at. So it made a lot of sense to immediately transform strings into factors. We don't have that. We don't use strings as factors a lot.
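The effect of the stringsAsFactors argument can be seen on a tiny example. The sketch below uses made-up transcript labels (ENST_a and so on are placeholders, not real Ensembl identifiers). One hedge on the description above: the automatic conversion to factors is the default in R versions before 4.0.0; from R 4.0.0 on the default is stringsAsFactors = FALSE. Passing the argument explicitly gives the same result in either case.

```r
# Made-up CSV content; the identifiers are placeholders, not real accessions.
csvText <- paste("transcript_id,gene_type",
                 "ENST_a,protein_coding",
                 "ENST_b,antisense",
                 "ENST_c,protein_coding",
                 sep = "\n")

# Strings converted to factors: the column stores integer codes plus levels.
asFactors <- read.csv(text = csvText, stringsAsFactors = TRUE)
str(asFactors$gene_type)

# Strings kept as strings: the column is a plain character vector.
asStrings <- read.csv(text = csvText, stringsAsFactors = FALSE)
str(asStrings$gene_type)
```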
Usually our strings are simply accession numbers, and it makes no sense whatsoever to turn them into factors. You can't regress on accession numbers. That makes no sense. So we usually need to turn that off. That's one thing. Another thing is that some of these functions are slow. They're not always very robust against errors and so on. So people started thinking: aren't there better ways to work with complex R objects, to make them more robust and work with them in different ways? At the same time there was one of the very prolific contributors to important advances in R, Hadley Wickham, who is now working with RStudio, and who is basically the author of ggplot, right? He came up with different ways of using the programming language. You might have noticed that ggplot has a completely different mindset of how we go about working with plots, basically building a data grammar. There were other innovations there. Different ways of merging complex data frames and pasting them together. Things like not using intermediate assignment of variables, but just flowing data from one function to another function to another function with a pipe operator. All of that has become quite popular, to the point where in a sense this all constitutes a kind of language split. Things that are associated with this new way of using R often carry the prefix tidy. So tidy something, reading data and so on. So we often refer to that as the tidyverse way of using R. Now some people advocate using these different data formats immediately and teaching that way. Others advocate going the same way that we're going here, using very simple procedural commands, then later on learning more about what these other language alternatives and idioms are about and adopting them later.
People who advocate doing the tidyverse things first say, well, R is a functional language and all these tidyverse functions have been designed to work very well with a functional language, so if you want to learn R, that's the right way to do it. And they have a point. I say, on the other hand, R is just one of the many languages and paradigms we use, and we're probably better off working in a way that is easily translated and easily understood by people who come from other languages, and that doesn't emphasize idiomatic use of the language very much, to be flexible and future-proof and so on. And of course that's also true. And of course these two views are incompatible, so we have to make a choice between the two. So why am I saying all of this? It has practical consequences. For example, the readr functions really are better reading functions. The deficiencies of old R, read.csv and so on, are not apparent in the way we use them here, but once you get to very large data, the readr functions are very much faster and more efficient. The output of a readr function, however, is not a data frame. It's something that is called a tibble. It looks like a data frame, it behaves like a data frame most of the time, but not all of the time. This is why I'm always a little bit hesitant about introducing the readr functions. We're not going to do this here. We're going to keep on using pure base R in this workshop for the most part. But if you ever find yourself working with very large data and you need to be very efficient about it, by all means do explore things that you can find under the keyword tidyverse, and especially do explore the readr package. It's a very well written piece of software. So I don't know why readr is on your computer, but there we go, it gives me an opportunity to talk a little bit about the tidyverse. Anything to add to that? Do you use tidyverse code in your lab?
Only ggplot. Okay. So where were we? We need read.csv, not read.csv2. We're in the North American format here. We need to give it a filename. Header is true by default. The separator is a comma, this is also there by default. The decimal points, I don't even know if there are any. We fill empty spaces with NAs. We have no comment characters to define, so that should be okay. Let's see what this does. We have 1420 observations of 18 variables. Let's have a look at what they are. With this icon, the little spreadsheet icon, we can look into our data frames. The source is Ensembl. This is a gene feature. Start and end coordinates of the feature are defined. Some of them might have scores, not applicable here. There's a gene ID, there's a transcript ID, there's an exon ID, there's a gene type. So we see micro RNAs, we see antisense RNA and we see protein coding RNA. Not all transcripts produced from this locus are necessarily protein coding. Some of them, for example, are antisense, which are important for regulation. And we could have had variations and probes, but we didn't, so this is all just NA. Now, 1420 observations: many of these rows we don't need. We're not looking at antisense, for example. So let's remove all rows from this object that are not protein coding. How do we do that? Remove all rows that are not protein coding. So, step by step, what's the first thing we need to do? We need to subset, right? But how? Subset something, but by what? What's the semantics we need to figure out here? We need to understand what's in that file and how we can use it to identify rows that are protein coding genes. So the first question is, you know, where does it tell me whether something is a protein coding gene or not? So we look into our data, right? And there's a column called gene type, and apparently it holds values of that kind. So what do we find in column gene type? So the first step is to identify how a protein coding transcript is labeled. That's not even an R question, it's a question
about the semantics of your data. You have to understand what your data means. So apparently this is information we find in the column gene type, and it's there. So what different values are we looking at? The column gene_type, is that how it's spelled, with an underscore? Now what values exist in this column? What are the different strings we have in that column? How can we find that? We could just print all 1420 values, but we can use table. So the next step is to identify the possible choices. Why do we need to do that? Well, we've scrolled through this a little bit and we saw that there are some rows that are labeled with protein coding. There might be gene types that we didn't think of but that actually also identify data that we're interested in. So in order to be sure we're doing the right thing here, we really need to look at everything that's contained, all the alternatives we have in that particular column. So somebody mentioned table. Let's try and see what table does. Table of what? If I run the function table, what's the input? Table wants what type of input? What did we give table yesterday? A data frame? A file name? A vector? A vector of something. So table will go through a vector of something and then tell us what the unique elements in the vector are and how many of them there are. So, table. I need a vector. I have a data frame. How do I get a vector from a data frame? In particular, I need a column. Again, subsetting, right? We subset an entire column. So let's subset our data frame, GNAS transcripts, by what? What rows do we want? All of them. We want all the rows, so we put nothing restricting in here. And what columns do we want? Gene type. So what's gene type? It's a column name, so we can use that string. And we get aligned transcript, and antisense and antisense RNA, and I have no idea why these are two different things. Antisense and antisense RNA, why are they listed separately? Oh, a question. Just let me go through them. And CCDS gene, and lincRNA and lncRNA. I've
stopped keeping track of the different RNA types. What's the difference between lincRNA and lncRNA? What's the "i" for? Intergenic, long intergenic non-coding, right. So this would be one of those examples: you're interested in non-coding RNA and you think you see lincRNA, but there's also lncRNA, and you would need to discover that to get all of the non-coding RNA types that you need. Micro RNA, and miscellaneous RNA, and protein coding. So the rows that we're interested in are actually all the ones that are labeled protein coding. Okay, so that would work. There's another way that we would commonly use here, and that's the function unique. Unique is related to table in that it picks out unique occurrences, but it doesn't count. Well, we could be interested in the counts, but here what we need is actually unique. Now this way of subsetting with square brackets is perfectly valid, and in some cases you absolutely need to work this way. Commonly I find myself using a somewhat different syntax, and that's the dollar operator. Dollar operators on data frames return columns, so using a dollar operator gives me the entire column. I think I've just gravitated to using that because once I type in the dollar here, I get all of the columns in that data frame, I don't need to remember what it's called, and I can just pick it out: gene_type. So unique of GNAS transcripts, dollar, gene_type gives me the same information that I was looking for in a different way. Almost. Do you notice something about that output? What about that line here: nine levels, aligned transcript, antisense. What is that? Where did that come from? Is there something wrong with our column here? Let's look at what this column is with the useful str command, structure. This tells me this is a factor. Our gene type is a factor and not a character column, and it has nine levels. So there are nine different alternatives, but what it stores are the integers that correspond to these levels. So what read.csv did is it turned all the
strings in our data into factors. Didn't I mention we don't want that? I mentioned we don't want that. So whenever you use any of the read functions, turn that off: stringsAsFactors = FALSE. That's something that you really need to remember. This has to go into your muscle memory. Why do we need to type this all the time? Why can't we just set this as a global option? Well, the answer is you can. You can, but that will break every package you load that makes the assumption that strings are being read as factors. So I think setting global language options differently from what the default expectation is, is error prone. I wouldn't do that. I just roll with it, and every time I use read.csv or read.delim I explicitly say stringsAsFactors = FALSE. Okay, so we need to redo this command for GNAS transcripts. And now we get the unique values, and now, as expected, the column is a column of characters, i.e. strings. So far so good. Now, we said we want only rows that are protein coding. So how do we subset our table to the protein coding rows only? So this is a little filtering. Minus? Can we use minus? What does the minus need to attach to? Yeah, but minus excludes things from the rows. You can do this in different ways. You can exclude the ones you don't want, or you can include the ones that you want. Minus is something that excludes. But what does the minus need? Some index. So if we want to do it with minus, then we need indices, so we need to find the indices that correspond to the values we want, right? But if we use true and false, we don't use minus, though we might need to do something else with it. So how do we get trues and falses in that situation? A vectorized operation of equal. Exactly. So we test whether the contents of the column are equal to something. Can you write that? What does it need to be equal to? The string protein coding, that's what we're looking for. So try writing that subsetting expression. As you try that out, you're probably at some point
going to just print the whole table. Don't worry, that doesn't hurt the computer; it can do a lot of these. So just experiment and try to come up with a way to subset the table so that it only contains the rows that have the string protein coding in the column gene as transcripts, dollar, gene type. Once you've written the expression that subsets the table, please show us a blue post-it. If you think you're in trouble and it doesn't work, show us a red post-it.
Exactly. I didn't even use that; I used the same thing you have. So yeah, for subsets, you know, we only want subsets.
Okay, so if you are done with this, do put up your blue. I'm not going to write out the solution, but maybe, if you have trouble, let me remind you of what we want to do with subsetting and how subsetting works in principle. I think in the pre-work tutorial I've written that subsetting is needed every day and it's really important, and this is really something you need to commit to your muscle memory.
So let's look again at what we're trying to do here. We're subsetting a data frame. That's our original data frame. This means we subset by some condition, and in the data frame we can address rows and we can address columns. So to subset something, we put one of the selection mechanisms that we know of into the rows position, in a way where it will give us a subset of the frame. For example, we can put in numbers, integers, and these are indexes, and we can use negative numbers to exclude some rows. We can put in a vector of logical values that is as long as the number of rows, and this will return every row for which the vector is TRUE and not return any row for which the vector is FALSE. Or we can operate by row names or column names. So these are the three mechanisms: index numbers, logical values, or names. And here, because we want to filter by a condition, the most convenient way to do this is by a logical vector. So the challenge is
to create a logical vector that says FALSE for every entry in the column gene type that is not protein coding, and TRUE for every entry in the column gene type that is protein coding. So somehow, in here, in the rows position, we need to enter a vector that has that property: a vector which says TRUE for every element of that column which is protein coding and FALSE for every one that is not. So how do we get a vector of that type? Well, we take the vector of strings from that column itself and we compare it, using the comparison operator, the double equals sign, with the string protein coding. And then we specify the second part here, and that is all the columns that we need. Now, at that point we could kick out all the columns that just have NAs; we don't need them anymore. Let's just keep everything in for now.
There is just one more thing. In order to develop these things, an idiom I usually adopt is the following: I define myself a variable called sel, sel for selection, which then contains either the indexes or the logical vector or whatever, and I define my selection there. This makes it easier for me to get an idea whether my selection command worked in the correct way. I can execute that command and look at my selection vector and see: how many TRUE values do I even have in that? How long is the vector? Is it as long as the number of rows, so that it's correct? Did this selection expression even work, and does it give me the correct results? That is often easier to tell from the vector itself than by snooping around in the data frame and trying to figure things out. So sel is assigned something that creates the logical vector that we want, and then we just use sel in here to subset by that vector. I could also, you know, take this long expression here and put it in here, but then these nested expressions become very long and unreadable. I like to break these things up, and breaking things up by defining an explicit selection is something that's turned out to be very useful. So that said, maybe some of
the people with the red post-its can figure it out, and if you can't, keep your red post-it up. So for this "something that creates a logical vector", here's an example: if I have a vector that contains the letters A, B, A, D, E and I compare that vector with the letter A, I get a logical vector which is TRUE, FALSE, TRUE, FALSE and FALSE. So you need to take that principle and translate it to an operation that you perform with gene as transcripts, dollar, gene type as the vector where the information is that you're looking for, and the comparison is with a string, i.e. protein coding.
So it might not make a difference, I've never done that, right? It returned the columns where the gene type was, right? But I've lost that column. And yeah, I don't know the sensitivity. So it's just really wide, so it just breaks it up in those samples. Oh, yeah, so it looks like you got it. So then you were saying this is not required, though? I don't know. No, you just leave it blank if you want all the columns. Oh, okay. Yeah. Thank you. Sorry. Yeah. So we just put in sel, like I'm supposed to be putting it in. Exactly. Yeah, so sel is a logical vector, right? So for every row, or actually, to be strictly correct, every element of this vector where it matches, it's going to be TRUE. And so for the purposes of this data frame, we take every row where sel is TRUE. That's sel.
And then, before the comma is rows, after the comma is columns, so you just put it in the spot before the comma. So sel. That's it. That's it. So what is sel actually going to do? So it is going to take the rows where sel is TRUE and put those into this new x object. So from this gene as transcripts, take only those rows where it actually is TRUE, and put them in x. Yeah. Yeah. Note it's embedded in this square bracket subsetting. So okay. Right. And so, I don't know, let's say we have a different one. A, B, one is going to be less than three. Right. So let's try something like this. And the first thing that's going to be successful is going to be sel. So that's correct. Do you have a neighbor? Yeah. Okay. Let's show you a table. Okay. Yeah. No. Yeah, actually, except this one is a factor. But they're both referring to a factor. Okay. So it's like TRUE, TRUE, FALSE, FALSE. Right? Okay, so it's TRUE for the first two. Now, we want a subset. We need a subset. So we want the indices for a subset, and we do it on the rows. And then how do you put it in here? So what I'm going to do is I'm going to try to see, right? So we're just asking: which one is TRUE? What's TRUE? So that's what I'm going to do. So you have the indices? Yes. So it's just square bracket indexing. So you have a vector. It's got one dimension. And you have a data frame that's got two dimensions. Okay. I know. Exactly. Yeah, so it's like, I don't have to tell it what that is. I don't have to say. It's built-in. Yeah. So what do we do? We can actually evaluate that. We can actually evaluate it.
So we can say which(sel). So what it does is it returns the indices that are TRUE. Right? So now it's going to be a vector that's 1, 2. And now we're asking for rows 1 and 2. So we get the same thing. Yeah. That's true. Are you just using the same thing? Yeah. So I think one could do a... Yeah, a vector. It's one-dimensional; it's just a one-dimensional collection of values of the same type. So a vector is always going to have a c(). So that's how we call it. That's correct. That's only one dimension. So that's going to be a concatenated vector. Thank you. Oh, yeah. So you don't really need to do a lot of this. So here you put your... Here. Yeah. So this is... And you want to say a chance? Yeah. That's what we're doing. So you look at this. Great. So for every row where this is really protein coding, it's TRUE; for every row where it isn't, it's going to be FALSE. So for these guys, gene as transcripts. Yeah. So this guy, it's got two dimensions. It's got the rows, and here are the columns. The rows are before the comma. The columns are after. If you want all the columns, you just leave it blank. And then here, if you want only certain rows, the rows where this is TRUE, you can just put in a vector, which is this. Yes. Exactly. It's the one you just defined. So you just put in this vector. Great. And now, if you check this, you can take it along. Oops. So none of them are equal to this. But they are equal to that. Great. So it has to be an exact match. Plank codes are both blank and not the code. This blank is wrong. So this one will work for you. Okay. Other stuff. You're looking for how many? How many? Why? Yeah. Okay.
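The point of this exchange, that a logical vector and the index positions returned by which() select the same rows, and that the comparison has to match the string exactly, can be sketched on a toy data frame. Everything in this sketch is made up for illustration and is not the course data: the name df, the column names, the values, and the underscore spelling of the type string are all assumptions.

```r
# Toy data frame (hypothetical names and values, not the course data).
df <- data.frame(id = 1:4,
                 type = c("protein_coding", "protein_coding", "lincRNA", "miRNA"),
                 stringsAsFactors = FALSE)

sel <- df$type == "protein_coding"  # logical vector, one element per row
idx <- which(sel)                   # positions of the TRUE elements: 1 2

byLogical <- df[sel, ]              # rows where sel is TRUE, all columns
byIndex   <- df[idx, ]              # the same rows, selected by position

# Rows go before the comma, columns after; a blank means "all of them".
nrow(byLogical) == nrow(byIndex)

# The match must be exact: a blank instead of the underscore matches nothing.
sum(df$type == "protein coding")    # 0
```

Both forms return the same two rows here. The logical form is usually the more direct choice when filtering by a condition; which() is handy when the positions themselves are needed.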
So you can see, so you can find that's the text. Okay. This is the, we want all four. Very nice. Okay. So I think the same number of rows. Yes. Okay. So what are you trying to do? So this was a very, well, I don't know if there's anything to do with that. But we did a little bit of that. Okay. The way we did that. But then if I'm going to do that. So I don't want to do that. So you're going to do that? I don't want to do that. So do I then? So there's a comment on how I did that. I feel like you're probably looking at the 190, but that's just right. So you're saying you're doing that right now? I hope she does. What do we say? No, that's not going to be cool. So I'm not too sad. I'm just going to do a little bit of that. Because I failed to do the transcript. Okay. So you're going to do it? No. You are right. Okay.
To send you off into the coffee break, this is the canonical solution. I've seen possible variants, but do try to get to that point. Thank you. If you still don't really understand how this works, try to think of a few simple examples, examples that work in a similar way to the one that I've written up there, and try to work with them. This is really one of the key steps of understanding how to work with data in R: how to subset things and how to filter them. We'll be doing that again and again, and working with it again. So the canonical solution here is to pull out the column gene type from our gene as transcripts, which is a vector of strings, and we compare that with the equals operator to the string protein coding. And the result of that is a logical vector of the same length as this column here, which is TRUE for every element that was protein coding and FALSE for every other element. This new logical vector, this vector sel, has 1,420 elements, so I can use it to subset the rows of my data frame.
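As a minimal sketch of the comparison step alone, here is the small letter example that was given as a hint during the exercise, written out in base R. The variable name v is an assumption for illustration; sel follows the naming idiom described in the lecture.

```r
# Comparing a vector with a single value is done element by element
# and returns a logical vector of the same length.
v <- c("A", "B", "A", "D", "E")

sel <- v == "A"
sel                        # TRUE FALSE TRUE FALSE FALSE

# Inspect the selection before using it, as suggested in the lecture.
length(sel) == length(v)   # one element of sel per element of v
v[sel]                     # only the elements for which sel is TRUE
```

The same pattern applies when v is a column pulled out of a data frame with the dollar operator and the value compared against is a type string.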
If I take gene as transcripts and subset it such that I only use the rows for which sel is TRUE, and all of the columns, my data frame, which was originally 1,420 rows, shrinks to 866 rows. And that's because, remember this little trick, we can sum over a logical vector and thus count the number of TRUE values in the vector. If we sum over sel, we see there were 866 TRUE values in that vector. And this corresponds to the fact that we now have 866 rows left in the data frame after throwing out the rows that we were not going to be using later on. Yeah? Let me check that. Right. So revisit that. Think about it; this is one of the key steps. This is something that you really need to understand. Ask us during the coffee break; we'll be breaking for coffee now. We will be back, continuing with this, at 11.
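For revisiting the whole sequence, here is a hedged end-to-end sketch: read the data without factors, check the column type, build the selection, subset, and confirm the row count with sum(). The three rows of data, the object names transcripts and coding, the column names, and the underscore spelling of the type string are hypothetical stand-ins, not the course file. Note that in R 4.0.0 and later the default for stringsAsFactors is already FALSE, so passing it explicitly is harmless there and necessary on older versions.

```r
# Hypothetical stand-in for the course file: three made-up rows read from text.
csv <- "gene_name,gene_type
geneA,protein_coding
geneB,lincRNA
geneC,protein_coding"

transcripts <- read.csv(text = csv, stringsAsFactors = FALSE)

str(transcripts$gene_type)      # should report a character vector, not a factor
unique(transcripts$gene_type)   # the distinct type labels, without counts

# Explicit selection vector: TRUE where the type is an exact match.
sel <- transcripts$gene_type == "protein_coding"
sum(sel)                        # number of TRUE values, i.e. rows to keep

coding <- transcripts[sel, ]    # rows where sel is TRUE, all columns
nrow(coding) == sum(sel)        # the subset has exactly that many rows
```

With the course data the same check would compare the row count of the subset against the sum over sel, which the lecture reports as 866 of 1,420.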