We left off yesterday thinking about different ways of formatting the symbols on our scatterplots to represent additional data. We played with orange and blue colors, and I said there ought to be a way to scale the symbols according to something that represents the age of the crabs, so that after we plot them with principal component analysis we can see how crabs of different size, and probably different age, relate to each other in their groupings. To recreate the data, just execute these three lines. One possible solution for plotting this, among many, is in the file crabsplot.r. It's a function. The function takes as parameters the x and y coordinates of the symbols we want to plot, a vector of factors for the species, a vector of factors for the sex, and a vector for the age of the crabs. I'm also defining three other parameters: main, xlab, and ylab. In every plot, main, xlab, and ylab are the main title, the label for the x-axis, and the label for the y-axis. I'm setting these to the empty string, which means the default value of these parameters is nothing, just an empty string, but I could put something there. If I don't set a default in my function definition and then don't pass a value for that parameter, I get an error that the parameter is missing. But if I do set a default, that parameter is not missing, it is initialized and defined, and I don't need to type it every time I call the function. That's the advantage: defined this way, I can pass a string for main, xlab, and ylab, but I don't have to. The parameters become optional. So this is how we make optional parameters in function declarations. Inside, we start by reading the number of points from the length of the x vector and storing that in a variable n.
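The signature just described might be sketched like this. The parameter names and the exact body of crabsplot.r are not shown in the lecture, so this is only an illustration of optional parameters with string defaults:

```r
# Hedged sketch of the function declaration described above; the real
# crabsplot.r may use different names. main, xlab, ylab default to "" so
# they are optional.
crabsPlot <- function(x, y, species, sex, age,
                      main = "", xlab = "", ylab = "") {
  n <- length(x)  # number of points to plot
  # ... body would compute colors, plotting characters, and sizes, then plot ...
  invisible(n)    # placeholder return for this sketch
}

# Optional parameters can simply be omitted:
crabsPlot(1:3, 4:6,
          factor(c("B", "O", "B")), factor(c("M", "F", "M")), c(1, 2, 3))
```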
The first step is to create a color vector computed from the species and sex factors. If we take these factors as integers, each can have the value one or two, and we can use that to pick a value from a vector of four colors. For aesthetic effect, we're coloring the blue males and the blue females slightly differently, and likewise the orange ones. The way I do this is rather ad hoc: I first transform the first factor from the original one and two to zero and two, and leave the second factor as is. If I then simply sum the two factors, I get the values one, two, three, or four, which are now unique for blue males, blue females, orange males, and orange females, and I can use them to pick an element from the vector of colors I'm using here. So my color set is these four colors, two blueish and two orangish. My color index is a vector computed from the species and sex factors, and my color vector is then built by simply choosing one of the colors according to the computed index. Similarly, I need a vector of plotting characters. I want triangles for males, so I'm using plotting character 24, which is a triangle with a border, and 21 for females, which is a circle with a border. That makes my plotting character set, the numbers 24 and 21, and the plotting character vector picks from this set according to the integer value of the M or F factor passed into the function. Finally, I want a third vector that scales from minimum age to maximum age, so I need to define a scaling. What I'm changing with this is the parameter cex, for character expansion, a scaling factor that you can use for labels or for symbols.
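The factor-to-index trick just described can be sketched on a tiny invented example. The factor names, hex values, and the order of the pch set are assumptions for illustration; only the arithmetic (shift one factor to {0, 2}, add the other to get a unique index 1 to 4) is from the lecture:

```r
# Invented example data: two species levels, two sex levels.
species <- factor(c("B", "B", "O", "O"))
sex     <- factor(c("M", "F", "M", "F"))

# Illustrative color set: two blueish, two orangish tones.
colSet <- c("#4AA0EE", "#33B1C7", "#EEA04A", "#C7B133")

# Shift species codes {1,2} to {0,2}, add sex codes {1,2} -> unique index 1..4.
colIx <- (as.integer(species) - 1) * 2 + as.integer(sex)
cols  <- colSet[colIx]

# Factor levels sort alphabetically (F before M), so index 1 -> circle (21)
# for females, index 2 -> triangle (24) for males.
pchSet <- c(21, 24)
pchs   <- pchSet[as.integer(sex)]
```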
By trial and error, plotting a few symbols, I find that a scaling factor of 0.5 gives very, very small points that are still visible, and a maximum of 4 gives nicely large points. So I now have an arbitrary set of numbers proportional to age, and I need to map them into the interval 0.5 to 4 to use them as cex values. First I transform the arbitrary values into the interval 0 to 1. That's simply done by taking whatever age I have, subtracting the minimum, so that the smallest value becomes 0, and then dividing by the difference between the maximum and the minimum. Once I do that, my smallest number is 0 and my largest number is 1. This is a standard normalization that we often do to map data into a predefined interval: subtract the minimum, then divide by the difference between maximum and minimum, which transforms the values into the interval [0, 1]. Then I simply multiply by 3.5 and add 0.5, i.e. multiply by the difference between sMax and sMin and add sMin, and that moves the values from [0, 1] into the interval from sMin to sMax. Now I've transformed my ages into numbers I can use to scale my plotting characters. Then I simply plot the whole thing: I plot x and y, main = main. Note that main here has two different meanings. The first main is the parameter of the plotting function; the second is the variable, the argument, that was passed into our function. That's a very standard way to write these things. It can be a little confusing, because we're using the same word to mean two different things; sometimes I prefer to write things like main = myMain, xlab = myXlab, to make it visually distinct.
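The normalization just described, on invented ages (the endpoints 0.5 and 4 are from the lecture):

```r
# Map arbitrary age values into the cex interval [sMin, sMax].
sMin <- 0.5
sMax <- 4
age  <- c(3, 7, 12, 20)   # invented example ages

ageNorm <- (age - min(age)) / (max(age) - min(age))  # now in [0, 1]
cexVals <- ageNorm * (sMax - sMin) + sMin            # now in [sMin, sMax]
```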
However, in functions it's actually advantageous to name your arguments the same way they're expected here, because we're passing arguments through to the plotting function, and if the argument in my function's parameter list has the same name, there's no confusion about what main is supposed to achieve. It's just called main because in plotting functions, points() and so on, it's always called main. So keep in mind that even though these are the same words, they have a different semantic meaning in the context of this definition. The plotting character is now the vector of plotting characters we just produced: the plot function expects a vector of the same length as the number of points and picks out the corresponding plotting character for each point. Similarly for the colors, it picks out the corresponding color. There are two color specifications here: col is the border of the symbol and bg, for background, is the fill color we give it. And cex is the scale. Once we source this function, we can plot the PCA, principal component two versus principal component three. Our species, sex, and age arguments are column one of the crabs dataset for the species, column two for the sex, and we take the first principal component as a stand-in for the age. We give it a main title, "Principal components 2 and 3, distinguished crabs", the x label PC2 and the y label PC3, and we get a plot that looks like this. Even though this is the same information as the plot we made previously with the plain triangles and circles, some additional information becomes visible. I think this would more or less be okay to send off to a journal for publication: it has a title that says what it is, it has labels on the axes, and it's a rather informative plot. Now, yesterday we speculated that young crabs would be more similar to each other, and that seems to be true in general.
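Tying the plotting pieces above together, the final call might look like this. The data are invented; the point is the pch/col/bg/cex vectors, where col gives the border and bg the fill, as described:

```r
# Minimal self-contained sketch of the plot call described above.
x    <- c(1, 2, 3, 4)
y    <- c(2, 1, 4, 3)
cols <- c("steelblue", "steelblue", "orange", "orange")  # invented fills
pchs <- c(24, 21, 24, 21)                                # triangles / circles
cexs <- c(0.5, 1.5, 2.5, 4)                              # scaled "ages"

pdf(NULL)   # draw to a null device so the sketch runs without a display
plot(x, y, main = "Example", xlab = "PC2", ylab = "PC3",
     pch = pchs, col = "black", bg = cols, cex = cexs)
dev.off()
```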
So the crabs are most different along principal components two and three the larger they are, and here in this region we find more of the young crabs. It's perhaps more pronounced in the blue triangles, but there's a clear trend that they tend to be more similar to each other at a young age. So that's nice: just looking at this, we can infer some biology. [Question] Reading this as a biologist, the axis labels seem somewhat vague; they don't tie back to the measurements that were actually done on the crabs. Is there a way to pull out of principal components two and three that it's actually this measurement plus this measurement plus a bit of that measurement that gives us the distinction? That's what people want to know, right? Exactly. The answer is yes and no. Of course, the actual principal components give you exactly that: they tell you this principal component is composed of 0.534 of the first dimension and 2.56 of the second dimension and minus 1.0538 of the third dimension and so on. That's mathematically correct, but it's also fairly meaningless, because in principle the components are composed of all of the columns. So that's not so helpful, but there are other methods, and that's an excellent point: we can do something very similar to this kind of principal component analysis and then start interpreting, and that's really important. This is exploratory data analysis with models, and that's exactly what we're going to do next, for exactly that reason. I think we have everybody in the room now. For those of you who weren't here before, you'll need to reload yesterday's dimension-reduction project, update the files with Tools, Version Control, Pull Branches, and then just look for the line number we're currently discussing. So: we've done some plotting with PCA, with principal component analysis.
Now what I'd like to do is explore the structure of data sets using PCA and other methods. This is where we really try to understand what's in the data. For that we'll use a data set of expression profiles. This is one of the classics of high-throughput biology, one of the first relatively high-throughput experiments, by Raymond Cho in 1998: a data set of expression profiles of yeast cells. The paper is this one in Molecular Cell, "A genome-wide transcriptional analysis of the mitotic cell cycle". Who's seen this paper before? Oh, come on. Don't they teach that in Bio 120 these days? It was 20 years ago. That's atrocious. Anyway, this was the first really high-throughput use of microarrays to infer biology, not hypothesis-driven but what we call discovery science: looking at everything at once, then looking at the data and trying to find the patterns. Some people call discovery science "fishing expeditions", but it is discovery science, we're trying to discover something. So that's the paper this comes from. It's a data set that we've cleaned, and I'm giving you only a subset of 237 genes that were known or suspected to be involved in cell cycle regulation. These are the cell-cycle-regulating genes in yeast. That doesn't necessarily mean they are themselves regulated in a cyclic fashion; it just means they are involved in the cell cycle in some way. So we don't yet know what their expression profiles are actually going to look like. The data is in this text file. It has a column on the left-hand side with a systematic gene name, and then it has expression values, in this case log-ratio numbers. So we skip the first line and then read only columns 3 to 19, which are the actual data columns, with the read.table command. The genes have been classified, but we're not going to use those classes; columns 3 to 19 are the actual data. 237 rows, and the columns get our standard V3, V4 names.
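The read step just described can be sketched on a tiny fabricated stand-in file, since the real file name isn't shown here. The shape mimics the lecture's file: a header line to skip, a gene name column, a class column, then data columns; skip = 1 drops the header, and the column subset keeps only the data:

```r
# Fabricate a two-gene stand-in file (the real data set has 237 genes and
# data in columns 3 to 19).
demo <- tempfile(fileext = ".txt")
writeLines(c("header line to skip",
             "YAL040C 1 0.1 0.2 0.3",
             "YLR079W 2 0.4 0.5 0.6"), demo)

raw  <- read.table(demo, skip = 1, stringsAsFactors = FALSE)
expr <- raw[ , 3:5]   # in the lecture: raw[ , 3:19]
# read.table invents V1, V2, ... names, so the kept columns are V3, V4, V5.
```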
This is what R creates if it has no other header information. Let's look at the general trends of the measurements. This is a box plot of all the single measurements. So what does one column represent? What's the biology behind it? What is one column? You're already abstracting it: the numbers in this one column come from where? A microarray. And what are we putting on that microarray? RNA. Where do we get the RNA? From cells. So we have one set of cells that has presumably been synchronized, then released, and then at some time point, the first time point here, we take the cells and extract the RNA. We wash it over a microarray, compare the fluorescence we get from that microarray with a reference population, take the data, and store it. So each of these columns corresponds to one microarray experiment: one individual set of cells and one microarray experiment. Now, if everything were perfect, we would expect, since regulation is more or less a zero-sum game, the average expression over all genes to be approximately constant. That's actually not completely the case, but given the state of the technology then and what they had to develop and overcome, I think this is pretty good data. It's nearly all constant around this level. There's one outlier here; we'll notice it more often. Something presumably went wrong in that experiment, and all the fluorescence values came out a little higher. That's the nature of actual experimentation. Now, if we're analyzing this in the context of a cell cycle, we're looking for expression changes, so we're less interested in the absolute values of expression and more in whether the changes happen in a cyclical fashion. To do that, we can scale all rows to have a mean of zero and a standard deviation of one.
That means we're no longer really comparing whether genes are highly or lowly expressed, in particular because we're trying to find similar genes, and we don't want genes to be considered similar just because they're lowly expressed. If something has a low expression level, that doesn't necessarily mean it isn't still a very potent gene. Some of the most important cell cycle genes, of course, are transcription factors, and there's no need for the cell to have more transcription factor molecules than binding sites; I vaguely remember the average concentration of the lac repressor in E. coli being quoted as 0.7 proteins per cell. So these can be very potent even though they're only lowly expressed. What we're more interested in is how the shape changes over the cell cycle. So in the following analysis we're really comparing the shapes of expression profiles, not their absolute values, and for that we scale each to a mean of zero and a standard deviation of one. This works essentially the same way as what we just encountered in our little crab-plotting program, where we scaled ages to symbol sizes. Scaling to a mean of zero means computing the mean and subtracting it from all the values, so the mean becomes zero; dividing by the standard deviation then ensures a standard deviation of one. R has a command to do that, called scale, which can be applied to whole data sets. However, scale by default works on columns, and here we want to scale the individual rows of the data set. So to use scale, we could either iterate through everything with a for loop and scale by hand, or we can transpose the data set, scale it, and transpose it back. Remember, the transpose of a matrix changes all the rows into columns and all the columns into rows.
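The transpose-scale-transpose idiom just described, on a tiny invented matrix:

```r
# scale() works column-wise, so to scale ROWS we flip, scale, and flip back.
m <- matrix(c(1, 2, 3,
              10, 20, 30), nrow = 2, byrow = TRUE)

mScaled <- t(scale(t(m)))   # each row now has mean 0 and sd 1

rowMeans(mScaled)           # approximately 0 for every row
apply(mScaled, 1, sd)       # 1 for every row
```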
So basically, you flip the matrix on its side, scale it, and flip it back again. Transpose in R is just t(). So we take a temporary variable x, transpose the data, scale it, transpose it back, and then box-plot it again. Now this has emphasized the outliers, which is fine. One possible interpretation: since we expect a lot of noise in genes that are not expressed at high levels, we are now emphasizing that noise, amplifying it relative to the signal, since everything has the same standard deviation. We'll also change the column names: instead of V1, V2, and so on, we'll call them T1 to T17, for time points, and then look at the header again. So that's the data we're going to try to cluster and work with. Columns T1, T2, T3 are different time points; they represent different microarray experiments, and we've scaled the individual expression profiles. Now let's plot some expression profiles just to get a feel for what the data are. We've seen numbers; let's look at some actual profiles. I use sample on one to the number of rows in my data set to pick out 10 random rows, and assign those 10 random rows to my selection variable, sel. Then I can make a lines plot of the type we've just seen, using cm.colors, colors that range from cyan to magenta, basically a spectrum of bluish to pinkish tones, just to distinguish the curves. You can see the expression values here: we seem to have a few high values at the time point where we've already noticed the outlier; some go up, some go down, some are kind of random, some show the expected pattern of cyclical variation and some do not. Now let's explore that with principal components. First we calculate the principal components: prcomp, assign this to pca, show it, and plot the principal components.
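The random-profiles plot just described might look like this on invented stand-in data (the variable names sel and dat are assumptions; only sample(), cm.colors(), and the line plot are from the lecture):

```r
set.seed(42)
dat <- matrix(rnorm(237 * 17), nrow = 237, ncol = 17)  # stand-in for the scaled data

sel  <- sample(1:nrow(dat), 10)   # 10 random row indices
cols <- cm.colors(length(sel))    # cyan ... magenta spectrum

pdf(NULL)                         # null device so the sketch runs headless
matplot(t(dat[sel, ]), type = "l", lty = 1, col = cols,
        xlab = "time point", ylab = "scaled expression")
dev.off()
```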
That's what it looks like. What does that mean, what are these bars? The same thing we saw yesterday: the amount of variance of the entire data set, over all dimensions, that is explained by the first, second, third and so on principal component. Very different from our crabs data, where basically all the variation was in the first principal component, we have a gradual fall-off here. The first one is the largest, obviously, but there's no clear cutoff where we could say we'll just throw out some of the principal components without any loss in explaining the data. Let's compare this to a matrix of random numbers. The question is: are these principal components any different from what I would get from random numbers? So I take a set of random numbers from a uniform distribution, with the same number of rows and columns as our original data set, and plot the principal components for that. That's what random numbers look like. So this is our cell cycle genes; these are random numbers. With random numbers, it all just falls off in a roughly linear fashion. In terms of trends in the principal components, this tells us that there is information in the data that can be pulled out by a linear decomposition like principal component analysis: there's more information in the first few principal components, and less in the last ones, than we would expect if the data set were purely random. Yes? [Question: is there anything special about the uniform distribution?] No, I could do something different, and if I wanted to be absolutely sure it's not due to the actual numbers, I could simply permute the entire data set, reusing the numbers but in shuffled order. But this kind of linear scaling should be taken care of by the principal component analysis, so I wouldn't necessarily say so. And remember that the important information in our crabs data was in numerically very, very small principal components.
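The comparison just described can be sketched as follows. Here dat is an invented stand-in (with the real, structured data the first few components would stand out much more clearly against the random matrix):

```r
set.seed(1)
dat <- matrix(rnorm(237 * 17), nrow = 237)   # stand-in for the scaled data

pcaData <- prcomp(dat)
pcaRand <- prcomp(matrix(runif(237 * 17), nrow = 237))  # same-sized random matrix

# Fraction of total variance explained per component, for side-by-side comparison.
varData <- pcaData$sdev^2 / sum(pcaData$sdev^2)
varRand <- pcaRand$sdev^2 / sum(pcaRand$sdev^2)
```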
If we had said, well, that's less than one, let's throw it out, we wouldn't have gotten the result we needed. Okay, so let's look at what this looks like in a biplot. This is how the genes get pulled apart. These are the biplots along the different principal components, and we see that we can find clusters here. The question now is: are these similar expression profiles? What did we actually cluster? What does this all mean? What are these principal components? So let's look at the actual principal components. Here's a plot of the so-called rotations, i.e. the values of the first four principal components: this is how they run. Basically, these are the numeric values of the transformation along each of the dimensions. We can see that the first principal component essentially has a big peak here, which corresponds to the microarray data set that we said was systematically shifted up. This principal component explains that; if we subtracted it, we would shift everything down again and normalize things better. So that's in the first principal component. The second principal component goes up, comes down, goes up again, and so on; that looks like cyclical regulation. The third principal component does the same thing, up, down, up, down, but earlier than the second one: it's shifted in time. Mind you, the principal component analysis knows nothing about time; it treats all of these dimensions as independent. It's just us who know that they're ordered in time and that there's a correlation there. So, coming back to the question of whether these dimensions are in any way meaningful: we can derive a certain level of interpretation from looking at how these principal components run. The loading of each principal component basically says how strongly a particular dimension contributes to that projection. In microarray analysis, these are often called eigengenes.
That refers to the eigenvalues of linear algebra. [Question] I'm not saying it's not real; I'm saying it corresponds to an anomaly in the microarray data that we've seen before. If we think back to the overall box plot, the general trends there look rather like the first principal component: a little bit raised here, then one measurement that goes up and comes down again. So there seems to be a general trend of variation in the data that can be removed, which would probably simply correspond to noise, because we would expect the average measurements to have the same mean. We could have gone and scaled the columns after scaling the rows, and after that I think this component wouldn't exist anymore; probably the first principal component would then be what we see as the second principal component right now. So implicitly, this does a bit of that scaling for us. Okay. Rather than just looking at numbers, I'd like a way to fetch the actual gene names together with the expression values, so we can start analyzing better what we're looking at. So I pull out the vector choGenes from the original table by simply taking column one, which has all the systematic names. That's this here: YAL040C, YAL062W, and so on. Let's define a function to list genes if we only know row numbers, and check that the function works. Yep: row numbers one, two, three, five, eight, and so on. Now we can start exploring some similar genes. We plot principal component two against principal component three, and we plot the row numbers. This is very small; you'll probably see it better on your own computer. Now I can squint at this plot and try to find genes that cluster together in this two-dimensional projection of the principal components.
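The helper just described might be sketched like this. The function name and the tiny choGenes vector are illustrative (the real vector holds all 237 systematic names from column one of the table):

```r
# Invented five-gene subset standing in for column 1 of the original table.
choGenes <- c("YLR079W", "YAL040C", "YPR119W", "YGR108W", "YDL155W")

# Given row numbers, return the corresponding systematic gene names.
listGenes <- function(rows, geneNames = choGenes) {
  geneNames[rows]
}

listGenes(c(1, 3, 5))
```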
So for example, if I select genes 104, 72, 86, 148 and so on and overlay my plot with these gene symbols in red, we have a group of genes here, and if I plot these genes, this is what they look like. So indeed, genes that are close together in a projection along the principal components have similar expression shapes: these are genes that start high, fall off, rise again, fall off, rise again. In this way, from these 237 genes, we can start pulling out different types of genes and exploring what these genes are. These are their systematic names, using our listGenes function. Now, is that different from just picking random genes? Would random genes look similar? Do all of these genes look the same? To make that comparison, we just pick 10 random numbers from one to 237 and plot those. And no, random genes look very different. Again we see this strange microarray peak, but otherwise they look very different from our selection of genes that cluster under these principal components. Let's use a different plot. The green labels here are the ones we just had in our first plot, and I'm selecting a few genes from down here to see how they differ. Now, this is a different projection: principal component one against principal component three. My second selection is these red genes here, and that's their matrix plot. The difference: our first set started high and then came down; this one starts low and then increases over the cell cycle. So by exploring groupings of genes in these principal components, I can find genes that might be co-regulated, or at least have very similar expression profiles, and from that I can start analyzing genes and discovering biology. Now, one thing that's a bit of a nuisance is all these systematic gene names.
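The overlay step described above can be sketched as follows, again on invented stand-in data: plot the row numbers at their PC2/PC3 coordinates, then re-draw a hand-picked group in red. The row numbers 104, 72, 86, 148 are the ones mentioned in the lecture:

```r
set.seed(7)
dat <- matrix(rnorm(237 * 17), nrow = 237)   # stand-in for the scaled data
pca <- prcomp(dat)

sel <- c(104, 72, 86, 148)   # genes picked by eye from the plot

pdf(NULL)   # null device so the sketch runs headless
plot(pca$x[, 2], pca$x[, 3], type = "n", xlab = "PC2", ylab = "PC3")
text(pca$x[, 2], pca$x[, 3], labels = 1:nrow(dat), cex = 0.5)  # all row numbers
text(pca$x[sel, 2], pca$x[sel, 3], labels = sel, col = "red")  # the selection
dev.off()
```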
Maybe if you're yeast geneticists, some of these gene names will come up again and again and you'll remember what they are. But what I'd really like to be able to do is print standard names. So we'll do a little digression and think about how we integrate data sets. This is really important: for most of what we're doing today, we're still looking into individual data sets, but really, really important information can be had from integrating data sets from different sources. What I'd like to do here is improve the gene listing to get the common names of the genes, the standard names. So that's the task: we want to annotate the genes with standard names. We have the systematic names and we want the standard names. You can easily imagine that instead of this, we'd follow a similar procedure to annotate gene names with GO terms, or pathway membership, or whether we found them in a different screen in a collaborating laboratory, and so on. It's the same kind of procedure. So how do we do that? What we have are systematic gene names; what we want are standard names. Right, we need some kind of mapping file. Where could that come from? NCBI, perhaps? I don't know; what would you do? That sounds like a good idea, so let's Google it. What would you search for? Did anybody find something? Oh, the file's in the project? Come on, that's cheating. Where would you find it otherwise? I mean, it's not just about doing what we're doing here; it's about learning how to solve similar types of problems, and as I said, you'll have the same problem when you, for example, want GO terms or pathways. Now, for this particular problem, how do we solve it if we don't have that file? Yep, that's one possible way, but it's quite complex.
I'm not sure that actually contains the standard names, though. I often wouldn't know how to solve this in Bioconductor; I would expect there's a solution there, but Bioconductor is very large, and I would clear my schedule for half a day if I wanted to go and find it there. Does anyone know of a quick and easy solution? Is there something in the paper? The paper's from 1998. One way to do it is via BioMart, the Ensembl BioMart, or the Bioconductor biomaRt package. Good for them; well, it's good that we didn't pursue that solution here. Okay, did anybody find information like that somewhere on the web? Okay: YeastMine. "Eight cool things you can do with YeastMine". YeastMine is awesome. We can enter a list of identifiers; we could do that. They have a Perl API. There's also yeast.txt from UniProt, oh, look at that. That would be one possibility: it has the systematic name and the standard gene name. There's a conversion tool at YEASTRACT, ORF names to gene names; ORF names are the systematic names. yeastgenome.org? Can we give it a list of gene names? That's SGD, which is what we had before with YeastMine. Okay, let's try that. How do we get a list of all of the gene names we're working with? Remember, we made that vector choGenes, so I would guess all I need to do is cat(choGenes), copy them into the identifier box, select type Gene, pick the organism, Create List. They found 215 of our 217. Unfortunately, that's a very typical outcome: whenever you get a gene list from anywhere, you will have duplicates, you will have typos, you will have genes that are no longer called what they used to be called because they've been updated, and so on. But this more or less works, and we should be able to get symbols from it. So save the list; now what do we do with it? List analysis. What would you do? We have a list, we have an interface. The whole thing? Perhaps we can get only the columns that we actually want.
Manage columns, remove columns. We don't need the primary DB ID. We do want the systematic name, because we can't guarantee we're getting the same order we had before, so we need something to match the gene symbols back to our list of genes. We don't need the gene organism short name, we already know what that is. We absolutely do want the standard name; and do we want the gene name as well? Yeah, why not, that's good to have, so we can look things up and see what we get when we look at data sets. So apply changes, then export it: flat file format, download file, file name genenames.txt, and download a TSV file. That looks very good. Now, one thing I don't like about this, and this is why it's good to look into the data, are these two quotation marks. They may be trouble for the reading process, so for these two I'll just change it by hand: I take the systematic name, put it here, and call the standard name "unknown". I once had a rather large file where something was annotated as something-something-prime, with a single quotation mark, and that effectively crashed R until I discovered what the problem was. I tried to read the file and it always hung: the single quotation mark was interpreted as the start of a string, everything that followed became part of that string, and the string grew too large to handle while R looked for the matching quotation mark. So if you have files that contain quotation marks, there may be trouble; try not to have quotation marks in files. Okay, so I've edited this. It's always good to look into the data. Oh, there's another one; let's also call that "unknown". And there may be more; there are actually three more in this case. That's unfortunately a rather typical thing you need to do in bioinformatics: we still spend far too much time downloading and hand-editing text files.
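As an aside, the quotation-mark trouble described above can also be defused at read time: passing quote = "" to read.table makes R treat quote characters as ordinary text instead of string delimiters, so a stray apostrophe cannot swallow the rest of the file. A tiny invented file for illustration:

```r
# Two-line stand-in file; the second gene name contains a stray apostrophe.
demo <- tempfile(fileext = ".tsv")
writeLines(c("YHR084W\tSTE12", "YLR403W\tSFP1'"), demo)

# quote = "" disables quote processing, so both lines read cleanly.
geneInfo <- read.table(demo, sep = "\t", quote = "", stringsAsFactors = FALSE)
```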
The reason is that it's really, really difficult to automate these things, because the underlying biology is so unpredictable. As I said, some identifiers no longer exist or have been changed; some identifiers suddenly map to more than one gene because new things were discovered; standard names and systematic names don't match; identifiers get retracted. So there's a lot of difficult stuff going on. The databases are trying to improve their identifiers, but very often we simply have to go in and look at text files by hand. Now, we have this pretty file. You might have it too, but I'll just put it on GitHub to make sure we're all on the same page. Okay, so the file is here, genenames.txt.tsv. That's not very fortunately named; let me rename it. [Question] How do I prevent cat from truncating the output? For me, at least, the gene list is truncated. Oh, okay, that's due to the print options: RStudio prevents very long output from being printed to the console, because sometimes, by mistake, we start printing gigabyte-sized objects, so it stops after 1000 entries. You can change that in the global options. What I would do, however, is write the information into a text file once it gets that large, rather than displaying it on the console. Okay, so we have this standard gene names TSV. I need to commit the file; after I commit it, I can push the change to the GitHub master, and if you then pull from GitHub, the file should appear in your list of files. What do we do next? Not time for a coffee break yet; I've lost the little note that I made. We have a coffee break at 10:30? Awesome. And lunch at 12:30 and another coffee break at three. Perfect. What do we do next? We want to use this file to annotate the genes that we have in our dataset and that we find from PCA.
So when we look at one of the points plotted there, the information that we have is a row number. So what do we need to do to match that up and make the gene information usable? Do we need to turn the file that we've just downloaded into a data frame? Well, first we should read it into R. This is a tab-separated value file, so I think read.table, or isn't there a read.tsv? There isn't, but we can use read.csv: read.csv reads comma-separated values, but the separator is just a parameter, and we can re-specify it as a tab. So let's try this. The file name was, I should have used the shorter file name, yeast standard gene names dot tsv; the separator is tab, and header is FALSE. Anything else? Okay, this worked without error. Do we have factors now? Absolutely right, everything is factors. So we read it again, this time with stringsAsFactors set to FALSE. Perfect, and now they're all character columns. Okay, we have two relevant objects now. One is this gene info, and the one we already have is cho genes. cho genes has the systematic names that we pulled out of the original table, in the order of the data elements that we're working with. So head of cho genes gives us this, and gene info gives us that. What do you notice? Excellent. First of all, these are different strings: one is all uppercase and one is mixed case. And other than that? Right, this is just a vector of characters, and this is a data frame with three columns. Anything else? What about the order of the gene names? Looks like they're in the same order. Right, it looks like the same order, but that is of course deceptive: it doesn't have to be in the same order, so you need to be very, very careful. Okay, so your task now is to figure out what we would need to do to change our gene info data frame so that we know all of the genes are in the same order as in the original table.
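A minimal sketch of the read step as described, assuming a tab-separated file with no header (the file contents are a stand-in, not the real workshop data):

```r
# Write a stand-in for the downloaded gene-names file:
# three tab-separated columns, no header line.
tf <- tempfile(fileext = ".tsv")
writeLines(c("YAL003W\tEFB1\ttranslation elongation factor",
             "YAL005C\tSSA1\tHSP70 family chaperone"),
           tf)

# read.csv with the separator re-specified as a tab;
# stringsAsFactors = FALSE keeps the columns as characters.
geneInfo <- read.csv(tf, sep = "\t", header = FALSE,
                     stringsAsFactors = FALSE)
str(geneInfo)   # three character columns, V1..V3
```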
Right, so the order of the genes here has to be the same as the order of the gene names in there. Try it. And while you do that, maybe work on copies of these data sets: call them cho genes one and gene info one, and make copies, so that when you destroy things you still have the originals. Try to figure out a way to do that. If you're done, show us the blue post-it, and if you have trouble doing it, show us the red post-it. ... Yes, exactly: the matching would be simple if it were case-insensitive, but the comparison here is case-sensitive, and some of the strings we work with differ in case. So, from yesterday's session, there was identical(), which checks whether two vectors are identical to each other. If you check these two, they're not identical to each other. [The following exchange, working through subsetting approaches with individual participants, is largely inaudible; the discussion touches on the subset() function, logical conditions on a column, and bracket indexing as alternatives.]
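The identical() check can be sketched with toy vectors standing in for cho genes and the first column of gene info (the values are assumptions, not the real data): the two fail the test until both case and order are normalized.

```r
# Hypothetical stand-ins: same genes, different case and order.
choGenes  <- c("Yal001c", "Yal002w", "Yal003w")
infoNames <- c("YAL001C", "YAL003W", "YAL002W")

identical(choGenes, infoNames)             # FALSE: case and order differ

# toupper() normalizes the case; sort() normalizes the order,
# so this only tests that the same set of names is present.
identical(sort(toupper(choGenes)),
          sort(infoNames))                 # TRUE
```

Note that sorting is only for the comparison; as discussed below, sorting the actual tables would destroy the row correspondence we need.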
[Inaudible: one-on-one discussion with participants working through subsetting approaches to the matching exercise; the recording did not capture this exchange clearly.]
[More inaudible one-on-one discussion while participants finish the exercise.] Okay. So, I've written up some of the problems we've identified with these data sets that you'll have to keep in mind when doing this kind of thing. One thing, as you've all realized, is that the gene names are mixed case in cho genes and all uppercase in gene info. We need to take care of that, and the toupper() function will help us normalize the mixed-case ones so that we can actually compare them.
Another problem is that there might be names that are duplicated, and to find those we can use the duplicated() function. duplicated() returns TRUE or FALSE for each element, depending on whether that element has appeared before in the data set. And if you apply which() to a vector of TRUEs and FALSEs, you get the positions of the TRUE elements. So which of duplicated of cho genes shows us a number of genes that appear more than once in the cho genes table, and which of duplicated of gene info shows us there are none. So in gene info they're all unique, but in cho genes they're not. Why would they not all be unique? How does that even make sense? Why do we have measurements more than once? Right, that can be intentional, because we want controls on the microarray. And remember, these are microarrays: what you measure on a microarray are spots, so there can be different spots, i.e. different microarray probes, mapping to the same gene, and that's probably the situation here. We could then actually average over the replicates. But you simply have to be aware that some of the gene names appear more than once, and handle that. And then there's another problem: there might be names in cho genes that are not in gene info, and we can test that. If we take toupper of cho genes and ask whether they are in gene info, we again get a vector of TRUEs and FALSEs, and applying which() to that shows us that elements 18 and 117 of cho genes have no corresponding match in the gene info table. So all of these are small little problems, and the reason it's important to realize this is that R has a fantastic way to merge tables: all we would potentially need to do is merge the two tables on the common identifier column. But this number of small problems, which we often see in real-world data, makes that very difficult.
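The duplicated(), which(), and %in% steps can be sketched with small stand-in vectors (the probe list here is invented, including one deliberately fake identifier):

```r
# One gene measured by two probes, plus a fake ID with no annotation.
probes <- c("YAL001C", "YAL002W", "YAL001C", "YXX999X")
known  <- c("YAL001C", "YAL002W")

# duplicated() flags elements that appeared earlier in the vector;
# which() converts the logical vector into positions.
which(duplicated(probes))        # 3: the second YAL001C

# %in% tests membership; negating it finds the unannotated probes.
which(!(probes %in% known))      # 4: YXX999X has no match
```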
You have to be absolutely sure what merge() does, what can go wrong, and how you recover from that, because any of these issues might prevent an automated procedure from giving you exactly the correct data. So there are automatic ways to do these things, but I would implore you to do it the very slow, pedestrian way, where you have absolute control over every single step. In real code I might use merge(), but only after checking that it actually works correctly by writing a loop and doing it by hand. This is the kind of thing where things go wrong in subtle ways: small numbers of data values get shifted or dropped, which is hard to catch. It seems correct, it isn't, and it leads to incorrect results in the end. So, should I type out a solution that does it slowly, step by step, or would you like to try a few more things? Did anybody solve it? Anastasia, what did you do? That would work. Doing it all at once with subsetting is probably very useful if the data frame is very large; I'll type out a solution with a for loop instead here. One caveat: if you sort the genes first, you're changing the order, and then you can't really use them in the way that we're planning to, because the rows will no longer be equivalent. In principle, my subsetting approach would be to give a vector of names that match, in exactly the right order and including all of the duplicates, and pull out the corresponding rows from the other table, taking care of course that two of them don't actually exist; that would preserve the order. But let's do the very pedestrian thing, and the first question I ask myself is: what do I even want? What should my result look like? What do you think? Is it a vector, a data frame, a number, a plot? It's a data frame. So let's make a data frame. Let's call it standard genes; it's a data frame, and there should be a column for the systematic name, which is cho genes.
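For contrast, here is a sketch of what merge() does with toy data (object and column names are assumptions): by default it silently drops rows without a match and re-sorts the result, which is exactly the subtle behavior that makes the pedestrian loop attractive.

```r
# Toy expression table and toy annotation table, joined on "sys".
a <- data.frame(sys  = c("YAL001C", "YBR084C", "YAL002W"),
                expr = c(1.2, 0.7, 2.1),
                stringsAsFactors = FALSE)
b <- data.frame(sys = c("YAL002W", "YAL001C"),
                std = c("VPS8", "TFC3"),
                stringsAsFactors = FALSE)

m  <- merge(a, b, by = "sys")                 # YBR084C silently disappears
m2 <- merge(a, b, by = "sys", all.x = TRUE)   # kept, but std becomes NA
nrow(m)    # 2
nrow(m2)   # 3
```

Note also that both results come back sorted by the key column, not in the original row order of `a`.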
And there should be a column for the standard name, which is initially empty. And there should be a column for, what's it even called? Well, whatever, it's the gene name; let's call it simply name. And stringsAsFactors is FALSE. So now I have these. I imported the systematic names from cho genes, so I'm guaranteed to have all of them, in the same order as I had them originally, and that's one of the requirements for this data frame. And I have a column called std and a column called name that simply contain empty strings. Now I can write a loop, iterate over every single value of cho genes, and do whatever I need to do. So my loop looks like this: for i in 1 to length of cho genes. I could also have used 1 to nrow of standard genes; that should give me the same iteration. Okay. So the systematic ID is cho genes at i. Oh, actually, I did something that I shouldn't have done: I should use toupper() here, and solve that problem right away. There we go. Actually, let's not use cho genes at all; I now have all the information I need in my data frame in the first place. So: for i in 1 to nrow of standard genes, the systematic ID is standard genes, dollar, sys, at i. Okay. Next I want to populate the corresponding standard identifier, so I try to find it. There are many different ways; I can use grep(), for example: grep for the systematic ID in column one of gene info. So there are two possible outcomes now: either I find it, or I don't. Oh wait, I don't actually have the name yet: grep() gives me the position where it's found, so in order to fetch the name I still need to subset with that position. If the ID exists, grep() gives me the correct position, then I extract it, and then we'll see what happens. So let's check whether that actually does what it needs to do. Let's assign 4 to i. This gives us the gene name YAR007C. Oh yes, I hadn't assigned it yet. And it tells us: I find this in row four. And applying this as a subset gives me the correct string.
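The lookup step just described can be checked in isolation. This sketch uses the YAR007C/RFA1 pair from the session, with the surrounding table invented:

```r
# A tiny stand-in for the gene info table (column names assumed).
geneInfo <- data.frame(sys = c("YAL001C", "YAR007C"),
                       std = c("TFC3", "RFA1"),
                       stringsAsFactors = FALSE)

# grep() returns the row position(s) where the pattern matches...
pos <- grep("YAR007C", geneInfo$sys)   # 2

# ...and that position then subsets the column we actually want.
geneInfo$std[pos]                      # "RFA1"
```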
But that's not the standard ID, right? I need column two: RFA1. Okay. Now, what happens if it doesn't exist? How do I handle that case? We found that rows 18 and 117 don't match, so let's set i to 18 and try it out. The ID is YBR084C, and if we grep for that, we get integer(0), i.e. an integer vector of length zero. What happens if I apply that as a subset? I get a result of length zero. Am I allowed to assign that to a variable? Yes. What is the standard ID at that point? Well, it's null, or so it seems; we'll come back to that. So it doesn't break at that point, but I still need to deal with that fact. So I need a condition: if is.null of the standard ID, then I do something. What's a possible fallback behavior? Not skipping the row, because then we'd have empty strings, and we didn't want empty strings. So we'll put in something that makes sense, and what makes sense here is to just use the systematic ID, because we don't know better. So in the case that we don't find it, there's no error, but we can detect the empty result and then simply take the systematic ID instead of the standard ID that we thought we would find. Okay. The next thing is that we want the current name, and we can pull that in the same way, from column three of gene info. So we do the same thing there, with the same kind of usable fallback behavior. And finally we have these two new values, and we need to put them into our data frame: standard genes, dollar, std at i gets the standard ID, and standard genes, dollar, name at i gets the name. So after this for loop is done, our little data frame is populated with the information that we need. That is one way to do it. But there's one thing I don't like about it: I am repeating code here.
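What a failed grep() actually returns can be checked directly (toy values based on the IDs above). It foreshadows the subtlety below: the empty result is length zero but not NULL, so the robust emptiness test is on length.

```r
std <- c("TFC3", "RFA1")

# A pattern with no match yields integer(0), a length-zero vector.
pos <- grep("YBR084C", c("YAL001C", "YAR007C"))
length(pos)        # 0
is.null(pos)       # FALSE: integer(0) is not NULL

# Subsetting with it yields a length-zero result, also not NULL.
length(std[pos])   # 0
is.null(std[pos])  # FALSE
```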
Now, if I ever decide that it's better to match in a different way, for example by working with case-insensitive matches or taking care of other things, I have to be really, really careful to do the same thing in all of the places where I use it. There's a principle when you write code called DRY: Don't Repeat Yourself. What I'm doing here is repeating code. So a better way to write this is to pull out the repeated code, the part that's repeated is this grep here, and instead of repeating it, do it once, assign it, and then use the assigned variable. So much better is to write something like this: instead of repeating the grep operation once for the standard ID and once for the gene name, I do it only once, assign it to the variable index, and reuse index in both places. And I get an error. It says: replacement has length zero. Which is weird, because I tested for that. And it happens at i equal to 18, so let's see what happened. Set i to 18 and go through this explicitly. The ID is YBR084C. Index is integer(0), so the standard ID is a character vector of length zero. And is.null of the standard ID is FALSE: a character vector of length zero is not NULL, so my test never fired. The fix is to test length of the standard ID equals zero instead of is.null. Okay, now it works, and we have our systematic names, our standard names, and the gene names. So this is really what I would consider very pedestrian code. There are many, many ways to make it much faster and more efficient, but probably anything we did to it would make it less explicit and easier to get wrong in subtle ways. It's a bit of a balance: as you get more advanced and more secure in your R programming, you won't write like this anymore. But if you're new to this, really take it slow. Take it step by step, and make sure that you can validate every single step of your analysis.
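Putting it all together, here is a self-contained sketch of the final, DRY version of the loop, with toy data and assumed object names (choGenes, geneInfo, stdGenes stand in for the workshop objects):

```r
# Toy stand-ins: mixed-case systematic names, one with no annotation.
choGenes <- c("Yal001c", "Ybr084c", "Yal002w")
geneInfo <- data.frame(sys  = c("YAL001C", "YAL002W"),
                       std  = c("TFC3", "VPS8"),
                       name = c("TFIIIC subunit", "vacuolar sorting protein"),
                       stringsAsFactors = FALSE)

# Result frame: systematic names in the original order, normalized
# to uppercase once, with empty std and name columns to fill in.
stdGenes <- data.frame(sys  = toupper(choGenes),
                       std  = "",
                       name = "",
                       stringsAsFactors = FALSE)

for (i in 1:nrow(stdGenes)) {
  sysID <- stdGenes$sys[i]
  index <- grep(sysID, geneInfo$sys)   # look up the match ONCE (DRY)
  if (length(index) == 0) {            # length test, NOT is.null()
    stdGenes$std[i]  <- sysID          # fallback: reuse the systematic ID
    stdGenes$name[i] <- sysID
  } else {
    stdGenes$std[i]  <- geneInfo$std[index]
    stdGenes$name[i] <- geneInfo$name[index]
  }
}
stdGenes$std   # "TFC3" "YBR084C" "VPS8"
```

The single `index` lookup means any future change to the matching strategy happens in exactly one place.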