So I'd like to pick up today where we left off yesterday. Presumably, though, you've closed RStudio in the meantime and you'll need to recreate the project. I've also made some changes to the code that we're working with and that we're going on to work with. So as a first step, I'd simply like you to reload your project as it is. Since I've uploaded new versions of the master script and some additional files, I want to make sure that any changes you might have made in your files don't get overwritten. So if you haven't made any changes in your files, ignore what comes next. If you have made changes — say your changed version of our R-EDA-DimensionReduction script is somewhere — open your file here and save it under a slightly different name, for example "R-EDA-DimensionReduction-myNotes", simply so that you have a second, different version that's updated. And the next thing — I hope this works; I can't test it because my system is always up to date — is simply to say "Pull Branches" and see what happens. That should update the script and add the new files, and hopefully it's not going to delete your old files. So, what happened? If you look at your files, do you have a file called R-EDA-DimensionReduction, Part 2? Okay, excellent. If you go into your file of task solutions, does line 29 start with "A more principled solution"? Excellent. Does your file R-EDA-DimensionReduction-myNotes.R still exist? Excellent. Good. Okay. So I think we're in good shape to continue. Now, if I look at my environment, of course none of the data objects that I created yesterday exist, so the first thing I need to do is to recreate them. The script for Part 2 contains all the code from Part 1 that we have been working through. The first step is to recreate the data: simply select that block and execute it.
And after that, all the objects that we've been working with should be recreated. I didn't think of this earlier, but if this is a script that you work on at home and that you frequently have to abandon, how would you avoid having to recreate data files? Right. You can basically — like in a video game — install save points from time to time, where you simply save what you've done so far and later reload it. Then you can start off from that point and assume that everything previous has worked. Okay. Now, in your task solutions I've put in a more principled solution to plot the crabs data from the principal component analysis with specific shapes, colors, and sizes, and I'll briefly go over how we can do this, because it's a good view of the way the plot() function works and can be customized. Okay, there's a question. So some of you get that, but not all of you? No — because they had opened a saved project; their project was saved before they pulled mine, so the state of their session is not current. So you can simply recreate the project, calling it "day two", and that should give you a current version of all the files. Okay. So maybe I'll just run through — there is one line missing here from Part 1, and you'll need to complete that in your version of Part 2 where we recreate the data. The line that's missing is to do the principal component analysis on columns 4 to 8 of the crabs data set and assign that to the variable pcaCrabs. That should recreate all the data. Okay, so I have to find this function, and I'll just briefly show you what it does. It plots for me blue males and females and orange males and females, with specific plotting symbols. And being able to do something like this, as I mentioned yesterday, is rather important to allow you to visually identify and find structure within your data.
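The missing line and a minimal "save point" can be sketched together like this (a sketch only — the object name pcaCrabs follows the lecture, and the .rds file name is my own choice):

```r
library(MASS)                     # provides the crabs data set

# The line missing from Part 1: PCA on the five morphological columns
pcaCrabs <- prcomp(crabs[, 4:8])

# A "save point": write the computed object to disk ...
saveRDS(pcaCrabs, file = "pcaCrabs.rds")

# ... and in a later session, reload it instead of recomputing
pcaCrabs <- readRDS("pcaCrabs.rds")
```

For several objects at once, save() and load() work the same way; saveRDS() has the advantage that you choose the variable name when you read the object back.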
And to do this, we use the fact that the normal plot() function, for most of its parameters, accepts not just single values but vectors. And if it gets a vector, it applies that vector element by element to all the items that it plots. So the plot function I call here gets x and y — I'm plotting the principal components, PC2 on the x-axis and PC3 on the y-axis, and this has 200 rows. But now I'm also giving it a vector of 200 numbers that contains my plotting characters. And I'm also giving it a vector of 200 characters that contains my colors. And I'm also giving it a vector of 200 numbers that defines the size of the plotting symbols. And here's how the function does this. The function has parameters x, y, species, sex and h — this is how I drive the function. When I actually run it, I say something like plotCrabs(pcaCrabs$x[, 2], pcaCrabs$x[, 3], ...), which are my x and y, then crabs[, 1], which is my species, crabs[, 2], which is the sex, and pcaCrabs$x[, 1], which is a stand-in for the h: it is basically the variation along the first principal component. You don't need to do this, but just to illustrate, I'll simply assign these to the variables x, y, species, sex and h, so that I can show you how the internals of the function work. So species is a vector of factors, blue and orange; sex is a vector of factors, male and female; and they're arranged in the way that we have them in the data. Good. So what the function now does is it takes my species, sex and h, and converts them into something that I can use to define colors, shapes, and sizes of plotting characters. For my colors, I first simply define the four colors that I want to use in a vector, because what I want to do now is take my vectors of factors, compute something with them, and use that to pick out the colors from my color set. So if my factors compute to a two, I would like to pick out the second color.
If I compute something to be four, I would like to pick out the fourth color. Then all I need to do is to transform my factors, of which I have two different kinds, to give me the numbers 1, 2, 3 and 4, corresponding to the four combinations of species and sex, in the way that they appear in my data set. So as you've seen, species is a set of factors, B and O. But underlyingly, factors are actually coded as integers. So factors are stored as integers, and there's a table that says what each of these integers corresponds to: the integer 1 corresponds to B, and the integer 2 corresponds to O. And this is the way that factors are actually used — underlyingly they're just numbers, but the numbers have a particular meaning, and that particular meaning is remembered when the factor is defined. So when I convert these factors into integers, I get a list of ones and twos. And similarly, if I convert the factors in my sex vector into integers, I also get ones and twos. Now all I need to do is to combine my ones and twos in a way that gives me a unique number among 1, 2, 3 and 4 for each combination. I can't do that by just adding them together, because that would only give me numbers in the range 2 to 4 — either 2, 3, or 4 — and 1 plus 2 and 2 plus 1, of course, are the same, so they would overlap. So I need to change them somehow. The way I change them is to subtract 1 and multiply by 2 for the species, and take the sex numbers as they are. So the species list of ones and twos gets converted into zeros and twos, and my other list of ones and twos just stays as it is. Adding 1 or 2 to either 0 or 2, I get 1, 2, 3, or 4. So the result of this little arithmetic expression is a list of twos and ones and fours and threes. And these are integers that I can use to pick out elements from my color set, with the same kind of subsetting that we've done many times before.
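As a small self-contained sketch of this arithmetic (the factor values and the four hex colors here are made-up stand-ins for the ones in the script):

```r
# Stand-ins for crabs[, 1] (species) and crabs[, 2] (sex)
species <- factor(c("B", "B", "O", "O"))
sex     <- factor(c("M", "F", "M", "F"))

# species contributes 0 or 2, sex contributes 1 or 2:
# every (species, sex) combination maps to a unique index 1..4
idx <- (as.integer(species) - 1) * 2 + as.integer(sex)
idx                          # 2 1 4 3

# Four colors, one per combination (hypothetical hex values)
colorSet   <- c("#4499FF", "#AACCFF", "#FF8800", "#FFCC88")
crabColors <- colorSet[idx]  # one color per observation
```

Note that the factor levels are sorted alphabetically by default (B before O, F before M), which is what fixes which combination gets which index.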
So whenever a 2 appears in this vector, I will get this blueish color here; whenever a 1 appears, I get this somewhat lighter blue; whenever a 4, I get this reddish orange color; and whenever a 3, this somewhat lighter orange. So this expression here — well, I should define it first — produces a vector like this. These are my colors, one specific to every single element that I'm going to be plotting, 200 of them overall. So if I now pass 200 colors to my plot instruction, I get a specific color for every single element. Let's try that: plot(x, y, col = crabColors). So every point has the color that I've assigned it here. Okay, next task. This vector of colors now defines specifically how I want to color my points. And you can imagine, whatever your problem is, you can create information that is encoded in color and apply it to all manner of different plotting problems. Okay, now I create a vector of plotting characters, and I use the foreground/background-fill plotting characters 21 and 24. We looked at these yesterday: one is a circle, one is a triangle. And I work in basically the same way: I define a vector with two elements, 24 and 21, and I can simply pick the plotting characters out of it, since as.integer(sex) already is a vector of ones and twos and will thus retrieve plotting characters that correspond to and encode the sex of the crabs that we're looking at. If you astutely observed what we did yesterday, we relied on the fact that the data are sorted: 50 observations of one kind, then another 50 observations of a second kind, and so on. Of course that's a brittle assumption — normally your data is not as nicely organized, it's all over the place. But the way we're doing it now doesn't rely on that sorting anymore; the data could be completely shuffled.
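That indexing step on its own (sex here is a made-up stand-in for crabs[, 2]; which symbol encodes which sex in the actual script may differ):

```r
# Fill-style plotting symbols: 24 is a triangle, 21 is a circle
pchSet <- c(24, 21)

# as.integer(sex) is already a vector of 1s and 2s, so it can index
# the symbol set directly - no reliance on sorted data
sex     <- factor(c("M", "F", "F", "M"))
crabPch <- pchSet[as.integer(sex)]
crabPch                      # 21 24 24 21
```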
We're taking the information directly from the actual factor values that are encoded in each row. Okay, we assign that to the crab plotting characters, and say pch = crabPch. And now we have the circles and the triangles in the way that we want them. Next, we'd like to use a scale vector, and for this we use the plotting parameter cex. This stands for "character expansion" and defines the size with which a character is plotted. If I plot all my characters here with a cex of 0.7, they get really small. If I plot them with a cex of 4, they get pretty large. So this is usable. If they get even larger, there's too much overlap; if they get even smaller, I can't see them anymore. So it takes a little bit of fidgeting and experimenting: what is the right size for the particular plot that you're trying to produce? I assign the numbers that I use here to variables so I can change them easily in case I want to. And now all I need to do is to take the numbers that I find in the first principal component — these numbers here, that's what their distribution looks like, somewhere from minus 30 to 30 — and linearly scale them into the range 0.5 to 4. And that's this slightly convoluted expression. The way the scaling and normalizing works is: first, I subtract the minimum of h from h. The result is that everything is shifted so that the smallest value is 0 — same shape, but the smallest value is now 0. Then I divide by the difference between the maximum and the minimum of my range. If I do that, my histogram changes so that it goes from 0 to 1. Now after I've scaled my numbers to lie between 0 and 1 — this is always the key step in rescaling data: same shape, but from 0 to 1 — I can work with it and rescale it into anything I want.
For example, I multiply it by the difference between my maximum and minimum scale — in this case 3.5 — which rescales the entire thing to go from 0 to 3.5, and then I add smin to it, which shifts all my numbers to go from 0.5 to 4, and that's exactly what I want. Executing this gives me this vector of scales, and that's what it looks like, going from 0.5 to 4. This now encodes the information about the scale that I want to use. Now let's try this plot again, with my colors as crabColors. Actually, that's not quite what I wanted: if I define col for plotting characters 21 and 24, I define the color of the borders, but I'd like the borders to remain black. The fill color is called bg for these plotting characters. The plotting character is encoded in crabPch, and cex — the character expansion — is now in crabCex. Thank you. What's my mistake here? Did everybody notice my typo? I get an error in the expression col == "black": comparison is possible only for atomic and list types. What do you do? What's the mistake? Does everybody see the error? The error is that I've used two equals characters. This is not an assignment; this is a comparison. So I need to put this back, and I get the expected result. Once again: if I say a = 5, I assign 5 to my variable a — that's the assignment equals. If I write a == 5, that's a logical comparison: it asks whether this is true or not, and the result is TRUE or FALSE. Because this error is frequently made, by convention R uses the <- assignment operator. So do that in your scripts: use the <- assignment operator. By convention, however, function parameters require the use of a single equals sign — so this is a bit of an inconsistency. Whenever we specify parameters, we need an equals sign for the assignment of the parameter value; everywhere else we use the <- assignment operator, which I think is quite readable and avoids exactly this kind of problem.
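The rescaling steps, assembled into one runnable sketch (variable names follow the lecture; the PCA line is assumed to be prcomp(crabs[, 4:8]) as before):

```r
library(MASS)                         # crabs data set
pcaCrabs <- prcomp(crabs[, 4:8])

h    <- pcaCrabs$x[, 1]               # first principal component scores
smin <- 0.5                           # smallest cex we want
smax <- 4                             # largest cex we want

# Key step: shift and divide so the values lie in [0, 1] ...
hNorm <- (h - min(h)) / (max(h) - min(h))

# ... then stretch by (smax - smin) and shift by smin into [0.5, 4]
crabCex <- hNorm * (smax - smin) + smin

range(crabCex)                        # 0.5 4
```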
Now, actually no — because if I make it even bigger, I lose so much screen real estate that it becomes difficult to demonstrate the plots; they get too squished. So this is about all I can do. You have all the text on your own computer. I've tried this out; I think yesterday, when we went through two stages of zooming, it became very difficult. Perhaps, after our forced evacuation, if you have difficulty reading the screen and you're sitting in the back row, you can swap with somebody who sits a little closer to the screen and set it up that way. Question? Yes? "When you were putting it together, you subtracted the minimum from the values to shift them to 0 — and you clearly chose 0.5 as the minimum scale." Yes. "Because if you had used 0, you wouldn't see anything." Exactly — if I used an expansion factor of 0, nothing gets shown. "So why didn't you just use something other than 0 for the minimum to begin with, like a value less than the minimum?" The minimum was 0.5. Oh, okay, I see — I could do that simply by shifting the minimum over here. Now, the way that I do this normalization and scaling — and I think it's a sane way to do it — is that I always take my data and first linearly transform it into the range 0 to 1, and after that I rescale it to something else. That's always my first step. You can save a few keystrokes and immediately roll the difference and the offset into one expression — you're absolutely right. Okay. So I'd like you to simply select and execute this function so that you have it available in your environment; we'll be using it for plotting a little later on. It should now appear in the functions here, plotCrabs, and work in this way. Okay, to recap, because this is important: often when we plot, we would like to identify which elements correspond to which points.
We can use encoding of categorical values, and encoding of continuous or other discrete variables, in color, in shape, and in size, in order to make multi-dimensional relationships visible. There are other strategies that people use in plots for even higher-dimensional data. Two things come to mind. First, you can plot pie charts in place of these points, and these pie charts can then have proportions of values that correspond to the five or six or seven data dimensions that you might be interested in. I think this is a good approach, but of course it won't work very well if the number of data points is as large as it is here. The other thing that I remember for displaying multi-dimensional data, or encoding this kind of information on a plot, is faces — so-called Chernoff faces. Faces have many attributes: whether the eyes are open or closed, whether the face is wide or narrow, whether it's smiling or something else. So basically you can encode the expression of a face with a number of parameters, and our visual system is extremely well tuned to identify expressions in faces, or little shifts and changes in faces. What this approach does is draw little cartoon faces while varying these characteristics, and you can then identify the little cartoon faces that have similar expressions relatively easily, even on a cluttered plot, and thus make inferences about how similar particular data points are. I've never actually used this in my work; I've always found it a neat, orthogonal, creative way to work with data. In the sense of playing with data and trying to become intuitive about it, it may be a good approach — here I'm just mentioning it. Good. Now finally, let's go back to working with our data. We left off having calculated a principal component analysis of a normalized version of gene expression data from the yeast cell-cycle gene expression set.
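If you'd like to try the faces idea yourself, one implementation I'm aware of is the faces() function in the aplpack package (this package is an assumption on my part — it is not part of the course script):

```r
# Chernoff-style faces: each data column drives one facial feature.
# install.packages("aplpack") if you don't have it.
library(aplpack)
library(MASS)

# One cartoon face per crab, from the five morphological measurements
faces(crabs[1:12, 4:8])
```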
Now, if you remember, in the data file that we originally read, column 1 contained systematic gene names, and I'd like to pull those out again with this expression here and call them showGenes, to have the gene names available. I'd like to use them later on to identify a little better what these genes actually do. I've created a little function, listGenes, which uses cat() and sprintf() to print these genes in a more principled way. If I simply extract them and use head() I get them all in a line; with this function here I get the actual index listed and the actual gene name listed. The way this works: cat() is similar to print(), but whereas print() gives me the indices of the object it is printing, and automatically a newline, cat() doesn't do that. cat() simply puts characters on the screen and does nothing else. So if I want control over what I print to the screen, I use cat(); if I simply want some default formatted listing of the contents of my data objects, I use print(). And what I cat here is the result of the sprintf() function. I define a string — this is basically the template for what I want printed out. My string contains %d, which is a placeholder for an integer, and it contains %s, which is a placeholder for a string. The d and the s are separated by a colon and a blank, and the string is terminated with a backslash-n, which is a newline. So all in all, this gives me a number and a string on each and every line, one pair per line. The number it should print is i, because I'm going through a loop here where I generate i's from one to the length of the vector that I give it, and the string it prints is the corresponding name pulled out of my showGenes vector. So what this function essentially does: differently from head(showGenes), the equivalent for these genes would be listGenes(1:6). That's what that looks like: number one is this, number two is this, number six is this.
I like this because I can give it arbitrary numbers — but what did I do wrong now? This function is wrong. Nice! Great error. Okay, do you see what it does? I just told you what I think it should do, but do you see what it actually does? I give it a sample of five numbers taken from 1 to 237 — these numbers here — and it prints my genes one, two, three, four, five. Why? Right: I'm ignoring the contents of x. I'm just using my loop variable here: for i in 1 to length(x), print i and showGenes[i]. So, regardless of what x contains, that's simply the numbers one, two, three, four, five and so on. There's a logical error in this function: it's syntactically correct, but it doesn't do the right thing. How do I need to fix it? What do I need to do instead? — Use showGenes of x of i. — Absolutely correct. So instead of i, I need to use x[i]. That means for the first loop iteration I get the first element of my argument; for the second loop iteration I get the second element, and so on — and not just the literal numbers one, two, three, four, five. So this doesn't give me the number one, it gives me my first argument; not the number two, but my second argument. Do you understand the difference? The principle is basically the same one we used to pull specific colors and specific plotting shapes out of the sets of colors and shapes that we defined. We simply use a number to look into another object. So the number — one, two, five — isn't used as is; it is used to point into an object and retrieve information. We call that an index; this is why we often use i for it. Alright, now if you make that change — replace my erroneous code with x[i], here and here — then the function should work as expected. If we give it five random numbers now, it gives me the row numbers and the actual contents of the genes. Okay, so this is what listGenes is good for.
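With the fix in place, the corrected function might look like this (the gene names here are made-up placeholders for the contents of showGenes):

```r
# Placeholder for the vector of systematic names from column 1
showGenes <- c("YAL001C", "YAL002W", "YAL003W", "YBL023C", "YBR202W")

# Print "index: name", one pair per line, for the requested rows
listGenes <- function(x) {
    for (i in 1:length(x)) {
        cat(sprintf("%d: %s\n", x[i], showGenes[x[i]]))
    }
}

listGenes(c(2, 5))
# 2: YAL002W
# 5: YBR202W
```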
Alright, let's use this to study similar genes — genes with similar expression profiles. At first I prepare my plot here, first plotting an empty frame for my principal component analysis along dimension one and dimension two, which spreads out my numbers here, and then I use the function text() to print, not the points, but the actual row numbers, because now I want to identify things. I want to understand what particular genes are. PCA tells me that they are similar. So this is our first view into real biology happening — we're not just doing maths now, we have biology. We say: well, these seem to have similar expression profiles. Similar expression profiles often mean co-regulation; co-regulation usually means that they have some functional similarity. So now we can start discovering things in our data: functionally related genes. From this we could perhaps see that the genes in rows 193 and 142 and 235 and 98 are kind of similar. So we select from this plot, and we define a selection — I call this selection 1 — and I re-plot it with a different color so my points get overwritten, and I confirm that what I've selected is what I wanted, that I didn't make a typo in the selection. Then I use the function matplot(), and this is a very, very elegant way to look at high-dimensional data. This is also called a parallel coordinates plot: every x-axis position corresponds to a column, and thus the parallel coordinates plot allows me to look at how the values — my expression values in this case — vary over time, and I get these expression profiles. Now there's a quirk in that. The parallel coordinates plot requires, I believe, the data to be in column format — no, it's the other way around: matplot() requires each profile to be a column, and what I get from my data frame is each gene as a row. So before matplot() can use this, I have to transpose the data.
A transpose of a matrix basically means that I take all my rows and change them into columns, and vice versa. So the transpose of what I have here, with seven rows and 17 columns, is a data object with 17 rows and seven columns, and this is what matplot() requires as input: all the time points in one column, for one object. So if you do this with matplot() and it doesn't look right — it doesn't have the right number of points, the points are all over the place, and none of it makes sense — then you might need to transpose your matrix. That is the reason for this little t() in here. So, used in this way, we have these points and we have our genes — these are the genes we selected. Now, do they all look like that? Is that actually something significant? I think these genes look like they have very similar expression profiles. But is that something we'd expect simply from the structure of our data set? It might well be: after all, we've selected genes that are either known or suspected to be involved somewhere in the cell cycle, and maybe everything in the cell cycle is at a resting state when we start, and then is highly expressed as the first cell cycle commences. So maybe this is nothing significant. To find that out, we can select ten random genes for comparison. We generate a random selection in which we sample ten numbers from 1 to 237, using the sample() function. So these are my randomly chosen genes, and my random genes look like this. What does this tell you? This is a computational control experiment that we've just done. How did that control turn out? What does the control tell you? It's just kind of random — and so this means that if I pick random genes from my data set, their expression profiles are random. They're not structured, they're kind of all over the place. There's a lot of variability in my actual source data, and that's what I find if I just pick random genes.
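The control can be sketched end-to-end like this (the expression matrix here is simulated stand-in data; in the lecture, the real yeast matrix with 237 rows takes its place):

```r
set.seed(112)

# Simulated stand-in: 237 genes (rows) x 17 time points (columns)
dat <- matrix(rnorm(237 * 17), nrow = 237, ncol = 17)

# Negative control: ten random row indices
randSel <- sample(1:237, 10)

# matplot() draws one line per COLUMN, so the selected rows
# must be transposed before plotting
matplot(t(dat[randSel, ]), type = "l", lty = 1,
        xlab = "time point", ylab = "expression")
```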
So that's not what the previous plot of my PCA-selected genes looked like. The PCA tells me they're similar, and indeed their expression profiles are obviously more similar than what I would expect from random chance. So my negative control told me that my actual experiment seems to carry significant information. Okay, so let's make a different plot and select some different genes. Now my first selection is in sky blue here, if I use principal components 1 and 3 — it's spread out differently. So these genes cluster only in principal components 1 and 2; they don't cluster as well in 1 and 3. But I could look at a different set that seems to cluster here, and then ask: how are these different? For example, I can take this cluster here: 20, 227, 143, 128, 192, 75. The sky blue one is my selection 1 — we'll get to that task in a moment — and this is my selection 2. So how are they different? These are expression profiles, and you all have a biological background. I show you these two sets of expression profiles and ask: we have two groups of genes here, and we're looking at expression profiles during a cell cycle — how are these gene expressions different? They generally follow the same pattern, but the brown ones are expressed early: they start early, and they're downregulated early. Exactly. So the time course of the expression of this brown, firebrick-colored cluster of genes is earlier: these expression levels rise earlier than those. What does that mean? Well, from the way we normally interpret experiments like this, we have the same thing happening, but one thing happens earlier than the other. We might hypothesize that there are actually genetic regulatory interactions — that some master gene switches on these genes, and then maybe some of these genes are themselves transcription factors which cause the blue group of genes to be expressed.
So by finding genes that have different time points of expression, we can then start trying to unravel the results of the experiment in terms of genetic regulatory elements, or genetic regulatory networks. Now, do we know if that's actually the case? We have our list of genes here and our list of genes here. Do you know if any of this group are transcription factors that influence any of that group? I certainly don't. I don't know if there's anyone in the group who recognizes any of these systematic names. So what I'd like to do is to take you through a task where I have no idea how it will go. I haven't done this, it's not in the script, but I think it's a question that comes up at this point, and it's highly relevant for the kind of work that you do. I have these systematic names here; I'd like to know the common names. For example, I wonder whether any of these in group 2 are cyclins, and I'm too lazy to go into Wikipedia and look up every single one. So I would like to take my list of gene names from before and convert it into common names, as a tool to be better able to annotate the data that I use here. I don't know how to do that — I hope we'll find a way. Let's take this as a collaborative task: you shout out ideas, and we'll try to see what they do and how we can make them work. I think SGD calls these the "standard names". Let's have a look. So, to recapitulate what we have: we have a vector of 237 strings which contain the systematic yeast names. What's our goal? What do we think we should be having? Find descriptive names for each gene — yeah. So our goal is to create a vector of 237 standard names. In bioinformatics, ID conversion is the bane of our existence. It's really, really a nuisance. However, it goes without saying that without tools for ID conversion, in bioinformatics you can achieve nothing.
Your collaborators will come up with their own quirky ways to name and label things, and if you want to find something in the NCBI databases or the EBI databases, you will need the right identifier, and it will be painful to look it up. So I think this is an excellent opportunity to look into this topic. Okay, I need suggestions. What do your searches produce? Ben, what would you do next? Google what? Systematic names. So we'll Google for how to convert them. Did you find something? Yeah? Okay. So we're going to find a source of data. You say there's a table on UniProt — a file called yeast.txt? Is that one of the links here, or how did you find that table? Okay. Is that useful? It's cross-linked with SwissProt, so you can work with both and bring up all the information about the protein. Okay. So what we're going to do next is look at this table: these, as far as I can tell, are the standard names, and these would be the systematic names. So we have corresponding systematic and standard names in the same table. Let's see if the one we were looking for is actually there. It is. And what we get here is three synonyms. That looks like something we can grab quickly, take apart, and quickly get the information that we need. So this here is a data source. What's the next thing we need to do? Lauren, what would you do next? Right now it's a text file in my web browser. Thankfully it's not HTML-formatted, it's actually a plain text file, so I don't need to scrape the text information out; I don't need to get rid of images or anything. Right, so first I do what? Save it: Save Page As, format: text. And I put this into our project folder. Is that the right location? Yep, this is where I keep my stuff: yeast.txt. Great. Now what?
So in order to make my life easier, I'll just throw out all the header information first, because I don't want that — we'll have enough headache as it is — and we already know what part of the data we need. The goal is to find this data here. There seem to be varying amounts of spaces in between, and what we'll also need to be careful of is that there are spaces after the semicolons over here. That can be a nuisance. So is there any way you think we ought to process this so it becomes easier? What's easy to read? What have we done before? What kind of text file? Yeah — that's one thing I could do: I could replace all these spaces here with tabs or with commas. Right, but I have variable numbers of white spaces here. Do I go CSV or TSV? Should we try that? Let's go for CSV as a canonical format that's familiar to us. Okay, the one thing that I'm most worried about is the spaces after the semicolons here. If I simply break on spaces, then I would separate these two things here — but I shouldn't be doing that, because that would put my information into the wrong column. So the first thing I'd like to do is replace every occurrence of a semicolon followed by a space with simply a semicolon. So: Edit → Replace, find "; " and replace with ";". There we go. Okay, so it goes to the bottom and shows me — oh, there's additional information here, which I also want to delete. So that seems reasonable. Now I have runs of blanks here, of different lengths, and I'd really just want a single separator in every line. Can I do that? I don't know if I should be showing you all these terrible hacks, but in practice this is very often how we get stuff done quickly. Of course the ideal way would be to read every single line, one by one, and then write a proper regular expression which finds within each line the data that I'm actually looking for, and then uses that. That is the proper way to do it.
But since we're doing this only once, you know, doing it in some quick and dirty way that we can do while we shut most of our brain off is probably a viable practical alternative. So what I end up doing in these cases often is: I simply replace two blanks with one blank, and do that again, and do that again, and drink some coffee and scratch my tummy and do it over again, and at some point there's nothing more to replace. So now I have a data set where every single line has a single blank, and the elements with multiple entries here are separated with a semicolon. Now I can do one more thing and actually replace all my single blanks with a comma. Then File, Save As: yeast.csv. Nice. Okay, what do I do next? I have a CSV file; now I need to read my CSV file. Yeah, if I do that, then the whole thing will be dumped on my screen, so I'll also assign it. Let's just call it: csvRaw is read.csv() of my file name, yeast.csv. And what about header? header equals FALSE, because I deleted that. Anything else?
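As a sketch of the read.csv() call being assembled here — the inline text stands in for yeast.csv, and both records are invented so the snippet is self-contained:

```r
# Inline stand-in for yeast.csv: two invented records, no header,
# synonyms joined by ";" in column 1, systematic name in column 2.
csvRaw <- read.csv(text = "cln3;dab1;fun10,yal040c
xyz1,yzz999w",
                   header = FALSE,            # we deleted the header lines
                   stringsAsFactors = FALSE)  # keep strings as strings
csvRaw$V2   # "yal040c" "yzz999w"
```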
The separator: comma is correct by default for read.csv, so I could also have put the comma separator in explicitly here. At that point, quote and dec and fill and so on are irrelevant. stringsAsFactors is FALSE; I want my strings as strings. So let's see what that does. It does something; it's syntactically correct. Let's see whether it's also meaningful. Column one is my systematic gene name, column two — sorry: column one is my standard gene name, column two is my systematic gene name — except that in some cases my column one still needs to be processed a little bit. We can have opinions on which one of the several synonyms that are listed here we should be using, but for simplicity we'll just use the first one; if that comes up, we'll get to it in a moment. So far so good. Now let's test this: can we actually pull something out? So if we have a string like "YAL040C", how do we find the corresponding row in our csvRaw? How do we actually find our genes? This is one of those pesky filtering expressions that you probably need at this point. So I think we want the row in our csvRaw table that contains this string in column two. Can somebody tell me what to type? Lauren, what would you type? What would you type to find the row that contains this string in column two, right now? Everybody clear on what that does? So this expression here hopefully gives me 6,727 FALSEs and one TRUE, and this one TRUE should point to the correct row. And why does it not? Because it's not in there. And why is it not in there? Why does it not find my string? We've been searching all rows in that column to ask whether that string is present, but apparently my string "YAL040C" is not in my table here. Uppercase, exactly. So do you notice this difference? It does make a difference. Okay, so since they're all lowercase where we come from, we have to say something like toupper() before we do our comparison. Okay, now we have this problem here that our column one contains more than one item. So basically I tell it to use V1 here just
for aesthetics — because I've been harping on that a lot, not using magic numbers but proper column identifiers — I'll change this to V2; it really doesn't matter in this case. So I'm assigning this, so now I have this string. What I want, however, is just this part. Any ideas? How do I process this string to get only the first part? Now, everybody who was here for the first day should be waving their hands all over the place, because this is something we've done lots. We use the function... we use the function... then what function would you use? Yeah, yeah: we split the string. We use strsplit(), strsplit() on a semicolon. So we get a list with one element, which contains a vector of three elements, of which the first element is what we need. Now we can use a shorthand for that: the first element of the list, and the first element of that vector. I could also have used unlist() and then pulled out the first element of my vector, or something similar. So this is this one element: cln3. Okay, now let's make this into a function — let's make it into a function that loops over whatever input we give it, which is probably formatted like our show genes, i.e.
a vector of individual systematic gene names, and for each systematic gene it retrieves the standard name. So let's make a small subset of our show genes vector, simply to test out what we're doing here. So, something like: test is show genes, sampled from 1 to 237, for say five elements. So this vector here — we'll use that as a test. Now I need something that goes over these elements one by one and then gives me that result, and I think all the components are there, but we still need to pull it together. So we'll write a little loop that loops over test and does what we need to do. What should that loop look like? for — usually we call it i — in something-something, do something. So the first question is: what should this be? What is the range that my i should go over? What's the first value that i should have? The first value that we use in our loop is 1. Then we should use the length of whatever we want to loop over. In this case it's 5, but in the next case it's not going to be 5, it's going to be something else. What's the length of our vector? Well, it's just length(), and in this case it's length() of our vector test. Is that correct? Yes, this is 5. So i should now be generating the numbers between 1 and 5. Does it? Everything works as expected. But now I need to use that i to access my gene names. So how do I do that? We've done something very similar previously for our list genes function. How do I specify the first element of my vector test, if I want only the first element? What do I have to write? test is a vector of 5 elements; to get the first element, I type test, square bracket, 1; the second element, test[2]; the fifth element, test[5]. So since I don't want to type this by hand, I use my variable i here. Every iteration of the loop then gives me one of these elements: my first, my second, my third, my fourth, my fifth element. So in this way, inside my loop, for every iteration of my loop, I successively have one of the gene names that I'm interested in available, and that is test[i]. So let's assign it: to make
things very explicit, we assign it to the variable sys. Now we need to use the value of sys to extract from our table whatever we have up here. So presumably we do something very similar to this here: we just take the string and say, standard is csvRaw, where we select from column two toupper() of what we've just had here, and use that to pull out the element in column one. Actually, to confirm that this is true, I'll print the column two value as well. Oh, sorry, I meant to print this, not assign it. Okay, so I get my 5 elements: for the first one I get this string here, for the second one I get this string, for the third one I get mcm3 — there's only one — med28, mrps12, and so on. Nice, okay. So let's assign this on every iteration and then do this here: strsplit() on standard. Right now I'm again just printing standard gene names. Now, what do I need to do next? I'm only printing them, so instead of printing them I still need to assign them somewhere and store them in an output vector. In order to do that, I will make an output vector that I initialize to simply empty elements. I'll call this vector show standard, and I will initialize it to the same number of elements that I have in my test vector. There are essentially three expressions to build empty vectors like that: one is character(), which builds an empty character vector; one is integer() or numeric(); and one is logical(). So this is simply an empty vector: five instances of an empty string. But since this vector now exists, I can assign values into the slots of this vector. So instead of printing this, I shall assign it, and I shall assign what I've read for the i-th position of my vector test into the i-th position of my vector show standard. So I get this vector. Now, we could write a function to do this, but of course, since we only want to do this once, we can also do it directly, and simply, instead of test, we use show genes. Same thing: so show genes here, show genes here. At first this is a vector of 237 empty strings, and now I iterate over
this, position by position, munged as required — and something didn't work; I did something wrong. I also, of course, need to replace this here. Okay, so: for the length of the elements of show genes, build me an empty vector; then, for the first to the last of these genes, I pick out the string that I'm interested in and call it sys; I convert that to uppercase; I pull out the element in the first column that has this name and assign that to standard; I split it up along semicolons, if any; I take the first element of that split and the first element of that result, and assign that into the same position of my vector show standard. So what did I get? Subscript out of bounds. This typically means that something went wrong with my assumptions about the input, whatever the assumptions are. Here it probably means I'm trying to access the first element where there is no first element. So this is in line 18 of my output... yeah, csvRaw, that's what I was looking for... oh no, this is not line 18 of csvRaw, this is line 18 of show genes — this one here — and where does it appear? This is our little quick-and-dirty conversion: trying out things, playing with things, breaking things, looking at the error, figuring out what the error means, trying things differently, until we arrive at the solution that we need. Now we have show genes for the systematic gene names and show standard for the standard gene names, and we can go back. Remember we had this list genes function, which prints the genes but doesn't yet print the standard gene names. So we'll modify our function: because we have this new vector here, we'll add a second placeholder for a string, and we'll add a second little expression that pulls out the string which goes into this placeholder position from the list of gene standard names. Then we redefine the function. I'll leave that up for a moment. Do you see how to modify this? I add a second placeholder, and to get something into that spot for the second placeholder, I add a second expression — this one here. It
works just in the same way as the one before, but it pulls out the character strings from our standard names, and now we get this. I'm not going to make the change that aligns them all nicely into columns — you could use tabs for that... actually, tabs are so simple that I'll just change it: instead of the blanks here, we'll just write backslash-t; backslash-t is a tab. There are also ways to print these things within the sprintf() command in a left-adjusted way, in a right-adjusted way, with a fixed width; to print floating-point numbers with a fixed number of digits preceding the decimal point and digits following the decimal point; to print integers with or without leading zeros, right- or left-aligned, whatever. This is all very easy to do; you'll just need to look up the documentation of the sprintf() formatting commands, and this is a very flexible way to quickly produce tabular output of any sort. Okay, so this was our little digression on being able to identify genes more easily. And the reason we were looking at this, if you remember, was that we were wondering what these red genes do, and whether any of them are famous transcription factors. I'm not a yeast geneticist at all, but CDC45 — isn't that one of these cyclins? Anybody remember their third-year molecular genetics course? Cell cycle? None of this was known when I took my molecular genetics courses. So I believe — yes, that's actually true — I think this is one of the key regulators of the cell cycle; that might be wrong. I think RAD27 is also a checkpoint gene. So of course, within the cell cycle, all manner of checkpoint regulation gets activated or not. Okay, so this confirms our suspicion that our simple, completely mathematical and otherwise biology-agnostic procedure of principal component analysis is able to discover significant biology in our data set — though not quite off the bat. The first thing we get from the principal component analysis is a scree plot, i.e.
the variances of the individual components — which ones are more or less important — and that really doesn't tell us a whole lot at all. But we can get there: we can use this to distinguish genes, and we can use it to identify, and ultimately to cluster, genes that are similar to each other. Now, one thing we briefly discussed yesterday was the question: well, what are these principal components? How do we interpret them? And we had to conclude: that's hard to say. Any number of dimensions along which we look at our data — and in this case there are 17 dimensions, i.e. the 17 different time points — any number of dimensions may contribute in some way to the first, second, and third principal components, and it's not easy to say what the first principal component looks at. So what I'm going to do is discuss with you an alternative approach, with things that you can interpret, and this is kind of similar to things we would do in factor analysis. Factor analysis is an alternative approach to dimension reduction where you come in with preconceived notions of what the factors are and pull them out of your high-dimensional data set. In principle we can, for example, define three simple models and then ask: to what degree do our observed gene expression profiles along time correspond to any of these models? These simple models, for example, might be a half sine or cosine wave, a single sine wave, and two sine waves, kind of looking like this — and then we simply take that as an idealized model and calculate Pearson correlation coefficients. That will tell us to what degree an observed profile correlates with an idealized profile that I define to be a model. So let's generate such profiles. A profile has to have 17 components, 17 elements, so I initialize this simply with the numbers 1 to 17, and then I calculate cosine and sine functions that are appropriately scaled over 1 to 17 to give me the following three profiles. None of this works if I don't do all of it. Here we go. So these are
three models that we can apply. The one that presumably corresponds most to our actually observed data is the green one, because the cell-cycle dataset contains two cell divisions, so we would expect a cyclically varying gene to be expressed with something that correlates like this here. The red curve could approximately correspond to genes whose expression simply dies down over the time course of the experiment — controls perhaps, things that don't like the experimental conditions. The black one would be genes that are highly expressed in the first cycle and then die out over the second cycle, something like that. Any number of biological hypotheses — as long as you can express it as 17 numbers that correspond to some profile, it can be used as a model like this. So you can come up with anything that you're interested in. Now, instead of calculating the coefficients of the principal components, we simply create empty vectors at first and then calculate the coefficient of correlation of each profile with the dataset: we pull out the individual elements for all the data here and then calculate the coefficient of correlation. So this is what that looks like for a plot of one-sine-wave against two-sine-wave correlations. It looks very similar to our principal component plot, at least in its general and overall structure, but it has the advantage that I can interpret the dimensions. So if I look at this here: the genes over on this side have a high coefficient of correlation with the double sine wave, and the genes over on this side have a high negative correlation with my double sine wave. So these are correlated with my double sine wave, these are anti-correlated with my double sine wave. The genes here are correlated with my full sine wave; the genes down here are anti-correlated with the full one. Or, if I use the half-wave correlation: these genes here are the ones that kind of die out over the course of the experiment, and all these here don't die out, and
these genes down here seem to increase in expression over the course of my experiment — anti-correlated with my model. So the advantage of looking at models like that is that you can immediately interpret the plots that you get, which you can't necessarily do in principal component analysis. Principal component analysis only tells you: we have genes here that are similar — and you don't yet know why they're similar. Genes that I find in these model-based correlations I can immediately interpret. So if I have a biological hypothesis, I can test that biological hypothesis, and then I can start, for example, selecting groups of genes that correspond to clusters here, here, and up here — and I'm just outlining where these are. So the first selection is this cluster of genes, the second selection is this cluster of genes, and the third selection is this cluster of genes. Now, if I plot their actual time courses, I get this. The first set of genes, predictably, is correlated with my first model: these are the genes that cyclically vary along the time course of the experiment. These are anti-correlated: they are low when it begins and highly expressed in the intermediate stages. And these ones here — remember, they came from the middle of the correlation with the sine wave, i.e.
they're not well correlated with it, but they are high in their correlation with the single cosine function. So this means they kind of drop off over time: these genes here are the ones that we simply lose over the time course of the experiment. So these are cell-cycle-specific genes here, and maybe something like environmental responders over there. So for exploratory data analysis this is a complementary and in some ways orthogonal approach. What we did before was simply discover structure in our data where we had no idea what that structure might be, leaving it to our biological imagination and interpretation to note that similar genes probably have some biological reason, some biological relationship, for why they are similar. And in this case we come in with a hypothesis of what our data would look like under ideal conditions, and we confirm or disprove our hypothesis that genes like this actually exist. Now I'd like to show you an alternative approach which produces kind of similar results and, once again, is completely agnostic about what the meaning of our data is, but which can be applied to high-dimensional data and has surprisingly good results in discovering structure in high-dimensional data: this is t-SNE, t-distributed stochastic neighbor embedding. This is an approach co-developed by Geoff Hinton at U of T's department of computer science, and I've put in the Wikipedia link for you, where it's briefly described; it's not too mathematical, so certainly do have a look. Essentially what happens here is that we have a high-dimensional distribution of points, and we come up with a probability function that describes this high-dimensional distribution. Then we map this high-dimensional distribution into a two-dimensional, or perhaps three-dimensional, space — this is our plotting space. So we take a high-dimensional distribution, but we'd like to plot the points in 2D space, because that's something we can easily see. So we come up with a probability distribution that describes the points in 2D space, and we try to
adjust it such that the difference between the two probability distributions is small. This kind of mapping has the property that things that are close together in high-dimensional space stay close together in low-dimensional space — i.e., if points are similar because they lie in a similar region of our 17-dimensional space, then they will also end up being similar in our two-dimensional map plot. Now, once again, the details of what this mapping function is can probably not be interpreted in any reasonable way, so, like principal component analysis, it's a way to simply discover structure, but not to interpret that structure. But I'll just show you how it works. There's a package from CRAN, tsne, which you can download and install; it has documentation, and it works in the following way. To run tsne, you give it data to work on in column format — in the crabs example this is simply our crabs data. There's something called perplexity, which essentially, in this context, means how many neighbors we want a data point to consider, so whether the plot is going to be compact or not. There's a maximum number of iterations: this is a heuristic algorithm; it basically tries many, many different configurations to improve its solution as it goes along. By default the maximum number of iterations is 5,000, and these iterations are split into epochs: in every epoch it tries 100 times to improve its result, and then it briefly prints its output and reconsiders what it should be doing. As part of this we can define a callback function. A callback function, in this case, is simply a function that is used to display the data — it doesn't need to only display the data, it could also dump the results or do some calculations or whatever — but in this case the callback function that I define is simply a plot function, and the plot function is a function of the data in two dimensions, x column 1 and x column 2, where I use my plot crabs function that I've previously written. So this is
where this function comes in very, very handily. So, rolling this together: we define this function here, give the plot window some space, set the seed to a number to make it reproducible, and run tsne. Now the function is trying to optimize the layout in such a way that similar points in the high-dimensional space remain similar points in two-dimensional space. Note that tsne knows nothing about orange and blue — this is not part of its data. If this segregation appears in 2D space, it is only because it was also present in the high-dimensional space. And note that tsne is not bothered at all by the confounding factor of age that PCA had such trouble with initially. Basically, running with default parameters, out of the box, it was able to find the structure in the data and display it in a meaningful way — and given that PCA out of the box actually didn't work on the crabs data, I think this is a very nice result. Remember: since our principal components were dominated by the confounding factor of age, or overall size of the crabs, plotting PC1 against PC2 gave us nothing, so we basically needed some way to plot different principal components to figure out what we actually wanted. tsne doesn't need any of that; it's just able to retain the structure and display it here. So for association discovery, or data discovery, or similarity discovery in high-dimensional data, this seems to be a surprisingly effective and simple way to structure your data and to display it. One of the applications I've mentioned is as a flow cytometry exploration tool, because there you have high-dimensional data as well — and, for example, for the clustering structure we've seen in the graft-versus-host disease experiment, tsne would probably be able to pick out these clusters and identify them. So we can apply the same thing to the cell-cycle data, and now I'll do this with a slightly different function. I pick two selections — not the same selections which I had before, but simply selections for which
correlation 2 is greater than 0.77, and correlation 2 is less than minus 0.6. There's a bit of overlap here, so I get a small set of positive correlations and a larger set of negative correlations. And, well, these are the positive and the negative correlations here — because when I run my tsne function now, I'm going to write a plot function which specifically plots these data points into the cloud of points which we otherwise have available. So in this way I have a little bit of prior knowledge: my positive correlations are these red data points, and my negative correlations with the model are these blue data points. Now, if tsne works in the way that I naively would expect, I would expect these data points to be somehow cohesive — they're cohesive in 17-dimensional space, they're very similar — and I would expect them to remain clustered together in two-dimensional space. Well, maybe not completely, because after all some of these are only kind of similar; maybe the cohesion is stronger for the red ones than for the blue ones. But we'll see what happens. So my callback function here, my plot function, is: plot an empty frame; add the points for the red and the blue selection — or firebrick and turquoise; is that the right way to pronounce it, turquoise? Turquoise — and use the text function to plot the numbers for all the other rows. So this is my plot function; again, it's a customized plot function for every instance. So there we go: the individual epochs take a little longer because there's more data to process, and it's exploring space and moving around. Okay, there we are, 2,000 iterations. So this is partially true, at least for the positive correlations: my agnostic mapping directly from the high-dimensional space into the two-dimensional space keeps those points well together. For the ones with the negative correlation that's not as clear; perhaps to say that they are all related is somewhat of an overstatement. There appear to be other points
that are more similar to these three, and this one, 112, seems to be a bit of an outlier to begin with. But now I can start generating hypotheses about my data and testing them by looking at the individual expression profiles which I find around here. For example, if I pick a cluster of genes in the vicinity of the originally selected values — so this is my selection 3, which is around these here — in 17-dimensional space there seems to be a cluster up here, and this cluster seems to contain genes like number 4, 192, 200 and so on. I can select them out here and then ask: well, is that true? Is the expression really, in fact, similar? Lo and behold, there it is. So in this way I can also use tSNE, like I used the other methods before, to discover correlations and similarities — other correlations and similarities. Now, if we take this list of genes here and plot it back on our model correlation plot, where would they lie? Presumably they would cluster very closely with this core of four genes which I defined here. There they are. So these were my four original genes that I defined in the tSNE plot, and these are the similar genes that tSNE identifies as being similar in 17-dimensional space, here plotted on the correlations with the half sine or half cosine wave and the two sine waves. So you see that there's a certain convergence: in whatever way I look at the data, if I do this correctly, similar expression profiles, or similar data sets, are going to end up with similar analyses. And I've shown you essentially three different strategies of exploratory data analysis to be able to discover this. At home, for self-study, and in order to consolidate what we've done here, you could ask: where does this two-cycle sine wave function itself end up if we put it into the data set? The way that you would do that is: you would add a row to the original expression data that contains the values for your model, and then the index of that row is something that you could use for
plotting its coordinates as a suitably colored circle, and in this way you can track the relationship of your models with this high-dimensional to low-dimensional mapping and learn more about how your data set is structured. And that essentially concludes our little foray into dimension reduction, and I think it also sets up, in an ideal way, the kind of questions that we would like to ask with clustering. You see that these data sets have an underlying structure, and now one question is: well, can we take an algorithm that simply finds all of these substructures and puts our data into categories, where nothing is left unannotated? So we're not just cherry-picking here; we're trying to identify all of the structure of our data, all of the subpopulations that we potentially find. This is the topic of clustering, which we'll turn to in a moment. Let me just mention one thing here very briefly, because it's very useful: we can use sample() for that. So let's say we have a vector that contains the first six lowercase letters of the alphabet: a, b, c, d, e and f. Now if we simply use sample() on that, by default we get the same number of elements as that vector, in a random order, without replacement. So using sample() in this way actually gives us a permutation of our data. And if we apply such a permutation to every single time point, then there should be no remaining correlations. The positive control is then to ask: well, if we do that, what does our correlation analysis do? Will we find something — or lots of things — that look like they have correlations? If we do, that might put our whole approach into question, and we might then have to say: well, you know, even if the time points are permuted, basically due to the nature of our analysis, and perhaps the number of tests that we're doing and the potential for statistical fluctuations, we will get things that look like they have information — we will get things that look like they systematically decay, or that
they have an exponential curve or a sine wave or some such thing underlying them, while in fact it is just the result of a limited number of possibilities, and random chance has to pick among that limited number. So that's always a good and useful control to do: using permuted data to make sure that the results that you get are actually different from randomness. We'll take a 5-minute break.
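The permutation control just described, as a small sketch; the matrix is random stand-in data, and permuting each column (time point) independently is one way to implement "no remaining correlations":

```r
set.seed(112358)                 # make the shuffle reproducible
v <- letters[1:6]                # "a" "b" "c" "d" "e" "f"
p <- sample(v)                   # same six letters, random order,
                                 # without replacement: a permutation
identical(sort(p), v)            # TRUE: only the order changed

# For a genes-by-timepoints matrix, permute each column (time point)
# independently to destroy any real time-course structure:
mat <- matrix(rnorm(6 * 4), nrow = 6)   # stand-in expression data
permuted <- apply(mat, 2, sample)       # still 6 x 4; columns shuffled
```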