And this sounds very, very esoteric, dimension reduction. I mean, are we talking about parallel universes and aliens here, and going into the fifth dimension and coming back from it? But it turns out that it's not that esoteric. It's actually a lot of fun and a really useful way to look at data. Because at first, when we looked at our descriptive statistics, we looked at one-dimensional analyses of the data, simply numbers that characterize features of our data. Then we started looking at two-dimensional representations in our scatter plots: how is one variable correlated with another variable? Now, of course, as you know, what we're really trying to do in our high-throughput analysis is to work with many, many more dimensions of data, because there's a lot of information there that will help us answer our questions. And that makes our problem special. There is a richness of data and a richness of information dimensions that also makes it very hard to find solutions here. If you want to become a little more aware of that problem, you might look up the curse of dimensionality, which is often invoked for such data sets. The upshot is that as your data sets gain more and more dimensions, more and more of the data lies near the edges of the space, and less and less in the center where you would expect it to be. It becomes harder and harder to find things in high-dimensional data. So one of the goals of this kind of analysis is to take very high-dimensional data and make it lower-dimensional, or, put differently, to analyze the informational dimensions that our data actually has. The principal tool here is principal component analysis as a method of dimension reduction. 
And I'd like to explain to you what that means: what principal components are, how we can use them for analysis, how we can perform principal component analysis on data, how we can use the results to identify interesting aspects or interesting views on our high-dimensional data, and also talk a little bit about alternatives, such as projection or embedding methods. So what's principal component analysis? The goal is to transform a number of variables that are possibly correlated into a smaller number of variables that are no longer correlated. We call these variables principal components. If you think about our height and weight example, there's correlation there. And actually, once we've characterized the correlation, a lot of the information becomes redundant, because to a good degree we can predict weight as soon as we know height. So if we have a model of how to look at one, maybe the model is sufficient and we don't actually need to deal with the differences anymore. And this would allow us to reduce the number of variables that we actually have to work with and analyze. So in an example: if we have correlated data, what this really means is that one variable can predict the value of another variable. For example, if I do essentially the same thing I did before, where I make a random set of x values and construct y values by taking random values and adding a multiple of the x values, I get a correlated data set like this. Much of the information is not intrinsic to this random variable y1; it's modulated by the additional information from x1. So I have information spilled over from x1 into my data set, and if I plot it, it looks like this. 
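The construction described above can be sketched in a few lines of R; the seed and the coefficients here are arbitrary choices for illustration, not the values used in the lecture.

```r
# Build a correlated data set: y1 is random noise plus a multiple of x1,
# so information from x1 "spills over" into y1.
set.seed(112358)                        # arbitrary seed for reproducibility
x1 <- rnorm(100)                        # random x values
y1 <- rnorm(100, sd = 0.3) + 0.7 * x1   # y values inherit information from x1
plot(x1, y1)                            # an elongated, correlated point cloud
cor(x1, y1)                             # clearly positive correlation
```

The stronger the multiple of x1 relative to the noise, the tighter the cloud hugs its diagonal.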
So what I'm trying to do now in principal component analysis is to take these correlated data clouds, turn them around along all of their dimensions, and ask: which orientation would I need to look from to see the largest variance in my data? That's the most interesting one. Where the largest variance happens, this particular view on the data gives me the most information. So the first so-called principal component is essentially the projection of the data onto a single dimension that has as high a variance as possible. The next principal component subtracts that variance, i.e. by positioning itself orthogonal, at a right angle, to the first principal component, and then asking: given that I'm at a right angle to that, how can I turn to get the second-best view on my data? Once that is done, a third orthogonal principal component is defined: given the first two, how can we define this third component, again in the best possible way, to explain most of the remaining variance? So essentially, if we have a point cloud like this, the first principal component would lie along the axis where the variance is maximal, and the second component would be orthogonal to it. So we can restrict our analysis to a projection along that vector. If we plot this here: whereas the distributions of x1 and y1 individually would be these histograms, after principal component analysis the rotated view on the data has a histogram that includes much more of the variability, while the blue component describes much less of the variability of the data. So in practice, this is very simple. We take a multidimensional data set and feed it into the R function prcomp() or princomp(). These are two, well, I wouldn't say competing, but two principal component functions that are very similar in use and quite similar in output. They use somewhat different names for their output, which you simply have to be aware of when you use them. 
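On the synthetic correlated cloud from before (same arbitrary seed and coefficients), the rotation can be seen directly in the prcomp() output: PC1 lies along the long axis of the cloud and carries almost all of the variance.

```r
# Reconstruct the correlated cloud and run PCA on it.
set.seed(112358)
x1 <- rnorm(100)
y1 <- rnorm(100, sd = 0.3) + 0.7 * x1
pca <- prcomp(cbind(x1, y1))
summary(pca)    # PC1 accounts for the bulk of the variance
pca$rotation    # each column is the direction of one principal component
```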
I think in my code I use prcomp, not princomp, or the other way around, I can never remember. It doesn't really matter, as long as you know how to use the output. So let's look at an example. The code that I showed you in these slides is in here; you can reproduce it at some point. We are going to look at a real-world data set. And this real-world data set... oh, I assume that all of you have loaded this file. Has anybody not loaded this project? The new one, dimensionreduction.rproject, which is available at this URL. Okay, maybe I can just introduce the data set that we're about to work with in the first place. This is a data set which comes with one of the standard libraries for data analysis in R: the crabs data. So some of our colleagues make really, really smart life decisions. While we study bugs and critters and various things, they study crabs, beautiful colored crabs, which of course they need to collect. And in order to collect them, unfortunately they have to go scuba diving in the tropical waters off Western Australia. All on expenses, of course, and without even taking a vacation. I find this fantastic. I don't even know what they do with the crabs afterwards. Probably the crabs wander into a pot, and they're probably very tasty too. Imagine doing that with your standard lab fare, E. coli. No. Anyway, these scientists study crabs, and this data set comprises five different morphological measurements on 50 crabs each of two color forms and both sexes. I don't even know how you distinguish the two in a crab, but probably they look slightly different. Or actually, this is something that presumably we're going to find out: whether it's possible to distinguish crab sex simply based on crab shape. So the morphological measurements are frontal lobe size, measured across here; rear width, probably measured across here; carapace length, down here; and carapace width, across here. 
And body depth, which is like using a pair of calipers and measuring the thickness of the crab. So basically these five measurements characterize the shape, or some aspects of the shape, of the carapace of these crabs. And now the question is: given only these five measurements, can you actually distinguish what these crabs are? Is there any measurement, or combination of measurements, that would correlate with whether it is an orange subspecies or a blue subspecies, as in this image here, or whether it's male or female? To load and attach this data set, we load the library and load the data set, and this is what that looks like. The columns in our data set, which is already a data frame, are species, B and O; sex, M and F; index, one to 50 within each group; and then frontal lobe, rear width, carapace length, carapace width and body depth. As you see, the first two are factors, the third column is an integer, and of course the other columns are simply numeric. Okay, now let's plot them and see if there's anything useful here. We can plot columns five and six against column four and annotate them simply by using the factors, using a plotting character which corresponds to the numeric value of these factors. So if the factor is a blue male or an orange female, they get different plotting characters, and in this way, instead of having all points look the same, we can start distinguishing the points we've annotated and looking for generalities. This is actually quite important when you are looking at data: to actually be able to identify and annotate the data points. The way I build these factors is I take the element from the first column and the element of the second column and I separate them with a period and declare that as a factor. The different levels would then be B.F for blue female, B.M for blue male, O.F and O.M, where these levels correspond to one, two, three and four. 
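Loading the data and building the combined factor described above can be sketched like this (crabs ships with the MASS package; the plotted columns follow the lecture).

```r
library(MASS)                 # the crabs data comes with MASS
data(crabs)
str(crabs)                    # sp, sex, index, then FL, RW, CL, CW, BD

# Combine species and sex into one four-level factor, separated by a period.
grp <- as.factor(paste(crabs$sp, crabs$sex, sep = "."))
levels(grp)                   # "B.F" "B.M" "O.F" "O.M" -> numeric codes 1..4

# Plot two measurements, annotating each point by its category code.
plot(crabs[, 4], crabs[, 5], pch = as.numeric(grp))
```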
Our plotting characters one, two, three and four have a meaning: for R, one is a circle, two is a triangle, three is a plus sign, four is an X, and so on. So here's what that plot looks like. Oh, actually I should start here. In this plot here, I plot all the rows of the measurement columns against each other. When I send a data set like this to the plot function, I get this kind of trellis plot. This trellis plot gives me correlations, basically, for all of the different combinations. So this is the column frontal lobe correlated with rear width, correlated with carapace length, correlated with carapace width, and so on. What we see here is that none of these fall apart into nice subpopulations or categories, as we saw with our flow cytometric data. There, in a correlation between CD4 and CD8, we saw a nice distribution of four different populations. We'd expect four different populations here too, simply because we know we have blue and orange males and females, but we don't see that. These data all seem to be highly correlated to begin with. Some of them are super highly correlated, like carapace length and carapace width, and some of them less so, but it's not obvious which combination of these individual dimensions we could use to distinguish our data. So this is a case where linear regression on just two dimensions does not allow us to distinguish what we're interested in regarding our data. Let's instead apply principal component analysis to these five measured dimensions. The result of the principal component analysis is simply gotten by using prcomp on the crabs data, columns four to eight. So let's see what we get here. The first result says we have five dimensions, i.e. we can get at most five principal components once we rotate these into the optimal orientation. 
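The trellis plot and the PCA call described above amount to two lines each: calling plot() on a data frame of numeric columns produces the pairwise scatter-plot matrix, and prcomp() runs on the continuous columns only.

```r
library(MASS)
data(crabs)
plot(crabs[, 4:8])            # trellis plot: all pairwise scatter plots
pca <- prcomp(crabs[, 4:8])   # PCA on the five measurement columns
pca$sdev                      # standard deviations of the five PCs, decreasing
```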
Rotating this into the optimal projection shows that most of the variance is in the first principal component, and that kind of matches what we've seen before in the trellis plot: there is a lot of correlation, and almost everything looks the same between these data points. So this is the most important principal component here. But apparently it's not one that we can easily use to distinguish between the groups. Okay, anyway, summary() gives us the individual values for the principal components, and importantly this row here, the proportion of variance. Our first principal component explains 98% of the variance in our data, and the other principal components contain the remaining small fractions of the rest. Now we can plot this data. The standard plot for a principal component analysis is a so-called biplot. In this biplot I label my points with the factor values I had before. So I believe number one was blue females and number two was blue males, or something like that; it doesn't really matter. What matters is that I can see there's a little bit of information there. The threes and twos can probably be somehow subdivided along this dimension, but there's also a large amount of overlap. There's not one section of this plot, plotting principal component one against principal component two, that would allow me to securely identify anything that is characteristic for the different groups. So this is the default: plotting the first against the second principal component. I can define choices here, which tells the biplot function which principal components to use. By default it is one and two; if I plot choices one and three, things look very much the same. So apparently our most significant principal component does not contain the information that we actually need to distinguish our data well. Let's look instead at what happens if we ignore the first principal component. You had a question. The values of the principal components? 
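The variance summary and the biplot views discussed above look like this in code; choices selects which pair of principal components the biplot shows.

```r
library(MASS)
data(crabs)
pca <- prcomp(crabs[, 4:8])
summary(pca)                    # "Proportion of Variance" row: PC1 is ~0.98
biplot(pca)                     # default view: PC1 against PC2
biplot(pca, choices = c(1, 3))  # PC1 against PC3 - still dominated by PC1
```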
And these are essentially... oh, the one, two, three, four, sorry. These are the factors that correspond to our different categories of crabs. One is B.F, blue females. Two is B.M, blue males. Three is orange females and four is orange males. So for example, we can see there's a lot of overlap between the twos and the threes and so on. So let's have a look at what happens if we plot principal component two against principal component three. That's a completely different picture. In this picture, all of a sudden, we realize that yes, there is something. There's a view under which we can look at our data in which, all of a sudden, the structure in the data becomes apparent once we remove the first principal component. In this view of two against three we are ignoring this super strong, dominating principal component. And all of a sudden we can see the essential information that we're looking for, which allows us, under this particular view, to distinguish the male and female and orange and blue categories of crabs. Can you imagine what that could be? Is there an interpretation of that? In this case there may well be an interpretation. In the general case, the projection for a principal component is a composite of all the dimensions; potentially all dimensions are contributing, and a kind of physical interpretation of what this view comprises may not be possible. So principal components are a good way to partition data, or to section it, or to prepare it for clustering. It's not necessarily interpretable in a strong sense what these principal components correspond to. But in this case it may be. Any ideas? So remember what we were measuring: lengths and widths. And all of these seem to be highly correlated along something which we didn't explicitly account for in our data. You're saying that there seems to be some common measurement that correlates with all of them. 
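The view of PC2 against PC3 is just another choices setting; labeling the points with the category codes (as in the lecture's biplot) might be sketched like this — passing the codes via xlabs is one way to reproduce the numbered labels.

```r
library(MASS)
data(crabs)
pca <- prcomp(crabs[, 4:8])
grp <- as.numeric(as.factor(paste(crabs$sp, crabs$sex, sep = ".")))
# PC2 vs PC3: the dominant PC1 is ignored and the group structure appears.
biplot(pca, choices = c(2, 3), xlabs = as.character(grp))
```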
Some set of them, right. I'm not sure I understood you correctly; repeat that? I can't really say which measurements are associated with which trait, but there is some relationship between them. Right. The physical interpretation of these measurements is usually difficult, but sometimes we can make good guesses about what is going on in this kind of data. So Lauren had a guess. [Audience response, partly inaudible.] In this particular view, yes, but as I said, the principal components are composites of all five measurements at the same time, just rotating them in different ways. You were saying that if the blue and the orange were two different subspecies, and the males and females were two different sizes, they would have a kind of consistent offset in those five measurements? In a sense, but only after we remove the first principal component. This first principal component seems to dominate. I mean, remember this plot here? It's linear correlation all over the place; everything was pretty much linearly correlated. We can't actually see that information unless we remove this linear component. So what could that be? Age? Why? Right, overall body size and age. This is a confounding factor in this measurement. Remember, the crabs were collected off the reef and they all have different ages. If you have a little crab, all of its measurements are going to be small. And if you have a big crab, all of its measurements are going to be large. Now, age, or maybe body weight, is not in this data set, so we're not accounting for that. But it's a confounding factor that influences everything. So principal component analysis does not only allow you to look at more or less important views on your data. In cases like these, it actually allows you to identify the presence of confounding factors, to remove them, and then to keep on working with the data as it is. 
So this here gives us a very nice recipe that we could use to distinguish our categories of animals simply based on measurements. And we would venture to guess that juvenile crabs might be more similar in this plot than older crabs: we would hypothesize that in juvenile crabs the sex distinctions are very small and become more pronounced as the crabs age. So perhaps the remaining overlap here might also be explained in some way by age. So I'd like to have you reproduce this last plot, without the biplot arrows, simply by plotting symbols that correspond to the sex and type of crab: orange and blue circles for females and triangles for males. This doesn't really have anything to do with principal component analysis per se. The challenge here is simply to take a plot like this and plot specific elements from your data in different ways: plot some of them with orange circles and some of them with blue triangles. So that's the task. Let's try to break that down into individual steps. We don't want to use the biplot function here, for a number of reasons; we simply want a normal plot of these values. And we can get these values from, right, from these values here. These are the transformed values: 200 rows and columns one to five, for all of the principal components. So your task is basically to use these values of pca$x and plot them with orange circles and with blue triangles. So what's the first step you need to do? Well, let me propose: first run a PCA on the data and assign the result. Then you probably want to confirm that some part of this result corresponds to the data points that you want to plot, right? So confirm that, whatever you call it, columns two and three of your result contain the values we are looking for, simply by plotting them and seeing whether the plot we get back looks like that. 
Then figure out how to define an orange circle, a blue triangle, et cetera, as a plotting character. [Question about what distinguishes PC1 from PC3.] So remember, this is... imagine I have a data set that's a cloud that looks like this, and it has a high variance in this direction. Now assume that these are the x and y axes along which I've measured. If I want to maximize the variance that I look at, I would rotate my projection in this way, so that most of the variance is captured in this dimension. The parameters that specify this angle of rotation as a function of x and y, this is what corresponds to PC1, right? And now you have to extrapolate that into five dimensions. So we have a five-dimensional data set, a point cloud in five-dimensional space, and we rotate it along five dimensions at the same time until we optimize this projection, and that's what we then call PC1. Does that make more sense? No, PC1 has the highest variance, okay? And PC2 is orthogonal to that and has the second-highest variance. Since it's orthogonal, the variance left over for PC2 is reduced. So basically, in our two-dimensional case, if this is PC1, then PC2 has to be orthogonal; it is this projection. Essentially, think of it as having an oddly shaped point cloud in high dimensions: we're shining a light on it and rotating the point cloud so that the shadow we get is as large as possible, but we can rotate around all of the axes at the same time. After we do that, it's really hard to then say... well, it's not hard; of course, it's mathematically easy to define what amount of rotation is due to each of the five individual dimensions. But to interpret the outcome, it's a little bit of this and a lot of that and more of that other one, and to interpret it in physical terms is no longer really meaningful. And this can be applied to continuous data. 
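The orthogonality just described can be checked directly on the crabs PCA: the columns of the rotation matrix are orthonormal unit vectors, which is exactly why each successive component excludes the variance already captured by the previous ones.

```r
library(MASS)
data(crabs)
pc <- prcomp(crabs[, 4:8])
R <- pc$rotation          # 5 x 5 matrix: each column is one PC's direction
round(t(R) %*% R, 10)     # the identity matrix: the PCs are orthonormal
```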
The categorical data is actually not part of the data which we use for the principal component analysis; the principal component analysis is done only on the continuous data. The categorical data is something that we use to interpret the result after the fact. So essentially the goal is to try to distinguish the four categories of crabs, and in order to do that, we try to find out: what is the systematic difference between these four categories? By simply trying to spread the numbers apart, we try to find out whether we can distinguish populations based on our measurements. After we distinguish populations, we can then go back and interpret this cluster to be the blue males and that cluster to be the orange females, or something. Okay, so to continue with the steps of our task, trying to plot this with more meaningful symbols: the next step, after we've decided how we are going to plot the different subsets, is to simply plot an empty frame of the right size and then use the points function to plot the points that we want, with the colors and shapes that we want, into that empty frame. That's one way to do this. So maybe let's quickly look at the first steps to get you started. Running the PCA: I have my crabs data. How do I run a PCA on it? And what do I run a PCA on anyway? What part of the data frame? On all the numerical data. So this is columns four to eight. Okay, so our crabs have three columns of categorical data and five columns of numerical data that we want to use here. So it should run on crabs, all the rows, and columns four to eight. What do we do with that? Just as before, we use prcomp and we assign the result to some variable; let's call it PC. So far so good, all clear? Run this. We see PC is a list of five elements. The first element of the list contains the standard deviations of the principal components. 
The second contains the rotations, and essentially these rotations define the angles under which I've rotated, or, well, the way in which I've rotated my data set for this analysis. The x element contains the actual transformed components. So I'd like to confirm that PC$x[,2] plotted against PC$x[,3] actually corresponds to the view I had previously, simply by plotting it. Okay, that kind of looks like what we had before; there it was in our biplot: clusters here, clusters there. Except we were using numbers in our first plot, and now we're using simply the plot default, which is circles. Okay, now we'd like to plot these as orange circles and blue circles for males and females, and so on. So how do we plot an orange circle? Here's where your Google skills come in. Or, not to mislead you, maybe you should be Googling for how to plot a filled circle, where we basically make a filled circle and then define its color. Any suggestions? Yes. Okay, pch is the plotting character, and pch = 21 defines a particular type of plotting character, which is a filled circle. Is it? Did anyone read the instructions under ?points? I did. So what do the instructions say? They show all the characters. Okay, so if you do question mark points, you see all of the different plotting characters. Now which one do we need? If Google says pch 21 is a filled circle, it's probably right; it's just that we need to specify the line and the background separately. What we can also use is simply a circle that's filled outright. I think that's pch 19, right? Let's try 19. Okay, but we still need to specify colors. So let's see if that works. With the solid circle, 19, if I define the color, this is what I get: a solid orange. Now, if I use 21, I do get an orange outline, but I didn't specify the background color. So does it say how to specify the background color? That requires a parameter, presumably bg. 
bg for background color. Or let's swap that around, for effect. Okay, so that's the difference between 21 and 19. With shape 21 you can specify the line that goes around it and the background color separately; with shape 19 you get a filled circle. So this gets us partially there. Now, that's orange circles. Blue circles are probably pretty obvious, but what about triangles? How do we do triangles? 17. Is there a version with a separate background too? Yes, it's 24 and 25; well, that one, no. Okay, so let's try 24. This looks nice to me. Okay, so that partially solves our problem. Partially, because now we're plotting all of them, and that's not what we wanted to do; we only wanted to plot them selectively. Something that's helpful here is that the data is actually organized so that all the different categories are grouped together. If we look again at the crabs data, rows one to 50 are all blue males, rows 51 to 100 are all blue females, rows 101 to 150 are all orange males, and rows 151 to 200 are all orange females. So simply by selecting the rows, we can define what we want to plot in this case. If that weren't the case, we'd need to write a little function: either throw this together as factors and then choose from the factors how we want to display each point, or loop through the entire data and, for each row individually, make a decision on how to plot it. Maybe I'm going to put together some sample code, but right now, to solve this task, we are simply going to make our selection of how we're going to plot this based on the row numbers. Okay, so in order to do this, we'll just plot an empty frame of the right size. How do we know what the right size is going to be? Well, we could look at the min and max of each of these values and then define our x and y axis limits. But we can also do something easier and simply plot with type = "n" for none. I hope I remember this correctly. 
Yeah, so this gives me exactly the same plot as before, with exactly the same labels and dimensions and scale and axes, except that nothing actually gets displayed in the frame. And then I can use the function points, which works very much like plot, for my individual rows. So my first set is rows one to 50. And I kind of like the different frames here, so that's what I'll keep. Now, one to 50, what was that? Blue males, right? So what should we use, circles for males and triangles for females, or the other way around? Okay, I already have circles here, so the males will be circular, and these are blues, so that's how I'll plot. That's my blue males. Now we need the blue females, a very similar thing: 51 to 100, and we had 24, right, for the triangles? Here they are. Just to finish this up, we'll use the oranges, and this is 101 to 150. Is that the orange males? I hope so. And then presumably 151 to 200 will be the orange females. So this simple example is supposed to illustrate that when we do complicated plots like this, it's often very useful to identify the individual elements and display them differently, in order to be able to identify what category the individual points belong to. At first this is just a point cloud, and the point cloud looks as if it has particular structure and thus corresponds to particular populations that would be interesting. But given that we can color points and identify them, shape them differently, and size them differently, there are a number of options that allow us to add information dimensions to our plot and thus make the data even more transparent. So I think I have a sample solution here where I also scale the symbols. And this is actually interesting. Okay, so this is scaled by the first principal component. 
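The whole recipe walked through above, empty frame first, then one points() call per row range, might be put together like this (using the row ordering and the pch choices from the lecture).

```r
library(MASS)
data(crabs)
pc <- prcomp(crabs[, 4:8])

# Empty frame of the right size: type = "n" sets up axes without drawing points.
plot(pc$x[, 2], pc$x[, 3], type = "n", xlab = "PC2", ylab = "PC3")

# Circles (pch 21) for males, triangles (pch 24) for females; bg sets the fill.
points(pc$x[  1:50,  2], pc$x[  1:50,  3], pch = 21, bg = "blue")    # blue males
points(pc$x[ 51:100, 2], pc$x[ 51:100, 3], pch = 24, bg = "blue")    # blue females
points(pc$x[101:150, 2], pc$x[101:150, 3], pch = 21, bg = "orange")  # orange males
points(pc$x[151:200, 2], pc$x[151:200, 3], pch = 24, bg = "orange")  # orange females
```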
So this scale value here makes larger and smaller triangles and circles based on the first principal component. The small symbols are small in the first principal component, the large ones are large in the first principal component. And what you see is actually interesting: as we hypothesized before, the smaller the first principal component, which presumably corresponds to age, the more similar the different crab sexes become. So the juveniles seem not to have significant morphometric distinctions between males and females. But as they grow in size, they become more distinct. Presumably the males get wider shoulders and the females get wider hips, or whatever that corresponds to in crabs. [Audience question: if the population was sampled randomly from the sea, wouldn't you expect a range of different ages, unless they were all of similar age?] Well, basically what this says is that since the five measurements we have essentially characterize the shape of the carapace, the shape changes similarly along all of its dimensions as the crab ages: all the dimensions get larger, linearly. This is why we have these very high correlations. And the subtle changes in proportion, the shape overall getting wider or more distinct in some directions, these subtle changes and differences are only picked up once we account for that first principal component. We basically remove it by looking only at components that are orthogonal to it. And that gives us the subtle shape changes that distinguish the categories so nicely. And we have that additional information: the smaller they are overall, the more similar the sexes are going to be; as they grow larger, they develop different shapes. [Audience comment, partly inaudible, about seeing the ordering with the bigger ones.] Yeah. 
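One way to sketch the size-scaled version of the plot: map PC1 (the stand-in for overall size or age) onto the cex parameter. The particular mapping of PC1 to symbol sizes here is an arbitrary choice, not necessarily the one used in the sample solution.

```r
library(MASS)
data(crabs)
pc <- prcomp(crabs[, 4:8])

# Rescale PC1 into a cex range of roughly 0.5 .. 2.5 (arbitrary choice).
sz <- 0.5 + 2 * (pc$x[, 1] - min(pc$x[, 1])) / diff(range(pc$x[, 1]))

plot(pc$x[, 2], pc$x[, 3], type = "n", xlab = "PC2", ylab = "PC3")
points(pc$x[  1:50,  2], pc$x[  1:50,  3], pch = 21, bg = "blue",   cex = sz[  1:50])
points(pc$x[ 51:100, 2], pc$x[ 51:100, 3], pch = 24, bg = "blue",   cex = sz[ 51:100])
points(pc$x[101:150, 2], pc$x[101:150, 3], pch = 21, bg = "orange", cex = sz[101:150])
points(pc$x[151:200, 2], pc$x[151:200, 3], pch = 24, bg = "orange", cex = sz[151:200])
```

Small symbols (small PC1, presumably juveniles) should then overlap across sexes more than the large ones do.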
Which means it's just different. Exactly, exactly. Now I'd like to take you through a different data set, one of the classic data sets of systems biology. I think we'll just get started with this and continue tomorrow morning. I don't want to skip over this data set, because it's really crucial for our next topic, and it's a great introduction to that next topic of clustering: taking high-dimensional data sets and identifying parts that are similar and dissimilar. So this is a data set from 1998, ages ago, before most of you were even born. It was the first high-throughput data set that looked at the expression of yeast cell cycle genes. The first author on that is Raymond Cho. And this was really fantastic when it was published, really the first paradigm of a new age coming along: we have yeast cells, and now we can look at all of the genes in parallel in a microarray analysis, and we can learn so much about global variation of expression profiles. So this data set observed synchronized yeast cells through division cycles and plotted the expression levels of basically all the yeast genes against time points in the cell cycle. We have some genes that vary with the cell cycle, some that go up and some that go down, and from that people could slowly start unraveling the genetic control networks that control the yeast cell cycle and provide its effectors. So this set of expression profiles, which is in this table here, contains 237 expression profiles of genes that are known or suspected to be involved in cell cycle regulation. If we look at that expression data set, the yeast gene names are here, the systematic gene names, and these are the individual measurements. There are 17 time point measurements, and these are the individual levels. This is text data, which we can read with the function read.table. 
We skip the first row, and we also skip the columns with the gene name and the class. The class is a label that basically categorizes the different types of genes that we see here. So let's see what we have when we read that in: 237 rows of numbers, and the function read.table has given us arbitrary column names, which is just fine. All we need to know at that point is that these column names correspond to successive time points along the yeast cell cycle. Okay, now the first thing we usually want to do is to compare the general trends of the measurements, and we can do a box plot of that. So here we go. What does this tell you? What would you expect? These are 17 time points of different populations of synchronized yeast cells undergoing the cell cycle. What should we expect for something like that? And what do we see? What's been measured? Expression: these are log expression values measured with microarray technology, one of the early experiments of global microarray analysis of gene expression. That's what I would expect: on average, we should always see the same value. Do we see that? Well, not quite. There seems to be a bit of between-experiment variation. You might argue that perhaps there's a trend here which can be followed, but especially this discontinuity between time points V11 and V12 is probably not physiological. That's variation that is due to experimental methodology. So it's probably worthwhile to normalize this data a bit before we analyze it. If we're looking for expression changes in the cell cycle, we're really not that interested in the absolute values of expression, but more in whether the values change in a cyclical fashion. And so as a first step to normalize the data, we can simply subtract the row mean from each row, and then normalize the data overall by subtracting the column means and dividing by the column standard deviations.
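The reading and inspection steps might look like this in R; the file name "cellcycle.txt" is a placeholder, since the lecture's actual file path is not given, and the exact column layout is an assumption:

```r
# Read the expression table, skipping the header row
# (file name is hypothetical).
raw <- read.table("cellcycle.txt", skip = 1, header = FALSE,
                  stringsAsFactors = FALSE)

# Drop the gene name and class columns; keep the 17 time points.
dat <- raw[, -(1:2)]
dim(dat)       # expect 237 rows, 17 columns (V1 ... V17)

# Compare the general trends of the time points.
boxplot(dat)
```

The arbitrary V1 to V17 column names that read.table assigns are what the box plot above is labeled with.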
So basically what this does is normalize our data to reflect only the variation, and we make the within-experiment variations all the same. So we assume there's no difference in variation between the first time point and the second time point: the variation should always be the same, and the mean should always be the same. And if the values don't quite reflect that, then there's probably something wrong with the values, and we should normalize for that. So we define a function to normalize the data, run it over our columns, and that's what we get. These are some simple normalizations that we apply prior to our data analysis, to focus on the actual question that we have, on the actual kind of analysis. And to help our interpretation, we can change the column names. For example, using the paste function and pasting together the letter T and the numbers one to 17, we get the 17 strings T1 to T17, which obviously correspond to time point one, time point two, and so on, and which we can then assign to our column names. So that's what our data looks like now. Now, this kind of scaling, putting everything onto a common range, is actually quite important for principal component analysis, because principal component analysis, unlike regression analysis, is quite sensitive to the absolute values that you look at. In a simple example: if my x-axis here were condensed by half, much of the variation here would be lost; if my x-axis were condensed to one quarter, the variation would appear to be maximal in the y-axis, and this would completely change the analysis. So principal component analysis is sensitive to scaling. Very often, therefore, the first step for principal component analysis is to normalize your data to a mean of zero and a standard deviation of one, so as to capture not the absolute scale but the actual variation that's in the data.
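The lecture's own normalization function is not shown, so this is one possible implementation of the steps just described: center each gene (row), then standardize each time point (column), and relabel the columns with paste:

```r
# Subtract the row mean from each gene's profile.
centered <- t(apply(dat, 1, function(x) x - mean(x)))

# Per column: subtract the mean and divide by the standard deviation.
norm <- scale(centered)

# Rename the columns T1 ... T17.
colnames(norm) <- paste("T", 1:17, sep = "")

colMeans(norm)        # column means are now (numerically) zero
apply(norm, 2, sd)    # column standard deviations are one
```

After this, every time point has mean zero and standard deviation one, which is exactly the common range that principal component analysis needs.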
Now this is the principal component analysis. Note that it looks very, very different. We have two principal components which are somewhat larger, and then it slowly tails off; we don't have a single dominating first principal component. We can compare this, for example, to a matrix of random numbers, to get an idea of what completely uncorrelated values look like under principal component analysis. So this is a principal component analysis of random numbers. The analysis that we've just done for our cell cycle genes seems to have a certain potential for dimension reduction, where some of the dimensions are less important, but it's quite different from these purely random correlations. Okay, so now we can look at the first few principal components, plot them in biplots, and see whether there's anything we can learn from that. Setting the global plot parameter mfrow to two rows and two columns, I can plot four similar plots at the same time on one screen. So if I do the biplot here, I get these different plots. Overall it seems that the structure of the 17 principal components is reasonably well resolved, and especially in the first and second principal components there would be interesting substructures. That is, we can look at the expression profiles along the first and second principal components and find that there seem to be genes here that have something in common along these 17 axes. So these, and you had the question before of what these principal components essentially mean, these here are, for the first, second and third principal components, the contributions from the individual axes, from the individual original dimensions. So this first principal component is dominated by this dimension here and also this dimension here.
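As a sketch of these steps in R (which pairs of components go into the four biplots is my own choice, since the lecture's exact commands are not shown):

```r
# PCA of the normalized profiles.
pca <- prcomp(norm)
plot(pca)                        # variance per component: two stand out

# For comparison: PCA of pure noise of the same shape.
rand <- matrix(rnorm(237 * 17), nrow = 237)
plot(prcomp(rand))               # all components roughly equal

# Four biplots on one screen.
par(mfrow = c(2, 2))
biplot(pca, choices = c(1, 2))
biplot(pca, choices = c(1, 3))
biplot(pca, choices = c(2, 3))
biplot(pca, choices = c(3, 4))
```

The per-dimension contributions discussed above are the loadings, available as pca$rotation, with one column per principal component.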
And you kind of see a structure here that says that the ideal gene along this first principal component starts getting expressed as the cell cycle progresses, then its expression level drops, then it rises again, and then it drops again. So you can look at these correlations. Essentially, these are now the individual combined dimensions that make up a principal component, and we can interpret them as being consistent with our expectations about how our data should vary. So this first principal component is high in the first phase and low in the second phase. The second principal component is low in the first phase and higher in the second phase. And the third principal component is kind of anti-cyclic to the other two. So they seem to correspond to different discrete aspects of variation in our cell cycle data. Now, if we want to select some genes and look at what we actually see here, we need to label them somehow with gene names. For example, we can read the gene names out of the original table and get the systematic identifiers of the genes. Now, if we do a biplot of our PCA analysis along the first two dimensions, we get these numbers, and we can start identifying things that we find interesting and presumably possibly related. So I've selected some gene identifiers out here. It's kind of hard to see; if you increase the scale of the plot, it becomes easier. So gene number 193 is down here, 142 is down here, 235 is also down here, 98, 54 and 119. These are all genes that fall into this region of the plot, and I define them as a selection. And here I plot the selection. So the genes that cluster together in this principal component plot, as we can see in the so-called parallel coordinates plot, actually vary in a similar way: they're initially high, then low, then they get to be high again. Now, what does that mean?
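In R, the selection and the parallel coordinates plot might look like this; the row numbers are the ones read off the biplot in the lecture, and the assumption that the gene names sit in the first column of the original table carries over from the earlier reading step:

```r
# Systematic gene names from the original table (assumed: column 1).
genes <- raw[, 1]

# The genes picked out of the lower region of the biplot.
sel <- c(193, 142, 235, 98, 54, 119)
genes[sel]                     # which genes did we pick?

# Parallel coordinates: one line per selected gene across the 17 time points.
matplot(t(norm[sel, ]), type = "l", lty = 1,
        xlab = "time point", ylab = "normalized expression")
```

matplot plots each column of its argument as one curve, so transposing puts the genes into the columns and the time points along the x-axis.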
Well, these are genes that are high in their first principal component and low in their second principal component, i.e. those that correspond most closely to the actual rotations of the first principal component, which I plotted in the previous plot. And indeed they turn out to look quite similar to that. If, for comparison, we simply plot 10 random genes instead, that's all over the place. So looking at this principal component analysis, at the first principal components, allows us to find genes that have similar expression patterns. How else could we have done that? How else could you have looked at a data set of 237 genes and found ones that have similar expression patterns? This is one of the problems of high-dimensional analysis: there's a lot of information in there, but it's hard to visualize and hard to get at. Your collaborator gives you this data and she says, okay, here's my newest experiment. And you get the table of data, and you have these genes, and now you'd like to say, well, which genes are co-regulated? Which ones are similar to each other? But all you have is a spreadsheet of 237 rows and 17 columns. So this is one way. Obviously, another way is to actually calculate the correlation values. For example, we can calculate the correlation between, say, genes 193 and 142. What should I be typing? Show row 193 of the data. Show row 142. Eek, what am I doing here? No, I'm doing it wrong. Ah, because it's a data frame. The correlation of x1 and x2 is 0.625. Is that a lot? What do you think? We've seen correlations before. Pretty decent, right? No? No? Okay. How high do you think the average correlation in this data set is? To judge whether 0.625 is special, we would need to know that.
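A sketch of that correlation, including the stumble: a row extracted from a data frame is still a one-row data frame, so it has to be coerced to a numeric vector before cor will accept it.

```r
# Rows must be converted to plain numeric vectors first
# (this is the "eek, it's a data frame" moment).
x1 <- as.numeric(norm[193, ])
x2 <- as.numeric(norm[142, ])
cor(x1, x2)     # about 0.625 for these two genes in the lecture
```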
Okay, what we really ought to be doing is to run the correlations of all against all, plot the distribution, and then we know what the most significant correlations are. But once again, as a first view into our data: being able to identify groups of data, clusters of data, plot their actual values in high-dimensional data sets, identify ones that seem to be similar, and thus define hypotheses about genes that co-vary in their expression profiles, and thus are potentially co-regulated, and thus potentially all form part of the same biological system, this is a pretty good way to start. So correlation analysis would be one other possibility, and we'll talk about other ways to look at this particular problem tomorrow, because it's a very nice paradigmatic problem.
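The all-against-all idea can be sketched in a few lines: cor applied to a matrix correlates its columns, so transposing gives us gene-against-gene correlations, and the upper triangle holds each pair exactly once.

```r
# 237 x 237 matrix of gene-against-gene correlations.
allcor <- cor(t(norm))

# Each pair once, without the diagonal of self-correlations.
vals <- allcor[upper.tri(allcor)]

hist(vals, breaks = 50)   # where in this distribution does 0.625 fall?
mean(vals)                # the average pairwise correlation
```

Against this background distribution we can judge whether a correlation like 0.625 between two genes is actually remarkable.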