But we will continue with PCA, or rather start off with PCA in a bit of a time-jump mode. So I'd like you to open your RStudio. You've probably left the project that we've been working on together, so choose File, Recent Projects, REDA, which should reopen the project in its last state, and then pull updates on the repository to continue. Then open the file REDA Dimension Reduction; that's what we'll be working with.

Now, I've left the first statement here: the goal of principal component analysis is to transform a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. That's slightly misleading in the sense that, yes, indeed, this is the goal, but that's not actually what PCA does, as you know. The number of dimensions that you get back from PCA is the same as the number of dimensions that you put in; it's only the meaning of the dimensions that has changed. What PCA does in principle is take your data cloud in a high-dimensional space and rotate it in that space, so that its projection, its shadow, onto the new axes captures the variability in an ordered way. Onto the first dimension we'd like to project the highest variability, onto the next dimension the second-highest variability, and so on. Moreover, we'd like to guarantee that all of these dimensions are uncorrelated with each other; in fact, they are orthogonal to each other. None of the individual PCA dimensions carries information about any of the others: the PCs are uncorrelated. And that's useful and important, because it usually allows us to reduce the number of dimensions that we want to consider. Why? Well, if there's any kind of linear correlation between dimensions, then PCA will factor that out. It will put all of the correlated variability into one dimension, and no significant variability is going to be left in the other one, just noise, and then we can remove that from the analysis. So the first principal component is the projection of the data into a single dimension that has as high a variance as possible, i.e. that accounts for as much of the variability in the data as possible.

Now, let's look at a simple 2D example to illustrate that. That's something I love about R: it actually allows you to do interactive mathematics. We set a seed, and we generate two sets of normally distributed random variates, 500 of each, with a mean of 0 and a standard deviation of 1. So, just a Gaussian distribution. And that's what these look like: 500 variates in x, 500 variates in y. It's just a point cloud; it has a mean of 0 and a standard deviation of 1. So this is my variable x1 and my variable y1. Now we'll generate a second variable, y2, whose first part is built from x1: just two times x1. If we did only that, we would of course get a single line like this; that's y2 as just two times x1. And now we add y1 to that. This means x and y still have an uncorrelated part, but now y has a strong component that is predictable from x. I'll plot that again. There we go: now we have a point cloud with a strong linear component. It's essentially the same uncorrelated information that I put in before, but I've added a correlation. Now, the mean of y2 is essentially 0, and the standard deviation of y2 is a bit over 2. That's not surprising, because I multiplied x1 by two before adding y1; in expectation the standard deviation is the square root of five, about 2.24.
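In code, the demo looks something like this. This is a sketch, not necessarily the script's exact code; the seed is a placeholder, and the names x1, y1, y2 just follow the narration:

  # Two independent standard-normal samples, 500 variates each
  set.seed(12345)                 # placeholder seed; use the one from the course script
  x1 <- rnorm(500, mean = 0, sd = 1)
  y1 <- rnorm(500, mean = 0, sd = 1)
  plot(x1, y1)                    # an uncorrelated point cloud

  # Build y2 with a strong linear component coming from x1
  y2 <- 2 * x1 + y1
  plot(x1, y2)                    # point cloud with a strong linear trend
  mean(y2)                        # essentially 0
  sd(y2)                          # a bit over 2: sqrt(2^2 + 1^2) = sqrt(5) in expectation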
And I can rescale y2 by subtracting the mean and dividing by the standard deviation. So now I've rescaled y2. It looks exactly the same; I've just compressed it a little and shifted it on the axis. As a result, I now have a standard deviation of 1 and a mean of 0, of course.

OK, let's look at these two distributions. This is a histogram of our x1, which looks like a normal distribution, of course, because that's how it was generated. And this is the histogram of y2, which as a histogram looks exactly the same as that of y1, because what we see here is essentially the random component. What we don't see is that the values of y2 are actually correlated with x1. There is a correlation component, but we don't see it in the distribution itself; if we look at a variable just on its own, we don't see any of these correlations. Yet most of the variance here is explainable from a single direction, and in that sense a single dimension would be sufficient to explain most of what's happening in this data.

So this is what we can find out with PCA. We use prcomp to run a PCA, and we get two dimensions back. We see that the standard deviation of the first dimension is 1.35, while the standard deviation of the second dimension is only about 0.32, less than a quarter of the first. So if we considered only the first rotated principal component, that single dimension would already cover the lion's share of the variability, and I could remove the rest. All of the shared variability has now been mapped onto this projection here. So this is the rotated version; this is what PCA has done with the data. Basically, it took the point cloud that we saw in the lower right panel, figured out where the dimension of highest variability lies, and rotated it so that this dimension projects directly onto the x-axis.

Right, so the dimensions are just x and y. I've called them x and y because I plotted them on the x- and y-axes; I could just as well have called them, I don't know, age and blood pressure. OK, so once again: this is the original plot of x1 against y2, and this is the plot of the first principal component against the second principal component. That's the relationship. Oh, let me put the two together into one plot. I set up my plotting window with two rows and one column so I can have two scatter plots there. I plot the original data, that was the original plot, and then I plot the PCA result. You notice that it's fundamentally the same plot: the point cloud has just been rotated, so that instead of looking at it from this direction, we're putting this direction on the x-axis. And the result is that almost all the variability is on the x-axis and very little variability is left on the y-axis.

And where does prcomp come from? A package that we installed? No, it comes with R. prcomp, as well as the other principal component function, princomp, are in base R. So prcomp, as you'll see when you type it in your installation without having loaded any package, will just work. Whenever you do need to load a package, that's hopefully, unless I forgot, going to be somewhere in the script; otherwise it's not going to work for anybody, everybody is going to put up red post-its, and we'll have to fix that. So once again, that's important to realize.
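Sketched in code, with the same caveat that the names are mine, the rescaling and the PCA look roughly like this:

  # Rescale y2 to mean 0 and standard deviation 1
  y2 <- (y2 - mean(y2)) / sd(y2)

  # Run the PCA; prcomp() is in base R, no package needed
  pca <- prcomp(cbind(x1, y2))
  pca$sdev                        # sd of PC1 is large, sd of PC2 is small

  # Original data and rotated data, one above the other, on identical axes
  par(mfrow = c(2, 1))
  plot(x1, y2, xlim = c(-4, 4), ylim = c(-4, 4))    # original point cloud
  plot(pca$x[, 1], pca$x[, 2],
       xlim = c(-4, 4), ylim = c(-4, 4))            # the same cloud, rotated by PCA
  par(mfrow = c(1, 1))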
What principal component analysis actually does is basically just rotate a high-dimensional data set in a way that presents the variability of the data in an ordered fashion. Mathematically it's rigorous, it's fast, and it's extremely useful. The one thing about it is that you usually can't interpret what these new dimensions are, because in principle every one of them can carry a little bit of information from any of the original components. Once I plot the first principal component on the x-axis, my new x-axis has information from the original x-axis and from the original y-axis. So if I said this was age and this was blood pressure, then after the principal component analysis this dimension has some information from age and a little bit from blood pressure, and I can't really label it in any other way than as a PC, a principal component.

That's especially annoying in scatter plots, because usually all these scatter plots are useful for is to see whether there's structure in the data: to find areas in the projection of the principal components where things come close together and are far apart from other parts of the data set. But I can't immediately identify what that structure means. For example, if I were looking at age and blood pressure here and I found structure in the principal components, I couldn't immediately say that these subpopulations are people with high blood pressure, or people that are very young. I think we'll revisit that point when we look at t-SNE as an alternative.

Right, so basically we transformed these two histograms through the principal component analysis. If we now look at the histogram of the first principal component, the values are spread very widely. If I look at the histogram of the second principal component, the amount of variability that is still left is very much less. And that is what allows us to do dimension reduction: we can make a choice about which principal components we continue to consider.

Now, R basically has two different functions for calculating principal components: one is prcomp and one is princomp. Fundamentally they do the same thing, but they use different mathematical approaches, and they use different names for the elements of their result lists. So if you've been working with prcomp and somebody shows you code that works with princomp, you just have to be aware that these two equivalent functions name their list elements differently. What is called rotation in prcomp is called loadings in princomp, and the actual rotated values are in the list element x of prcomp but in scores in princomp. So data$x gives you the rotated results of a prcomp call, like we've done here; data$scores would give you the same results from princomp.
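To make that naming difference concrete, here is a minimal comparison on the toy data from above (my own example, not from the script):

  m <- cbind(x1, y2)

  p1 <- prcomp(m)      # based on the singular value decomposition
  p2 <- princomp(m)    # based on the eigendecomposition of the covariance matrix

  p1$rotation          # the rotation matrix; princomp calls this "loadings"
  p2$loadings

  head(p1$x)           # the rotated values; princomp calls these "scores"
  head(p2$scores)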
One thing to note about principal component analysis is that it's sensitive to scaling, because it gives you the dimension of the highest variability. If I again have age and blood pressure, and I plot age in years against blood pressure in millimeters of mercury, or pascal if you're using modern units, that's one thing. But if I use the age in days, those numbers are numerically very much larger, and then all the variability will be in the age-in-days dimension. The principal component analysis won't really do anything useful; it will just use that as its first principal component. So what you usually do to take that into account is rescale the data. There's a command scale in R, and it centers and scales the columns of a numeric matrix. Does the help text actually tell me that it centers on 0 and scales to a standard deviation of 1? OK, our help texts are a bit special. Anyway: scale takes a matrix, a two-dimensional data object, goes through all the columns, and for each column changes the data to have a mean of 0 and a standard deviation of 1, so that all columns are numerically comparable. That's something you could easily compute yourself, by the way: just take the vector, subtract the mean of the vector, and divide the result by the standard deviation of the vector. So if you don't remember that R has a scaling function, you can simply write your own; it's just x minus mean of x, divided by the standard deviation of x, and since all these operations are vectorized, you get it in one expression.

OK. One really interesting example for PCA is basically the only prepackaged R dataset that I use in my teaching, but this one is really nice. There's a package called MASS, which most of you might already have; try library(MASS), and if you don't have it, install it with install.packages in the normal way. Once you've loaded MASS, you can get at the prepackaged datasets that come with it. One of them is called crabs: if you call data(crabs), the crabs data will be loaded. These are morphometric measurements of crabs. Some people choose their research topics much more smartly than we usually do: these people chose to study crabs in the beautiful clear waters off Fremantle in Australia. So they got to go scuba diving and collect crabs. I think that can't be beat. They collected two different species of crabs, blue and orange, and two... I can't remember, two sexes or two genders? Probably it's sex, right? With crabs, it's usually the same thing. OK, anyway: they collected males and females, took calipers, were careful not to be pinched, and measured the frontal lobe size, the rear width, the carapace length, the carapace width, and the body depth. So you do some morphometric measurements, and after you're done, you probably put the crab into a nice pot of boiling water and study it some more. I should do oyster studies.

Anyway, let's annotate these crabs. Let's start with str(crabs). So species and sex are factors, each with two levels: one has B and O for blue and orange, the other M and F for male and female. There's a column with the index, which is just a number, and then there are five numeric columns with the individual measurements. In order to annotate the crabs, we can paste the values of column one and column two together, using a dot as the separator. That gives us B.M and B.F and O.M and O.F for blue males, blue females, orange males, and orange females, labels that we can use to identify them. So we have two columns with two states each.
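In code, the hand-rolled scaling and the annotation might look like this; myScale and crabType are my own names for illustration:

  # One-expression equivalent of scaling a single vector
  myScale <- function(x) (x - mean(x)) / sd(x)

  library(MASS)        # install.packages("MASS") first if you don't have it
  data(crabs)
  str(crabs)           # sp and sex are factors; columns 4 to 8 are the measurements

  # Combine species and sex into one label: "B.F", "B.M", "O.F", "O.M"
  crabType <- paste(crabs$sp, crabs$sex, sep = ".")

  # Trellis plot of the five measurements, one plotting character per type
  plot(crabs[, 4:8], pch = as.integer(factor(crabType)))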
But we can combine these two columns to get four different states, so that we can distinguish the four different types of crabs that we have in our analysis. So if we plot all of these data against each other, and this is plotting the crabs data, columns four up to eight, so five columns against the same five columns, we can see the individual correlations immediately in this trellis plot. This is the correlation of frontal lobe with rear width; this is the correlation of rear width with carapace width; there's a correlation here, apparently a very strong one, of carapace length with carapace width; and so on. So, very high correlations. But if the challenge is now to distinguish the blues and the oranges, and the males and the females, from the morphometric measurements, or to paraphrase it the other way around: do these different types of crabs have distinctly different body shapes? Then it's actually quite difficult, because there's a lot of overlap. You see that we have different plotting symbols, plotting characters one, two, three, and four, which correspond to the factors we've combined here: B.F, B.M, O.F, and O.M. There seems to be some separation here; in this dimension the triangles, and I don't even know what the triangles are, tend to be higher up and the circles tend to be lower down. But there are regions where this totally overlaps, and I could not possibly distinguish the types from each other just by taking the measurements in any two of these dimensions. So there's something there, but it's not obvious from just looking at the pairwise correlations. And if there's overlap like that, maybe principal component analysis can help; maybe principal component analysis will show us this data in a way where we can start distinguishing the types.

So this is your task now. Apply PCA to the crabs dataset to distinguish species and sex from the morphometric measurements. Plot the results of the important PCs as a scatterplot in which blue males are shown as blue triangles, orange males as orange triangles, blue females as blue circles, and orange females as orange circles. And if you're done while everybody else is still sweating over the task, you can think about how to scale the plot symbols with the mean of all individual measurements. So this is a little mini-project. What do we do when we have a task like that? Break it down into individual steps. How do we go about doing this? What are the steps; what would you do at home? "Go to sleep and take a nap." I like the way you're thinking on the morning of the fourth day of the workshop, I really do. "Scale the data first." Yeah, why not. I didn't even think of that, but yes, according to what I've just said, that's a good thing to start with. "Subsetting?" What would subsetting do? It would extract the different groups, right, but then our principal component analysis would not be able to find the things that compare and contrast the groups with each other. So in this case you actually need to keep the data together. But scaling, that looks like a good idea. I'm really curious now, because I must admit I've never actually done this. So this is of course the same plot, column five against column six, after scaling. Looks very similar, right? So what's different? The only thing that's actually different is the axes.
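Roughly in code, this on-the-fly comparison could be (sc is my own temporary name for the scaled matrix):

  # The same two columns, unscaled and scaled; sc columns 1..5 match crabs columns 4..8
  sc <- scale(crabs[, 4:8])
  par(mfrow = c(1, 2))
  plot(crabs[, 5], crabs[, 6])     # axes roughly 6..20 and 15..45
  plot(sc[, 2], sc[, 3])           # both axes roughly -2..2
  par(mfrow = c(1, 1))

  # Forcing both onto the same numeric scale shows what PCA "sees"
  plot(crabs[, 5], crabs[, 6], xlim = c(0, 50), ylim = c(0, 50))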
In the original, one axis goes from 15 to 45 and the other from 6 to 20. After we scale, both go from about minus two to two. So we're not scaling for us; we're scaling for the principal component analysis. For us this looks exactly the same except for the different axes. For principal component analysis it's very different, because principal component analysis sees the unscaled data in a different way. You have to think of it like this: if we put the two columns on the same numeric scale, so the x-limits go from zero to 50 and the y-limits go from zero to 50, then the principal component analysis will say: well, this one, column six, is a lot more important than that one; there's more variability in it. Do you see the difference? On the same numeric scale, the projection onto dimension six seems very much more important than the projection onto dimension five, whereas if we put them both on a scale from minus two to two, essentially, the dimensions look equally important. So this is the effect of scaling. Good point, thanks for raising that.

Okay, scaling as a first step is a good idea. But then? Apply the PCA, or rather assign the result of the PCA to some variable. And then what? "Plot." Plot what? "PC1 versus PC2." I mean, we can do it in higher dimensions too. But is that the first thing you do with a PCA? Is that the first thing you did yesterday with your PCA? What was the first thing? I know that you know. Look at the variances, right? So you inspect the PCs, you look at their relative importance, you maybe give some thought to whether there's anything interpretable about the PCs. And then you decide on useful principal components to plot, or you choose several, plot them all pairwise against each other, and see whether your data becomes interpretable, whether you can see structure in the data that you can, for example, use to classify.

But for the plotting, I wanted you to plot in a particular way. Essentially you write a plotting call that takes some of the PCA$x columns, column one, column two, or column three or column four, as the x and y values, plus some information that allows the plot to draw circles or triangles and color them blue or orange, depending on the type of crab. So you'll need to somehow identify what the types are. As a hint, the parameter pch, the plotting character, can be used to select filled circles and triangles. And as usual, the parameter col can be used to make orange and blue colors. The way you use pch and col here is that you don't give them a single value; you give them a vector of values, where each element corresponds to whether the data point is supposed to be orange or blue. So if you have two x-and-y points to plot, and the first one is going to be orange and the second one blue, your color vector is going to say orange, blue. If the first one is a male and the second a female, your pch vector is going to say, whatever the numbers are, but essentially: triangle, circle. If you have five points, you have five entries: circle, triangle, circle, triangle, something like that.
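One way to build those per-point vectors; myCol and myPch are my own names, and 17 and 19 are the filled triangle and filled circle:

  # One color per observation: orange species "O" orange, blue species "B" blue
  myCol <- ifelse(crabs$sp == "O", "orange", "blue")

  # One symbol per observation: filled triangles for males, filled circles for females
  myPch <- ifelse(crabs$sex == "M", 17, 19)

  length(myCol)   # 200, one entry per row of crabs
  length(myPch)   # same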
So the length of the pch vector and of the col vector should be the same as the number of rows, the number of data points that you're plotting, and then every single data point gets its appropriate color and symbol. Right, and if you want to scale the symbols, the same goes for cex, the character expansion: if you give it a vector of values, you get small and large symbols for every single data point. You can probably put all of this into one plot command, but you can also just plot an empty frame of the correct size first, by plotting everything but specifying type = "n". This will set up the frame and not actually plot any points; then you plot the points individually with the points function. The points function works exactly like the plot function, except that it doesn't create a new frame; it plots into the last frame that was plotted, and you can feed it individual points. What I often do, for example, when I want to label points, is plot a whole cloud of points with plot and then use the points function to overplot, say, three or five of them that I want to emphasize with a red dot. So you can either do this in one plot call, or plot just an empty frame and then use points. That's just a hint; we'll come up with different ways. That's just how I would do it.

All clear? Well specified? Is the specification missing something? Yes: interpret your data. What is the interpretation we're looking for at the end? The interpretation is the answer to the question: can we clearly identify the crab types from these five morphometric measurements?
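Before you start, the empty-frame-plus-points pattern, sketched on two of the raw measurements (any two columns would do; myCol and myPch are from above):

  # Set up the frame without drawing anything: type = "n"
  plot(crabs$FL, crabs$RW, type = "n",
       xlab = "frontal lobe size", ylab = "rear width")

  # Now add the points into the existing frame
  points(crabs$FL, crabs$RW, pch = myPch, col = myCol)

  # The same trick lets you emphasize individual observations afterwards
  points(crabs$FL[1:3], crabs$RW[1:3], pch = 19, col = "red")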
[The recording continues with fragments of one-on-one discussions while participants work on the exercise; the audio here is unintelligible.]
I think most of you have made excellent progress along the way, and I think we can basically go through the principles quickly before we break for coffee. So what were we trying to do? The first step we had decided on was to scale the data, and that's very simply done. First I just copy my data frame over, so crabsS is the same data frame as crabs. Then I assign the scaled values, scale of crabs columns 4 to 8, to crabsS columns 4 to 8. Those are the numeric columns; columns 1 to 3 hold the annotations, and those I leave alone. So the structure is exactly the same: we still have our annotations of species and sex and the row index, but the actual values are now different, because they're scaled. This is the original box plot, where the numeric values are indeed on somewhat different scales, and this is the box plot of the transformed version, which of course looks like that. So that's the scaling done.

The next thing we wanted to do was to run the PCA and assign the result to some variable. What do I type? prcomp of crabsS columns 4 to 8, I believe. So we have values along five different components, and the different result elements here. Let's see what we have: plot of PCA. Okay, what is this? What does that plot show us? What's on the x-axis? Right: this is a bar plot of the different principal components, the first principal component on the left, and so on. What's on the y-axis? "The contribution to the variance." Exactly, the contribution of each principal component to the variance. And you can see that most of the variance, almost all of the variance, is in the first principal component.
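Put together, the walkthrough so far as a sketch; crabsS and PCA are the names used in the session:

  # Step 1: scale the numeric columns, keep the annotation columns as they are
  crabsS <- crabs
  crabsS[, 4:8] <- scale(crabs[, 4:8])

  # Step 2: run the PCA and keep the result
  PCA <- prcomp(crabsS[, 4:8])

  # Step 3: inspect the variances per component before plotting anything else
  plot(PCA)       # bar plot of the variances of the five PCs
  summary(PCA)    # proportion of variance per component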
There's so much information in there. So why, when we looked at our plots, was it impossible to simply take the first principal component and be done with it? If we rethink what we're seeing here, what this really says is that regardless of which measurement we're looking at: if the frontal lobe is small, then the rear width is also going to be small, and the carapace length is going to be small, and the carapace width is going to be small, and the body depth is going to be small. And if the frontal lobe is large, then all of these measurements are also going to be large. What does that mean? There's something we weren't considering here. "Our different measurements are actually related, correlated?" They're all correlated with each other in the same way. Why would that be? "Because they scale as they grow, in roughly the same way." Exactly, they scale as they grow in roughly the same way. So we have small crabs and large crabs; small crabs have a small frontal lobe and rear width and carapace length and width and depth, and large crabs have large everything, and that's what we see in our data. What we have here is a confounding factor that influences all of the variables in a similar way, not exactly the same way, but a similar way. And the differences, the relative differences between these frontal lobes and carapace widths and so on, that's where the information actually is. It's not in the absolute values; it's in the relative values. And that's obscured, because if we plot the data like that, all we see is that everything is highly correlated. So in this case the highest variance is in the first principal component, but that doesn't mean it's the most important one for interpreting our data set. It's the one that we remove. And once we remove it, we're probably going to be much happier and much more able to distinguish what we're actually looking for.

Jennifer, you have a question? "I was just wondering: if you reordered the components, would you still get the same principal components?" If I reorder my measurements, or the principal components? Reordering the principal components doesn't make sense, because they're always ordered so that PC1 has the highest variance. And if I simply permute the columns of my measurements, I still get exactly the same principal components. So the order of the measurements is not important, and the order of the principal components is always determined by the size of the variance. "But if you remove PC1, then more of the variance gets assigned to the other components." No, I'm not removing it from the calculation. I'm just saying that when I interpret my data, I can leave PC1 aside, because it factors out something that is apparently common to all my variables. It's important, but it's not what I'm looking for. So usually, when you think of PCA as dimension reduction, you have a reflex of saying: okay, we take the first five principal components and throw away everything else. Well, in this case we actually take the second and third principal components, we throw away the first one, and we probably throw away the last one as well. So it's not that you always do the same thing and just take the highest principal components; it depends on what these principal components signify. And in this case, the first one identifies a confounding factor.

"So what are the criteria for deciding which ones you use?" You look at it and make an informed decision. I don't think there's a good principled way of deciding how many of the principal components you use. It's an exploratory method; you can't really technically interpret these principal components in the first place. So what you do is look at them, try to understand what's going on, and then check whether using any of them, in any combination, is helpful for answering the question that you have about your data. I've been looking, and I haven't found a package that has a method to choose principal components; I don't know if you're aware of any.
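You can actually see the size interpretation directly in the loadings. If all five measurements enter PC1 with the same sign and similar weights, then PC1 is essentially overall size, the confounder we want to set aside. A quick check (the exact numbers depend on the data, so treat this as a sketch):

  # Rotation matrix: rows are the original measurements, columns are the PCs.
  # A PC1 column with one sign throughout and similar magnitudes encodes
  # overall crab size.
  round(PCA$rotation, 2)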
"Your aims are different each time, too. Like yesterday, we were kind of looking for batches across the cancer samples; today we're trying to identify something like a summary of the data that separates these crabs. So I think that's the other problem: you go into it with different aims. In genetics you might go in wanting to control for something." But here we already know what that is, right? Well, after thinking about what we're seeing here, after basically having our noses pushed into the data this way. "Exactly. And what do those differences represent? Something you don't know about, like a batch effect, or something related to the aim you went in with? It's hard to tell which one you're seeing." Right.

So, if we plot component two against component three, we see a lot of structure, a lot of clustering that wasn't apparent before. We have a cluster here, and a cluster here, and a cluster here, and a cluster there. Now, do any of these clusters actually correspond to blue males and orange females? I have no idea. Just looking at that plot, we could not possibly tell, because the principal components themselves mix information from all of the variables at the same time. So after the coffee break, what we're going to do is find a way to color the points and give them shapes and other identifiers. And this is a really, really important addition to your exploratory data analysis toolkit. You'll do a lot of scatter plots, but they only become meaningful when you can label them in some way. Okay? We'll do that after the coffee break. We'll reconvene at 11.
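For reference, the labeled plot we'll build after the break could look roughly like this, reusing the myCol and myPch vectors from above:

  # Second versus third principal component, labeled by species and sex
  plot(PCA$x[, 2], PCA$x[, 3],
       pch = myPch, col = myCol,
       xlab = "PC2", ylab = "PC3")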