So, we've got 10 more minutes, Francis, is that right? Yes. Okay. So, I could either go like crazy through 30 slides in 10 minutes, or I could sort of stop here, take a lot of questions, and maybe go over something that you do not understand, and I think that would probably be better. And then you can read the other material on principal component analysis, which is pretty small anyway. I wasn't going to talk too much about it. So, which one do you think is best? Okay.

So, here it is, that's the definition. Now, I could give you a definition, but I'm going to try to tell you what it does, which I think is a lot more important than just the definition. So, you can have the definition, that's nice, but let's go through the examples, which would be better.

So, there are two things. People often talk about singular value decomposition or principal component analysis. These are almost exactly the same thing. PCA is kind of like doing singular value decomposition, or SVD, when you standardize your data. In fact, the idea behind them is exactly the same. It's just that the way you define it is slightly different.

The idea of principal component analysis is to say, okay, let's try to look at our dataset. Typically, it will be a multivariate dataset. So, it could be a gene expression dataset or a flow cytometry dataset. We've got several variables. And because you've got lots of variables, it's very difficult to visualize a multivariate dataset when you have more than two dimensions, right? We're not very good at looking at datasets that have more than two dimensions, because it's hard to visualize them on the computer or on a piece of paper. Three dimensions, you can still do it. But still, it's quite difficult. So, what principal component analysis is trying to do is to say, okay, let's try to look at the data and let's try to put it in a way that's going to be much easier to visualize.
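The link between SVD and PCA mentioned above can be written down in a few lines. This is a sketch in standard notation, not the definition from the slides: the symbols X, U, D, V, S, n and p are assumptions introduced here, with X the n-by-p data matrix whose columns have been centered (and, for the standardized version, scaled to unit variance).

```latex
% SVD of the centered (or standardized) data matrix
X = U D V^{\top}, \qquad D = \operatorname{diag}(d_1 \ge d_2 \ge \dots \ge d_p \ge 0)

% The sample covariance (or correlation) matrix then has the eigendecomposition
S = \frac{1}{n-1} X^{\top} X = V \, \frac{D^{2}}{n-1} \, V^{\top}

% Columns of V: principal directions.  Projected data (scores): X V = U D.
% Variance along the j-th principal direction: d_j^{2} / (n-1).
```

So the principal directions are the right singular vectors of the centered data, and the amount of variability along each one comes from the squared singular values.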
Let's try to reduce the dimension of the data by only keeping the information from the data that's most relevant. That is, what we're going to try to do is to look at the data in a way that shows us the most variability in the data. So, let me show you an example of this.

So, why is it nice? Because it's a dimension reduction technique. And this is very nice because sometimes you've got a huge multivariate dataset, you don't know how to look at it, it's very difficult. So, you can sort of simplify the dataset. And this is maybe going to make it easier to do some clustering. And you're going to hear about clustering next. Or maybe to do some discriminant analysis. So, let's say you've got two big cohorts of patients, one that has some kind of disease and one that doesn't have the disease. And you'd like to discriminate between the two by taking a sample of something. But if the sample is multivariate, like you do a flow cytometry experiment, you take a blood sample, and then you've got tons of markers, you have no idea how to look at this data. It's multivariate. How can you find something that will help you discriminate with it? So, maybe you can try to reduce the data to something that's simpler to look at, and that will help you in trying to discriminate.

In my opinion, it's most important when you do exploratory data analysis. So, we're trying to find the most important signal in the data. And this will help us in projecting the data to look at things as we like.

So, here's an example. This is a two-dimensional dataset. It's really a toy example. But I wanted to start with something like that because it's going to be easier to understand. So, what principal component analysis is trying to do is to look at the data, which here is two-dimensional, and try to find the direction in the data that has the most variability. And that's this one in red. It's saying that if you follow this line, this is where there is the most variability in the data.
And this is called the first principal component. Then after that, it's going to say, let's try to find the second direction with the most variability that's orthogonal to the first one. That will be the green one. In fact, here, there are only two dimensions, so you can only have two orthogonal directions. The first one is the red. The second is the green.

So, the idea of principal component analysis is that you are given these things in two dimensions, and maybe you'd like to reduce the dimension. Here, you can say that there's almost nothing going on in the green direction. So, you could almost say, well, I only need the red direction to look at the data, to explain most of the variability in the data. So, the first idea is to say, OK, maybe I can remove some of the things that I don't really need. Here, it's a two-dimensional plot, but in fact you can see that the green direction does not show you very much. There's nothing going on in that direction. So, you could just retain the red direction, and you're going from a two-dimensional to a one-dimensional dataset.

Then the other thing is, you might say, OK, maybe there's a better way to look at this data. Maybe I shouldn't use the x- and y-axes like I have. Maybe I could rotate it so that the red line becomes my x-axis and the green line becomes the y-axis. And it will be easier to visualize.

Is that sort of clear what we're doing? So, you're trying to find the line with the most changes, the most variability in the data, and then you're going to look for the orthogonal direction that has the most variability in the data. And in three dimensions, it's the same. You're looking at a dataset in three dimensions. You're trying to look for the direction with the most variability in the data. Then you're looking for one that's orthogonal to it and has the most variability in the data. You're going to find that one. And then, orthogonal to these two, you don't have any more choices, because it's three dimensions. So, let's say the dimension is n.
Let's say you found the first n minus 1 principal components. Then the last one is given, because it has to be orthogonal to all the others.

Typically, you could do it either way. It depends what you want to look at. So, let's say you've got a two-dimensional array, genes by samples. You might say, my variables are samples, and I'm going to try to look at the samples in terms of their gene expression. Or, the other way, you could say, I'm going to try to look at gene expression as a function of the sample. So it depends what you care about and what you want to look at. Typically, it will be genes. But sometimes, people want to look at samples. For example, let's say you've got 20 cancer patients. And maybe you'd like to explore the different types of cancer to see if there are any subtypes or something. So in that case, maybe you want to look at the samples as a function of the expression. So this is kind of looking at the gene signature, things that maybe will help you to visualize the different subtypes of cancer. And it's hard to visualize that when you've got 10,000 genes. But maybe there are only a few genes that will help you to discriminate between the subtypes.

Well, here it represents nothing, because it's a toy example. But in that case, yes, it would.

So here, this is another toy example. There are sort of two artificial clusters. And you can see that this time, PCA is saying this is the first direction, where there are the most changes, the red one. And then this is the green. And it's orthogonal to the red. And this is the value of the contribution of each of the principal components. So you can see that this one explains most of the variability. And then this one explains a bit of the variability. So then, using the principal components, you can say, OK, let's try to redraw the axes so that the first principal component becomes my x-axis, or rather the y-axis here, and then the green one becomes the x-axis.
So you just rotate using the principal components and say, this is the best way to look at it.

So I'm going to show you some examples that are more interesting, because these are sort of toy examples. Actually, let's go into R. And I'm going to show you that, because I think this is pretty interesting. So this is a dataset that's available in R. It's called the crabs data. It's just a bunch of crabs. I'm going to show you the data. So you're just measuring a bunch of crabs. There are some blue and orange crabs, male and female. So you should have four groups: male, female, blue, and orange.

So if I look at the data, this is what I see. So OK, you're measuring four variables. And I can't remember what they are, but I think it's the length of the legs, of the body, or things like that. And based on the four variables that you measure, you will want to see the four different classes of crabs: blue female, blue male, orange female, orange male. And in fact, here I show you the different classes of crabs with different symbols. And you can see that it's very difficult to discriminate between the four subgroups. They almost all look the same here.

So if you look at the data, these are all the possible 2D scatter plots using the four variables that we have in the dataset. I can't see very well where the four groups are, even though I put in the symbols. So you can see there are triangles, pluses, circles, and crosses. It's very difficult to see the four different groups. It's just because maybe these four variables, these projections, are naturally not the best way to look at the data. But maybe what we can do is say, OK, let's try to do singular value decomposition to try to find directions that might be more interesting, where there's the most variability, and try to project the data onto these directions. So here I show you another two-dimensional plot.
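The idea just described, finding the directions of most variability with a singular value decomposition and projecting the data onto them, can be sketched in a few lines. The lecture does this in R on the crabs data; the sketch below uses Python with NumPy on made-up data instead, so the function name `pca_scores` and the simulated matrix `X` are assumptions for illustration only.

```python
import numpy as np

def pca_scores(X):
    """Project the rows of X onto its principal directions using SVD."""
    Xc = X - X.mean(axis=0)                      # center each variable (column)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                           # data projected onto the directions
    variances = s ** 2 / (X.shape[0] - 1)        # variability along each direction
    return scores, Vt, variances

# Hypothetical data: 200 observations of 4 correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 4))

scores, directions, variances = pca_scores(X)
first_two = scores[:, :2]   # coordinates for a 2D plot of components 1 and 2
```

Each row of `directions` is one principal direction, and the rows are orthogonal to each other. Plotting any two columns of `scores` against each other gives a 2D view of the data along the chosen pair of components.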
So if I zoom in on one of these 2D scatter plots, you can see there's one group over here. There's another group over here. There's one here, one here. So you can sort of see there's a slight difference between the groups, but it's very difficult to see them.

Now if I do a singular value decomposition and plot the data projected onto the principal components, this is what I get. OK, this is just plotting the first and second principal components. So it's a bit better, but you still don't see a clear separation of the groups. So remember, there are several principal components. Typically you will look at the first few, which explain most of the variability. Here I tried one and two, but in fact, one and two wasn't the best. So I've also tried two and three to see what I get. And if you try two and three, you're going to get something that's much better, OK? So here this is two and three, and you can see that there's a much better separation. There's one group, another group, one here, and one here. So looking at the data that way, you can actually see a much better separation of the four groups, OK?

And in fact, you can see that if you take the four variables into consideration, you can almost separate or discriminate the four groups. But it's very difficult to do it. There's information coming from all four variables. And if you were just to look at 2D scatter plots, you wouldn't see that. But if you do principal component analysis, you kind of see that here, yeah.

So this is really exploratory. At this point, I'm not trying to discriminate or anything. It's just to try to help me visualize the data. Maybe now you could say, OK, I've done PCA, and I'm going to keep the first three or first four principal components. And then I'm going to try to do clustering on these. And you're going to hear about clustering next. But this could maybe help you to cluster your data better.

OK, so that's a good question.
So typically you don't really know how many you should keep. But what you could do, so here I look at an expression dataset. I'm running a bit out of time, but I'm just going to show you that plot. Actually, it's in my slides. What you can do is this: when you do PCA, it will not only give you the principal components, but it will also give you the strength of each principal component. So here you can plot that, kind of like a bar plot. Here you can see this one is very important. This one's pretty important. This one's pretty important. And then it's becoming slightly flatter over here. So you could sort of look at this plot. Of course, it's going to be slightly subjective. But you can see that after the first three components it doesn't change very much. There's not much of a gap. So maybe you could stop there. So looking at these kinds of plots can help you a little bit in choosing. But there have been lots of debates, lots of theory, on how you could choose. People will typically say, well, maybe pick the first few principal components that together explain maybe 80% of the variability in your dataset, or something like that. But there's no rule. These are more rules of thumb that you can use.

Which component is it? Each component will be kind of like a new axis in your system. So it's kind of like a direction you should look in. So it's like a vector. So a component will be, OK, let's say you're looking at data in two dimensions. A component will be like a line, like the direction in which you should look. So it's like a direction.

Well, it's difficult to do 3D plots, because you don't see very much in 3D. So typically you will do 2D. Yeah, but you can in R as well. You can do very nice 3D plots. But the human eye is not very good at looking at 3D plots, except if you can maybe rotate them and do fancy things. But it's very difficult. So PCA can help you to get a good 2D plot. Because if you just do the basic 2D projection, sometimes you won't see anything.
But if you say, OK, let's try to do PCA to look at the directions where there are the most changes, where it's the most interesting, then maybe doing just a simple 2D plot on the principal components will help you. PCA1, PCA2, PCA3.

No, it's a good question. So if the units of the variables are different, then basically each principal component will be a linear combination of the original variables. So if you've got two things that are very different, you won't be able to put a measure on it, because it will be very difficult. And that's one of the criticisms of PCA: people will say, well, that's nice, but the variables that I get, I cannot interpret them, because there's no scale anymore, there's nothing. So that's why I really say it's a way to explore data, to visualize data. But I wouldn't necessarily say it's a way to do statistical inference. It's just a way to explore datasets.

It's different. So cluster analysis will try to cluster. PCA will not try to cluster, it will just help you visualize the data. So with cluster analysis, you might cluster, but then you still have the same problem, which is that the way you're going to look at your data might not be the right way. PCA will say maybe you can look at it in some other projection. In 3D, for example, you're saying, OK, I've got these three dimensions. These are the axes. But maybe this is not the right way to look at it. Maybe I should rotate it in another way, and this is the way I'm going to look at the data better. So it's just a way to explore the data. Whereas with clustering, you're trying to do inference. You're trying to say there are that many clusters, and here are the points in the clusters.

So I think it's time to stop, take a break, and I'm glad to take questions now.
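Two points from the discussion above, the effect of variables measured in different units and the 80% rule of thumb for how many components to keep, can be illustrated with a small sketch. It is written in Python with NumPy rather than the R used in the lecture, and everything in it is an assumption made for illustration: the function names `explained_variance_ratio` and `components_for`, the simulated data, and the choice of standardizing each variable as one common way to deal with mixed units.

```python
import numpy as np

def explained_variance_ratio(X, standardize=True):
    """Share of the total variability carried by each principal component."""
    Xc = X - X.mean(axis=0)                      # center each variable
    if standardize:
        Xc = Xc / X.std(axis=0, ddof=1)          # put variables on a common scale
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, largest first
    var = s ** 2
    return var / var.sum()

def components_for(ratio, threshold=0.8):
    """Smallest number of components whose cumulative share reaches threshold."""
    k = int(np.searchsorted(np.cumsum(ratio), threshold)) + 1
    return min(k, len(ratio))

# Hypothetical data: two related variables plus one in much larger units.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + 0.5 * rng.normal(size=300)
c = 1000.0 * rng.normal(size=300)
X = np.column_stack([a, b, c])

raw = explained_variance_ratio(X, standardize=False)
std = explained_variance_ratio(X, standardize=True)
k_raw = components_for(raw)
k_std = components_for(std)
```

Without standardizing, the variable in large units dominates the first component almost entirely, so `raw` says more about the choice of units than about the structure of the data. A bar plot of `std` is the kind of plot described earlier for judging where the components flatten out, and the threshold, like the 80% figure itself, is a rule of thumb rather than a rule.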