I'm going to turn it over to the wonderful Dilara, who is taking over for Chitra because she's feeling unwell. So, welcome!

Thank you. Hello everyone, I'm Dilara. I'm a fourth-year PhD student in computational biology in Gary Bader's lab, and today we'll be talking about dimensionality reduction, which is a pretty important concept. If you decide to do anything in bioinformatics, or even just read computational biology papers that aren't heavily computational, you're going to need these concepts, so this will be very useful for you.

Here's a quick overview of what we'll cover today. First I'll go through what dimensionality reduction is: what I mean by "dimension", and why you would want to reduce it. Then we'll go through a couple of very popular algorithms for performing dimensionality reduction: PCA, t-SNE, and UMAP.

So what is dimensionality reduction? It is the process of projecting data from a high-dimensional space to a lower-dimensional space. But what does that even mean? First I'm going to talk a little about what I mean when I say "dimension", and to do that, let's define these terms using data.

Can you all see my cursor? This is a simple example dataset you might work with. Here the columns are different mouse samples and the rows are genes: you might be measuring the expression of these genes in different mouse samples, for instance with RNA-seq. So here we have two genes and six mice. You will come across tables like this very frequently. Usually one axis of the table holds the variables and the other holds the samples, but depending on where the data comes from, you might get the transpose of this matrix, with the variables (or features) as rows and the samples as columns, or the other way around. So always make sure it is very clear to you what your variables are, what you are measuring, and what your samples are. The orientation of the matrix matters a lot when you're actually working in R.

Now imagine we want to visualize data like this; here are a couple of other examples you might work with. For example, you could be measuring, say, protein levels in different cells, again with cells as columns. So can someone tell me what the samples are here, and what the variables are? Exactly. And you can do the same thing with non-biological data: for example, the scores of different students in different courses, like math or science. The courses are the features, or variables, that we look at, and the students are the samples, which are the columns in this case.

Okay, so say we'd like to visualize the differences between these samples. How would we do that? Let's look at the simplest case first: imagine we're only looking at one variable. We have the mouse data again, the samples are the different mice, and our variable is a single gene. To visualize this, I only need one axis of information: a number line.
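To make that concrete, here's a minimal sketch in R of those one- and two-axis plots. The expression values and sample names are invented for illustration; they are not the course data.

```r
# Toy expression table: 2 genes (rows) x 6 mouse samples (columns);
# all values are made up for illustration.
expr <- matrix(c(10, 12,  8, 14,  9, 11,    # Gene1
                  5,  6,  4,  7,  5,  6),   # Gene2
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Gene1", "Gene2"), paste0("Mouse", 1:6)))

# One variable -> one axis of information: Gene1 on a number line
stripchart(expr["Gene1", ], xlab = "Gene1 expression")

# Two variables -> two axes: each mouse becomes a point in 2D
plot(expr["Gene1", ], expr["Gene2", ],
     xlab = "Gene1 expression", ylab = "Gene2 expression")
```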
On the slide, you can imagine that the center here is zero, and once you know where zero is, you've defined your coordinate system and can plot the points along that single axis.

What if we want to visualize two genes? For that we can just use a two-axis coordinate system: we plot the samples on two-dimensional x and y axes. This is basically what we mean by "dimension": these axes of information, which are equivalent to the number of... variables or samples? Variables, right. The dimension of the data equals the number of features, or variables, you are looking at. So to visualize this data, we put gene 1 on the x axis and gene 2 on the y axis, and we plot each data point (one, two, three, four, five, six) according to its gene expression values. Pretty easy. Each of these axes is considered one dimension.

What if we move to three genes? Then we'd have a 3D plot, with the third gene's axis basically coming out of the screen, and we'd place the data points in that space accordingly. But after the third dimension it gets tricky. We can all imagine a three-dimensional space to plot data points in, but how would I visualize the data if I'm looking at four genes, or a thousand genes? In realistic examples... does anyone know approximately how many genes we have in the human genome? Yes, 20,000 to 30,000. So it's basically intractable: you can't have 20,000 axes to look at. That's where dimensionality reduction comes into play: we transform the data into a few new variables that explain most of the differences between the observations.

One of the most widely used dimensionality reduction methods is principal component analysis (PCA). It's very popular. A lot of the algorithms you'd use in day-to-day work, for example many clustering algorithms, include a PCA dimensionality reduction step for the sake of scalability, so they run faster on high-dimensional data. It's very common, not only in biology but in any data science field, and it's also part of the pipeline you learned in module one. We're going to talk a little about how PCA works, and after that we'll move on to some nonlinear approaches, t-SNE and UMAP.

So let's look at this example. We have variable 1 on the x axis and variable 2 on the y axis, and each data point is an observation. What I want to do is capture the most information possible while going from two axes down to one. Where do you think I should draw the new axis, if I want to get the most information out of this data? Imagine I want to go from two dimensions to one, so you can have only a one-dimensional coordinate system. Where would you draw that line? I'm actually going to use the markers here. We could have it here, here, here, or here.
The last one? Everybody agrees? Yes, exactly. So why is that useful? Let me map the data here on the whiteboard. I'll show this in the slides as well, but I find markers and a classic whiteboard easier for learning. We have two dimensions and we want to go to one. These are the coordinates of each of our data points in the two-dimensional space. To go to a one-dimensional space, I get rid of those two axes and map each of these data points onto the new axis of information. So what happened here is that we went from two dimensions to one. Is that clear for everyone? That's basically the concept behind PCA and every other dimensionality reduction method.

The point is that in all these algorithms, it's not trivial to figure out, using math, where this line should be drawn. The whole point of PCA is that statisticians and mathematicians worked out a mathematical approach to find the coordinates of this new dimension that all the observations should be mapped onto.

So, as we discussed, in the new coordinate system the first axis of information is drawn along this axis of variation. That's basically PCA. Can someone explain a bit more why we chose this line instead of any of the other ones? It was intuitive, but can someone say it in statistical language: what statistical metric are we looking at? Regression? No, that's interesting, and it's related, but it's not exactly what we're doing here. What measure of the distribution: the mean, the variance? The variance of the distribution, right. We're basically looking at the variance of the data: among all the possible axes we could choose, which one maximizes the variance of the observations along it? That, under the hood, is how PCA works: we choose the axis that maximizes the variation along that axis.

And what is variance? Variance is basically how spread out the data points are. Imagine this is the center: variance is essentially the sum of the squared distances of the data points from that center. This axis has the highest spread. If we were minimizing instead, we'd pick something like this axis here, with very little spread among the data points. Does that make sense? Is it clear that PC1 here has the maximum variance? The data points are spread out along PC1; they aren't all packed densely in one region. PC1 would be your new x axis and PC2 your new y axis, and the data points, taken together, have high variance along PC1.
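If you want to see the "maximize the variance" idea in code, here's a minimal sketch with simulated points (not the whiteboard data): it scans every direction, measures the variance of the projections onto that direction, and confirms that the diagonal axis wins for a diagonal cloud.

```r
# Variance of the centered data projected onto the unit direction at
# angle theta; PC1 is the direction where this is largest.
proj_var <- function(theta, X) {
  d <- c(cos(theta), sin(theta))                  # unit direction vector
  var(as.vector(scale(X, scale = FALSE) %*% d))   # variance of projections
}

set.seed(1)
x <- rnorm(50)
X <- cbind(x = x, y = x + rnorm(50, sd = 0.3))    # elongated diagonal cloud

thetas <- seq(0, pi, length.out = 181)
best <- thetas[which.max(sapply(thetas, proj_var, X = X))]
best * 180 / pi   # close to 45 degrees: the diagonal axis maximizes variance
```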
That's conceptually how PCA works: it projects the data onto new axes, or dimensions. The first principal component is the axis with maximum variance, the second principal component has the second-highest variance, and that's how the third, fourth, and later components are found too: the first PC always captures the highest amount of variance in your data, the second PC the second-highest, the third PC the third-highest, and so on. And all of these axes are orthogonal to each other.

Let's learn some terminology. Mapping the data onto these new axes, as shown here and as I showed on the whiteboard, is called projection: we call the results projections. The amounts by which the original axes are rotated to produce the new ones are called the loadings, and we'll talk about that a little more in the next slides. Another way to look at loadings is that they are the correlations between the original variables and the unit-scaled components, and by "components" I mean these new axes. I'm just introducing the terminology here; we'll play around with it more in the next few slides.

So again, the takeaways: we rotated the data onto new axes, these new dimensions are called principal components (PC1, PC2, and so on), PC1 has the maximum variance, and PC2 has the second-highest variance. Note that here we're really only rotating the data: if you keep both PC1 and PC2, we're not exactly doing dimensionality reduction, because we started with two axes and ended up with two. This toy example is just for the sake of learning. In high dimensions, say when you have 10,000 genes, you keep only some of these new axes, and we'll also talk about how to choose the number of principal components to look at; there are some rule-of-thumb ways to do that. Each of the components will be orthogonal to the others; that's one of the properties of PCA.

[Another instructor adds:] One way to think of dimensionality reduction: say you have a cube, you put it in a hydraulic press, and you squish it down into a square. That's basically what this is, except you can't visualize it, because instead of three dimensions you have thousands: a thousand genes is a thousand dimensions, which is just way too much to deal with. You want to squish it down while keeping as much of the information as you can. That's why she's saying you want this line here: most of the variance, most of the information, lies along this line. You could draw a line over there instead, but that doesn't tell you much. The points vary along this direction, not that one. So we squish it down to fewer dimensions while retaining as much of the information as we possibly can. Does that help at all?

Right, so variance is a proxy for information here: higher variance means more information. Does that make sense?
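For the curious, here's a small sketch of what "under the hood" means, using random data (any numeric matrix would do): the loadings prcomp() returns are the eigenvectors of the covariance matrix, and the new axes really are orthogonal.

```r
set.seed(42)
X <- matrix(rnorm(100 * 4), nrow = 100, ncol = 4)

pca <- prcomp(X)        # defaults: centered, unscaled
eig <- eigen(cov(X))    # eigendecomposition of the covariance matrix

# Loadings match the covariance eigenvectors (up to sign flips)
max(abs(abs(pca$rotation) - abs(eig$vectors)))   # essentially 0

# The axes are orthogonal: t(R) %*% R is the identity matrix
round(t(pca$rotation) %*% pca$rotation, 10)
```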
Okay, so what are the potential applications of this? Again, this is a widely used method: it's applicable to both omics and non-omics datasets, and it shows you where the dominant structure of the data is. It's also very useful for identifying batches and other unwanted sources of variation, for example in surrogate variable analysis, where those unwanted effects are basically the surrogate variables. And it's very useful in machine learning: you usually want to train your model on a set of features that isn't too large, so there's usually a feature selection or feature extraction step involved, and PCA is used for that as well. It can lead to more accurate modeling.

So how would we do this in R? The base R function for the PCA algorithm is called prcomp(), and this is how you would apply prcomp() to your input data. You've worked with the mouse expression data before, right? Do any of you remember what its columns and rows were: were the genes the rows or the columns? Exactly. So that explains what we're doing here: the t() here is the transpose. The mouse expression data you looked at before has the features (the variables, the genes) as rows, but prcomp() assumes the features are columns, not rows. You should always be very careful about what a function assumes about the rows and columns of its input data.

The str() function shows the structure of the prcomp() output; we'll go into the details in a little bit. Basically, the output is a list of five elements. The first is sdev, which is the standard deviation of each of the principal components. Then we have rotation. Yes, rotation is the same thing as the loadings; we'll get to that. We have center, which holds the means of your variables; scale, which records the scaling applied to your variables; and x, which is the principal components, also called the score matrix. We'll go into the details of that a little later.

You can also use the summary() function to look at some features of the output. The first row gives you the standard deviations of your components, equal to the sdev we saw on the last slide. Then you have the proportion of variance, which is the variance explained by each principal component: you can see the first principal component has captured the highest amount of variance, the second PC the second-highest, and so on. On the third row you have the cumulative proportion: adding the first two proportions gives you this number, adding the first three gives you this one, and of course, as you would expect, the final element is 1, meaning you explain all the variance in the data once you use all the principal components. Does that make sense to everyone?
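Here's the idea as a runnable sketch. The object name mouse_expression is assumed, standing in for the course's genes-by-samples matrix; substitute whatever your matrix is called.

```r
# mouse_expression: genes as rows, samples as columns (assumed), so we
# transpose it -- prcomp() wants samples as rows, features as columns.
pca_out <- prcomp(t(mouse_expression), center = TRUE, scale. = FALSE)

str(pca_out)        # list of 5: sdev, rotation, center, scale, x
head(pca_out$sdev)  # standard deviation of each principal component
summary(pca_out)    # std dev, proportion of variance, cumulative proportion

# The proportions in summary() are just the normalized squared sdev values,
# which is also what the scree plot coming up next is built from:
pca_out$sdev^2 / sum(pca_out$sdev^2)
```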
So again, we're reducing the dimension of the data, but how do you actually choose the number of principal components to look at? For visualization we usually use principal components 1 and 2, but imagine you want to use PCA to explore your data, to identify interesting signals in it. How would you do that? We usually use a plot called a scree plot. On the x axis we have the dimensions, or principal components, and on the y axis we have the percentage of explained variance: these are basically the numbers we just saw, plotted. People usually look for the elbow in this curve. Here it's a little less clear, because PC1 has basically captured most of everything there is, but wherever a steep elbow is found is usually where you put your threshold for the number of principal components to keep.

So what are the different ways you can visualize principal component analysis results? This is probably the most common approach: a scatter plot showing each of your samples, your observations or data points, on PC1 and PC2. Here the data is again the mouse data you looked at. What interesting pattern do you recognize? Exactly: the samples that are similar to each other cluster together. And that makes sense, because PC1 has basically captured information from the most important genes, and PC2 has also captured some other interesting information that explains the variation among these samples. So we can easily see the clustering of the M and NC samples here.

Another way to visualize the results is a biplot. For this specific example it's a little complicated, so I'm going to move to a simpler example to explain how to read it. This is the iris data. I don't know if you've worked with it, but it's one of the base R datasets, and it gives you information about different plants. Can someone tell me what the variables are and what the samples are here? Exactly: these first four columns are your variables, measurements of the plants, and this last column is just some metadata. The rows go from one to six here only because I'm showing the first six rows; the dataset has 150 plants in it.

When you apply PCA to this original data, two of the most important outputs, the ones we quickly looked at before, are the rotation matrix and the x matrix. The x matrix (people use different terminology for this; another name is the score matrix) tells you the relationship between your samples, which you see here as rows, and your principal components. The second matrix is the loading matrix, which gives you the relationship between your principal components and your features, or variables, the plant measurements here.
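As a concrete sketch, here's that workflow on the iris data: the scree plot, then the two matrices we just named. Scaling the four measurements here is my choice, not something confirmed by the slides.

```r
iris_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Scree plot: percent variance per component; look for the elbow
plot(iris_pca$sdev^2 / sum(iris_pca$sdev^2) * 100, type = "b",
     xlab = "Principal component", ylab = "% variance explained")

head(iris_pca$x)    # score matrix: samples (rows) x PCs (columns)
iris_pca$rotation   # loading matrix: variables x PCs
```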
These scores, or weights, show the association between the original variables and the principal components. They kind of read like a recipe: PC1 is made from 0.5 of this variable, minus 0.2 of that variable, and so on.

So which of these matrices did we use to perform this visualization: the loading matrix or the score matrix? The score matrix, exactly, because that's the matrix showing us the relationship between samples and principal components. Does that make sense to everyone? What's not clear? These two are both outputs of PCA, and the score matrix is the one we usually look at; it's what is typically used for visualization. It was denoted x in the output of the prcomp() function that performed the PCA. It gives the score of each sample on each principal component. So we went from four-dimensional data (we had four features) down to two dimensions, and this is now our new coordinate system, PC1 and PC2. To visualize it, I put PC1 on the x axis and PC2 on the y axis.

And do you know what each of these data points represents? It can't be one of the four features, because there are clearly far more than four points. Each of them represents a sample. The matrix I'm showing here is actually quite long: I'm only showing the first six rows, but it has 150 data points in it. When you visualize it, you take the PC1 column as the x coordinates of the scatter plot and the PC2 column as the y coordinates.

Yes, question? Well, this dataset is not very high-dimensional; you only have four features. But in most biological data, especially genomics, you're usually working with thousands of genes, and it's just not feasible to visualize that many genes in two or three dimensions. That's why you need to shrink them down.

And here what I'm doing is coloring each data point based on the metadata column, which gives the species of each plant. What you can see clearly is that samples from the same species, the same group, cluster together.

So what are these arrows? This is where the biplot comes into play. You can also visualize your data using the loading matrix. Ignore the background points for a moment and just look at the arrows: you have PC1 and PC2 as your x and y axes, and you plot each of your features on this coordinate system, drawn as an arrow. That would be the loading plot: it shows where the features sit in your new coordinate system. And people sometimes put those two plots, the score plot and the loading plot, on top of each other, to look at the data from both aspects at once.
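Here's a minimal sketch of both plots in base R, reusing the iris_pca fit from the sketch above:

```r
# Score plot: samples on PC1 vs PC2, colored by the Species metadata
plot(iris_pca$x[, 1:2], col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2")

# Biplot: the score plot and the loading arrows overlaid, with a second
# set of axes (top/right) carrying the loading scale
biplot(iris_pca)
```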
So what would these vectors show you? For example, that petal length and petal width are highly correlated with each other: if you look at those two arrows, they basically lie on top of one another. You essentially read the angle between two arrows. Sepal length is also fairly correlated with those two, but this last feature, sepal width, is a little different: it's not as correlated with the other features as those three are with each other. So the loading plot gives you information about how your features relate to your principal components and to each other.

Why can't you just read off, say, that petal length sits at those coordinates? Right. No, you shouldn't. That's a great question. The way people usually draw it is with a second set of axes on the top and right: those carry the PC1 and PC2 values for the loadings, while the bottom and left axes are for the scores, and the two plots are overlaid on each other. If you have a lot of features, as in the mouse expression data from before, the biplot doesn't really tell you much, because there are so many arrows that you get overwhelmed and can't read it well. But if you're working with only a few features, the biplot can be more interesting to look at than the plain scatter plot.

Yeah, sure. So the question was: where do you see the variance explained by each of the principal components here? We're not directly visualizing that in this plot; that was visualized in the scree plot, and it's also one of the outputs of summary(). I guess one way to see it here is how spread out the data points are along PC1 and PC2: along PC1 the samples are spread far apart, whereas if I put, say, PC4 here instead of PC2, the data points wouldn't be nearly as spread out; they'd be quite close together. But this scatter plot usually isn't used for reading off the explained variance directly; people usually use the scree plot, the elbow plot, for that.
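And one quick numeric check of the arrow story above: overlapping arrows really do correspond to correlated features.

```r
round(cor(iris[, 1:4]), 2)
# Petal.Length and Petal.Width correlate at about 0.96 (their arrows
# overlap); Sepal.Width is the odd one out, negatively correlated with both.
```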
Okay. So PCA is very popular; again, it's one of the most famous dimensionality reduction methods out there. But when the structure of the data is very complicated, PCA might not be able to unravel the underlying structure, meaning that when you apply PCA to the data and visualize it, you don't really learn much about the data. I'll show you an example in two slides. When that happens, people usually try some nonlinear dimensionality reduction approaches, for example t-SNE and UMAP.

t-SNE stands for t-distributed stochastic neighbor embedding. Again, it's nonlinear, so it's used for data that cannot be separated by any straight line. Basically, it finds a few variables that represent many variables while preserving neighborhood distances. If you look at the mathematics behind it, what it's doing is looking at the high-dimensional data and at how the data points cluster in that high-dimensional space (for example, which samples cluster together when you look at 1,000 genes), and then it tries to preserve that clustering while mapping the data into two dimensions. We're not going to get into the mathematics today, because this is more of an applied workshop, but for both t-SNE and UMAP, that's the underlying intuition for how the algorithms work. Both t-SNE and UMAP are great for visualizing single-cell RNA-seq data.

Something else you need to know: PCA is solved analytically, meaning it has one single, unique solution, but t-SNE and UMAP are stochastic algorithms that are solved numerically, meaning the solution is reached over, say, a hundred or a few hundred iterations. Whenever you hear that an algorithm is solved numerically, you should know it can be sensitive to initialization. When you initialize it with one seed, it gives you one result; if you rerun the algorithm with another seed, it gives you another result. So it's not deterministic, and that's one of the disadvantages of t-SNE and UMAP. One of their other main differences from PCA is that they focus on local signal and neighborhood structure, versus PCA, which looks at more global structure and global signals by explaining the maximum variance in the data.

In R, or in any programming language you use, some algorithms use random sampling, which requires random number generation, and the way these random number generators work is that they start from a number. That number is called the seed. If you set your seed to, say, the number 10 (you can choose any number you want) and rerun the analysis, you will get the same result. But if you don't set the seed, that number gets sampled randomly, and your result will be different.

Yes, exactly: on the same dataset, your visualization would be different. Oh, sorry, I keep forgetting to repeat the questions. The question was about whether, if you're analyzing data from some paper and you use a different seed for initialization, your plot will differ. It can, yes. But usually, for the sake of reproducibility, people share their code, and if you look at their code, you can see what seed they're using. Thank you for reminding me.
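Here's a minimal sketch of what setting the seed buys you, using the Rtsne package (assumed installed); iris stands in for a real single-cell matrix.

```r
library(Rtsne)

X <- as.matrix(unique(iris[, 1:4]))   # t-SNE refuses duplicate rows

set.seed(10)
fit1 <- Rtsne(X)
set.seed(10)
fit2 <- Rtsne(X)
identical(fit1$Y, fit2$Y)   # TRUE: same seed, same embedding

fit3 <- Rtsne(X)            # no seed reset: a different layout this time
identical(fit1$Y, fit3$Y)   # FALSE
```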
Okay, we're a little short on time, so I'm going to be a little quicker. So again, this is single-cell data, and here I'm showing how PCA might not be sufficient for unraveling the underlying structure of your data. In single-cell data, each of these clusters would usually represent a cell type or a cell state, and each color represents a different cell type, by the way. You can see that with PCA it's very hard to read what's going on, but t-SNE is able to form separate clusters for the different cell types; if you just look at the PCA, it's very hard to interpret what it means.

The R package for t-SNE is called Rtsne. Something you need to know is that t-SNE has a parameter called perplexity, which basically determines how to balance attention to the local neighborhood versus the global structure. What that means in practice is that the smaller the perplexity parameter, the more the focus is on the neighborhood, and the denser your clusters will look. We can see the example on this slide: this is the iris data again, at perplexity 10, 20, 50, and 100, and you can see that as the perplexity parameter increases, the data gets more spread out.

And our final algorithm for dimensionality reduction, which is again nonlinear, is UMAP. UMAP stands for uniform manifold approximation and projection, and it's conceptually very similar to t-SNE; it also uses the concept of neighborhoods. The package we use for it is called umap, and if you apply it to the iris dataset, this is how it looks.

Okay, so we're done with the slides. I'm just going to quickly go through the tutorial.
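For the tutorial, here's a compact sketch of the perplexity sweep and a default UMAP run on the iris data. The Rtsne and umap packages are assumed to be installed, and your tutorial's settings may differ.

```r
library(Rtsne)
library(umap)

X <- as.matrix(unique(iris[, 1:4]))   # t-SNE disallows duplicate rows

# t-SNE at two perplexities: small values focus on local neighborhoods
par(mfrow = c(1, 3))
for (p in c(5, 40)) {
  set.seed(1)
  fit <- Rtsne(X, perplexity = p)
  plot(fit$Y, pch = 19, xlab = "tSNE1", ylab = "tSNE2",
       main = paste("perplexity =", p))
}

# UMAP with default settings on the same data
set.seed(1)
plot(umap(X)$layout, pch = 19, xlab = "UMAP1", ylab = "UMAP2", main = "UMAP")
```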