Hello and welcome to the final lecture of this introduction to machine learning course, lecture 11. Today we continue talking about unsupervised learning, dimensionality reduction, and data visualization. We started last time with principal component analysis, and today we will talk about a method called t-SNE and related methods.

Just as a reminder: in the unsupervised learning setting we have a data matrix X, where samples are rows and features are columns. We are not trying to predict anything, so there is no Y to predict; instead, we are trying to find some interesting structure in the data. We imagine that each sample, each row of this matrix, is a point in a high-dimensional space, and in the dimensionality reduction problem we want to reduce the number of dimensions from this potentially high-dimensional space down to a low-dimensional one. Today we will be talking about two-dimensional data visualizations, so we will always reduce the dimensionality down to two, which can be plotted as a scatter plot. This is also sometimes called embedding: the task is to embed high-dimensional points into two dimensions so as to preserve some interesting structure in the data. For example, if the high-dimensional data contain three well-separated clusters, then ideally we would want to see three well-separated clusters in the embedding; and if more complicated structures are present in the high-dimensional data, we would ideally like to see those in the embedding as well. Not everything can be preserved, but we would like to preserve as much interesting structure as possible. That is the task of t-SNE and of today's lecture.

Let's talk briefly about what kinds of dimensionality reduction there are and how we can classify dimensionality reduction algorithms. A first possible classification is into unsupervised and supervised methods. Most of the time when we talk about dimensionality reduction we operate in the unsupervised setting, for example PCA from last week. But one can in principle consider supervised dimensionality reduction: linear discriminant analysis, which we discussed earlier in this course, can be understood as finding a linear projection that maximally separates the classes. That is dimensionality reduction guided by the class labels, so it is a supervised dimensionality reduction problem. Another possible distinction is between linear and non-linear methods. PCA is a linear method in the sense that we project the high-dimensional data onto a subspace, and projection is a linear operator, as we discussed last time. In principle one can imagine non-linear methods where the mapping from the high-dimensional to the low-dimensional space is non-linear; kernel PCA comes to mind. And finally, there are methods that do not construct any mapping from the high-dimensional space to the low-dimensional space at all: you have all these points in high dimensions, and you try to position points in two dimensions such that some important structure from the high dimensions is preserved in the embedding. This is what I will be calling a non-parametric method, because we never construct an explicit function that maps the high-dimensional space to the low-dimensional space.
Instead, we just optimize the low-dimensional positions of the points directly; in this sense it is a non-parametric method. Multidimensional scaling, which I will introduce in a minute, is one example of this, whereas PCA is clearly a parametric method, because there is an explicit function mapping the high-dimensional data to the low-dimensional data.

Just a note on terminology: in the literature, the methods I am going to talk about today, such as t-SNE, are often called non-linear dimensionality reduction. I find this a bit sloppy, because there is no mapping — there is no function that could be linear or non-linear — so I am not a fan of the term, but they are often called that. Examples of such methods are multidimensional scaling, t-SNE, UMAP (a more recent algorithm), and actually many others.

Before we get into the nuts and bolts, let me briefly show you where these things are used and why they are of any use at all. There are academic fields where these methods are used a lot and are very popular. One example is single-cell transcriptomics, also called single-cell RNA sequencing. This is a biological technique that was developed and gained popularity just in the last several years. Here the samples of our data matrix are single cells, and the features, the columns, are genes. In this particular paper, 500,000 cells from the mouse nervous system were profiled, and for each cell there is information about how strongly each gene in the mouse genome — say 25,000 genes — is expressed in that cell. That is the input data to, for example, a visualization algorithm that can then produce a picture like this, where each point is a cell, colored here by some biological identity of the cells, and you see a lot of potentially interesting structure shown by this t-SNE plot.

Another example is population genetics, where the samples are people and the features are so-called SNPs, single nucleotide polymorphisms, which tell you at which positions in the genome your genetic code differs from the average human genome. Again, every point in this visualization is a person, colored here by the ethnic origin of people from the UK; something like half a million people are depicted, and one can see a lot of very meaningful structure appearing in the plot. Here is the reference if you want to look it up. My third example is behavioral physiology: here each dot is a syllable sung by a songbird, a zebra finch I believe, while it is learning to sing a particular song, and one can see how the syllables evolve during training; the features are spectrogram bins corresponding to the syllables. And as a final, non-biological example, taken from this great paper: 15 million books embedded into two dimensions. Every point here is a book from a digital library, and the features are words.
So there are in fact millions of features here and millions of books — a very large and very sparse data matrix. You see that the books cluster by language, of course, because the words are very different; but if you look within the English-language books, you see very meaningful structure by topic, and one can zoom in, for example into the literature part, the fiction, and see further structure down the road. So again: a very complicated, very large dataset with a lot of complex structure hidden in the relationships between the books, and algorithms like t-SNE and related ones can turn this into a beautiful but also useful visualization and uncover structure that may be hidden there.

Most of the time today I am going to be talking about the MNIST data, a classic machine learning dataset consisting of images of handwritten digits. There are labels — ten different digits are possible — and there are 70,000 images in the entire dataset, each 28 by 28 pixels, so 784 pixels. The pixels are our features, and 70,000 is the sample size. We want to visualize this in two dimensions, and the method we already discussed is principal component analysis, so I can show you what the PCA of this data looks like. Here it is. In fact, quite some interesting structure can be seen here already: the digit one, for example, seems to be all grouped out here, the zeros are here on the right, and these three digits on top overlap a lot. If you look closely at which digits they are — 7, 9 and 4 — and think about the pixel representation, then 7, 9 and 4 really do share many of the same pixels, so it makes sense that they overlap here; some of these digits are also similar in terms of how they are written. So PCA works, in the sense that it shows us something meaningful, but perhaps one can do better. If you did not have the labels — imagine this as a black-and-white picture — you would have trouble, or it would simply not be possible, to look at this and say: ah, okay, there are ten different kinds of objects in this dataset. It is all blurred together, with the exception of maybe this slightly denser cluster of ones on the left.

I am going to start presenting what I called non-parametric methods with multidimensional scaling (MDS), which is a very old method from the middle of the 20th century, developed in the 50s and 60s by different people. The aim of MDS is to arrange points in two dimensions such that the pairwise distances between points are preserved as well as possible. This seems to make sense. Here is the loss function of MDS.
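Written out explicitly (this is the standard metric-MDS stress; the notation x for the high-dimensional points and y for their two-dimensional embedding coordinates is mine):

\[
\mathcal{L}_{\mathrm{MDS}} \;=\; \sum_{i<j} \bigl( \lVert x_i - x_j \rVert \;-\; \lVert y_i - y_j \rVert \bigr)^2 .
\]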
The first distances in the formula are the original pairwise distances: we compute the pairwise distance, Euclidean distance for example, between every pair of MNIST digits, and then we want to arrange points in two dimensions such that the Euclidean distances in the embedding are as close as possible to the original distances; the squared error over all pairs is our loss.

I can show you what the result looks like, and it is not really better than PCA in this case — it is not really impressive. Again the ones are close together here, and everything else is basically a mess. Another important point is that I am only showing 5,000 points here, whereas the PCA on the previous slide showed all 70,000. The reason is that it is very challenging to produce a multidimensional scaling embedding for a larger sample size: we need to compute pairwise distances between all pairs of points, which is quadratic in memory and probably in runtime as well. With all 70,000 images, the distance matrix is 70,000 by 70,000 — roughly 4.9 billion entries, which at 8 bytes each is on the order of 40 gigabytes — so even computing and storing this matrix is already prohibitive, or at least requires a lot of memory, and that is before the optimization has even started. So it is really hard to scale MDS up, and nobody really uses MDS for larger datasets. Presumably, if we could compute it on all 70,000 MNIST images we would see something similar; at least that is my guess.

So why does MDS not show us the structure that I will later show is actually present in this data? Why does MDS fail, in some sense, even though the loss function seems pretty reasonable? There is an interesting and important reason. It turns out that preserving high-dimensional distances in a low-dimensional embedding is just a bad idea. It sounds like a good idea, but it is a bad one, because it is simply not possible to preserve the distances: there is no way to arrange the points in 2D so that the distances are preserved. Let me illustrate this by computing the pairwise distances and plotting a histogram of them in the high-dimensional space, for this 5,000-point subset of the MNIST data. The histogram does not look striking at first — just some distribution — but note where zero is: there are no small distances. There are no two digits whose distance is very small in the high-dimensional space. Most of the distances are around three thousand, in whatever units these are, and the smallest ones maybe go down to one or two thousand. You cannot place points in 2D so that this holds: if you want some large distances between this point and that point, you can have that, but then there will inevitably be some points close together, with distances near zero. It is just not possible to place the points otherwise.
Let me illustrate this in a slightly different way. I am going to generate random Gaussian data in two dimensions — an uncorrelated Gaussian with unit variance — and compute the pairwise distances between all points. That is the distribution of pairwise distances one gets in two dimensions, and it makes sense: there are some distances that are very small, around zero, there is some average distance, and it goes up to maybe five for points on opposite sides of this two-dimensional Gaussian. Now let me show you what happens in ten dimensions. Again there is some average distance; the distances are larger — just think about computing the distance between ten-dimensional random vectors. But the key point, and one that is actually easy to understand if you think about how these distances are computed, is that there are no points with a very small distance: it would be very unlikely to generate a string of ten random Gaussian numbers twice and get a very small distance between them. That just does not happen. So all these distances live around ten, and the whole distribution looks basically shifted to the right. If I scale the dimensionality up to 500, here we are. Now, if these were the pairwise distances in the original data and you wanted to arrange the points in 2D to preserve these distances somehow, that is just not possible: you are trying to fit something that will look like the blue distribution to something that looks like the green distribution. Of course this fails, and that is why MDS usually does not produce an interesting or meaningful embedding.

So the key idea of t-SNE, and of the methods we can call neighbor embeddings, is that we give up on that entirely. We are not trying to preserve the distances anymore — that is not possible, so let's forget about it. We will aim to preserve the rank of the distances, or even more specifically, this very left part of the distribution. We will find neighbors in the high-dimensional data — neighbors being the pairs of points with small distances, the ones living on the left side of this green distribution — and we want to make sure they are mapped to the left part here, so that they are neighbors in two dimensions as well; and the rest should just be the rest. We want to make sure that the left part of this distribution is mapped to the left part of that distribution. That is what all of these methods are doing. This is the idea of what I am calling neighbor embeddings, following this paper, which was basically a landmark paper that suggested the idea for the first time and really was a game changer, I think. They suggested something they called stochastic neighbor embedding, or SNE. It is a very influential, important, beautiful paper.
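Before moving on: here is a minimal sketch, in code, of the random-Gaussian distance-histogram experiment from a moment ago (plain NumPy/SciPy/Matplotlib; the sample size and binning are arbitrary choices of mine):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 2000  # number of points; arbitrary choice

plt.figure(figsize=(6, 4))
for d in (2, 10, 500):
    X = rng.standard_normal((n, d))   # uncorrelated unit-variance Gaussian in d dimensions
    dists = pdist(X)                  # all pairwise Euclidean distances
    plt.hist(dists, bins=100, density=True, histtype="step", label=f"d = {d}")

plt.xlabel("pairwise Euclidean distance")
plt.ylabel("density")
plt.legend()
plt.title("Distance concentration: no small distances in high dimensions")
plt.show()
```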
In fact, what is much more often cited nowadays is the t-SNE paper, also co-authored by Geoffrey Hinton, which came out a few years later. It is basically one relatively minor modification of the SNE idea — we will discuss what it is a bit later — and that is the t-SNE method. You can see that it is cited a lot, and interestingly, the citation count just keeps increasing: more than half of these 20,000 citations came from the last two years. The reason, I think, is that more and more fields, for example in biology, have started to generate data that are very amenable to this kind of algorithm — very rich, very large data — and people in those fields like using these algorithms, which is why they keep gaining popularity. Back when it came out, ten years ago, it did not seem as useful at the time.

All right, let me show right away what t-SNE of MNIST looks like. This is the default t-SNE picture of MNIST, and it is beautiful: every digit is its own island, there is almost no overlap between different digits, and there is white space between them. We see that there are ten clusters in the data. Great.

How does it work? The idea of stochastic neighbor embedding — SNE and t-SNE — is that we want to preserve neighbors, and the loss function is the Kullback–Leibler divergence between what we will call pairwise similarities, or affinities, in the high-dimensional and in the low-dimensional space. Similarity is like the opposite of distance: similarity is large when the distance is small. The same goes for affinity: two points that are very close to each other, with a small distance, have a large affinity; points that are far away have zero or near-zero affinity. We will define these affinities between all pairs, and we will make sure that they sum to one. The high-dimensional similarities will be called p's, and the low-dimensional similarities will be called q's. Once that is done, the loss function is the KL divergence between them; you can see that if all p's are equal to all q's, the loss is zero, and that is what the algorithm tries to achieve.

Look at this illustration: I found some close neighbors of this point i. These pairs will have high affinity, while the affinity of point i to something far over here will be very small, near zero — or we can even set it to exactly zero to simplify things. So only these pairs have substantially non-zero p values. Now look at the loss. It is not symmetric: p and q do not enter symmetrically. We immediately see that we pay a high price if we take two close neighbors in the high-dimensional space and put them far apart in the embedding: such a pair has a large p value, so its logarithm term enters the loss with a large weight, and if q is small, that is the price you pay.
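For reference, written out explicitly, this loss is the KL divergence between the joint distributions P and Q over pairs (the standard SNE/t-SNE formulation):

\[
\mathcal{L} \;=\; \mathrm{KL}(P \,\|\, Q) \;=\; \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
\sum_{i \neq j} p_{ij} \;=\; \sum_{i \neq j} q_{ij} \;=\; 1 .
\]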
And what happens with the points that are far away to begin with? They hardly enter this loss explicitly, because their p values are nearly zero. So you pay a high price for putting close neighbors far apart. You do pay a price the other way around as well — it is not correct to say that there is no price for putting points that were far apart close together — and you can see it through the normalization: the q's are normalized to sum to one, as I mentioned. If you take points that were far apart and put them close together, they get a high q value, so you spend some of the q weight, which has to sum to one, on modelling a useless pair that does not even enter the loss. This normalization of the q's, as we will also see later, is the part that makes t-SNE want to keep far-away points far away.

So now that we have the loss, what I need to tell you is how to define p, how to define q, and then how to minimize this thing. Let's start with the high-dimensional similarities, the p values. What t-SNE does is essentially compute a Gaussian kernel: it takes the distance in the high-dimensional space — here the Euclidean distance, although it could in fact be any other distance, but today we will only talk about Euclidean distance — and exponentiates minus that distance squared. A Gaussian kernel: the larger the distance, the smaller the affinity. This is what one could call a directional affinity, so p of j given i is not equal to p of i given j, and the denominator just normalizes everything so that these values sum to one for each point. There is also the σ_i² term, the variance of the Gaussian kernel, and the width of the kernel is chosen adaptively to achieve a desired value of the so-called perplexity. I do not want to spend too much time explaining perplexity — you will see in a moment why — but think of it as the effective number of neighbors. If we are in a very dense part of the high-dimensional space, σ_i will be small, so that the Gaussian kernel covers approximately 30 neighbors; if we are in a very sparse region, it will be a fat Gaussian that again covers approximately 30 neighbors. So the kernel width is chosen adaptively such that around 30 affinities are large and everything else is much smaller — that is the intuition, and the perplexity parameter regulates this number. This is not symmetric, so we symmetrize it and divide by n, such that the entire p matrix sums to one. It is then symmetric by construction, sums to one by construction, and every point has around 30 large affinities. If you are a little confused by all this, that is okay, because I am going to show you later that it is actually not very important: one can define uniform similarities, which are much, much simpler.
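Before that, for completeness, here are the Gaussian affinities I just described, written out (the standard SNE/t-SNE definitions; perplexity 30 would be the default target):

\[
p_{j|i} \;=\; \frac{\exp\!\bigl(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\bigr)}{\sum_{k \neq i} \exp\!\bigl(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\bigr)},
\qquad
p_{ij} \;=\; \frac{p_{j|i} + p_{i|j}}{2n},
\]

where each σ_i is tuned (typically by binary search) so that the perplexity of the conditional distribution around point i — two to the power of its entropy — equals the chosen target value.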
The uniform similarities just say: for each point, take its 30 nearest neighbors and give it exactly the same affinity, one over 30, to each of them, and zero to everything else. This is what I will call uniform similarities: I replace those two lines with this very simple definition, and I will show later that in most cases it produces the same or a very similar result. With uniform similarities all affinities are the same; with the Gaussian ones they are not exactly the same — the closest point has a slightly larger affinity, points further away a slightly smaller one — but the important thing is that once you go beyond a hundred neighbors or so, everything is zero.

All right, that is how we define the high-dimensional similarities. The low-dimensional similarities are defined similarly: we compute the pairwise distances in the two-dimensional embedding, put each distance through a kernel, which I will show in a second, and divide by a normalization factor, which is just the sum of these kernel values over all pairs in the entire dataset. So this sums to one by construction and is symmetric by construction; we just need to choose the kernel. The original SNE paper used a Gaussian kernel here as well, and that concludes the setup of the SNE loss function. t-SNE made one change: it suggested using a t-distribution kernel, in this case also called a Cauchy kernel, instead — one over one plus the distance squared. That is the only difference between SNE and t-SNE. The Gaussian kernel decays exponentially; the Cauchy kernel decays as one over d squared, which is what is called heavy tails: if you plot the Gaussian kernel and the t-distribution kernel, the t-distribution kernel has heavier tails. We will get back to what this does to the embedding in a few slides.

What we need to discuss first is how to optimize all this. We now have the p's, the q's, and the KL divergence between them as the loss function; how do we minimize it? It turns out one can just use good old gradient descent, and it works. For example, one can start with a random configuration of points — we will talk about initialization a bit later — and then just run gradient descent, and that gives the final embedding.

Let's quickly work out how the gradient behaves. The loss is the sum of p times the logarithm of p over q, so it is actually two terms: the p-log-p term is a constant — we are not optimizing over it — so I remove it and do not even show it; what is left is minus p log q. Now remember that q was defined as this w, the kernel of the low-dimensional distance, divided by the normalization factor, so I can write it like that and decompose the loss into two terms, using the fact that the p's sum to one. Looking closely at these two terms, the first one can be interpreted as generating attractive forces in the embedding between points that were neighbors originally, and the second one generates repulsive forces. To see this, one does not even need to compute a derivative — this is not the gradient yet.
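Written out (standard t-SNE notation; w denotes the low-dimensional kernel and Z its normalization):

\[
q_{ij} = \frac{w_{ij}}{Z}, \qquad
w_{ij} = \bigl(1 + \lVert y_i - y_j \rVert^2\bigr)^{-1}, \qquad
Z = \sum_{k \neq l} w_{kl},
\]

(SNE would use \(w_{ij} = \exp(-\lVert y_i - y_j \rVert^2)\) instead), and the loss, up to the constant term, becomes

\[
\mathcal{L} \;=\; -\sum_{i \neq j} p_{ij} \log w_{ij} \;+\; \log \sum_{k \neq l} w_{kl},
\]

where the first term is the attractive part and the second, coming from the normalization, is the repulsive part.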
It is just a rewriting of the loss function. Look at the first term: wherever p is non-zero, this term should be made small, and because of the minus sign that means w has to be large, which means the distance has to be small. So wherever two points were neighbors, their distance should be as small as possible in the embedding — this term tries to pull neighbors together, which is what we want. What does the second term do? We want to minimize it too, and there is a plus sign in front, so we want every w to be small, which means all points feel repulsive forces. That is the balance. I will show you the t-SNE optimization in a second, and it basically works like a physical many-body simulation: the points fly around in two dimensions, neighboring points feel attraction and want to get close, but there is also a repulsive force between all pairs of points, originating from this normalization term, and the balance between attraction and repulsion in the end gives you the embedding.

To actually compute the gradient we do need to take the derivative, and if one does, then indeed the first term ends up giving attractive forces and the second term repulsive forces. On each iteration you compute these forces for every point, make a little step in the direction of the gradient — which just means you move all the points — then recompute the gradient, and so on. That is roughly how the optimization works for any of these methods; for multidimensional scaling it works similarly, and for t-SNE it also works out like that. So keep this picture of interacting points as the intuition behind the optimization.

Now I can show you the gradient descent optimization on the MNIST data; I will let it play a few times. Notice that we start from a random initialization — that is the Gaussian blob at the beginning — and then the points start moving and relatively quickly form islands of the same color, that is, the same digit, after which it gets progressively better, but more slowly. This salmon-colored cluster, for example, you can watch slowly coming together, and this violet cluster as well. The end result, though, is not as good as what I showed before. Why not? I modified the run a little — or, to say it better, I did not employ one trick that is usually employed; I will tell you about the trick in a second. But I think it is very important to see this and understand why it happens. These blue points and those blue points feel attraction — some of them are neighbors, they want to get together — but there is all this stuff in between, and they feel repulsion from it. So these two groups of points feel attraction.
They want to get closer, but they cannot, because of this stuff in between. So in the end you converge to a local minimum: moving these two groups closer would decrease the loss, and moving this entire island over there would actually be a better solution, but you cannot get there from here. You are stuck in a bad local minimum, and it happens because the repulsion is, in a way, too strong: it does not allow these pieces to come together even though they want to. That is the intuitive picture.

How can we fix that? In fact, already the original papers suggested a very useful trick. These clusters want to get together but cannot — so let's temporarily increase all attractive forces, hope that this lets them pass through the clusters in between and connect, and then decrease the attraction again. This trick is called early exaggeration, and here is the MNIST animation using early exaggeration. Now you see it is much better: everything is neatly separated, and importantly I started from the same random initialization as on the previous slide. It works because up to this point here the early exaggeration was on, and while it was on, the attraction was so strong that every digit, every cluster, could collect together — which is actually an interesting embedding in its own right; we will get back to it a bit later. Then the early exaggeration is turned off, and that is the moment when every cluster expands a bit, because now the attraction is weaker and the repulsion relatively stronger, and you get the final t-SNE result. It is important that the learning rate is large enough so that there is enough time during these early iterations — 250 gradient descent iterations by default — for the early exaggeration to do its job. There is a good heuristic for the learning rate, actually suggested quite recently, that does the trick.
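As a practical aside, these knobs are exposed in common implementations. Here is a minimal usage sketch with scikit-learn (the parameter names below are scikit-learn's, assuming a reasonably recent version; learning_rate="auto" implements a heuristic of the kind mentioned above, scaling the learning rate with the sample size):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(5000, 50)  # placeholder data; in practice, e.g., PCA-reduced MNIST

tsne = TSNE(
    n_components=2,
    perplexity=30,            # effective number of neighbors
    early_exaggeration=12,    # multiply attractive forces during the early phase
    learning_rate="auto",     # heuristic learning rate scaled with sample size
    init="pca",               # informative initialization (discussed later)
    random_state=42,
)
Y = tsne.fit_transform(X)     # Y is the (n_samples, 2) embedding
```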
Okay, one final but very important topic: if we want to scale t-SNE up — even to MNIST's 70,000 points, but certainly beyond, to 700,000 or 7 million points — we need to speed it up, because naively you would compute pairwise distances between all points in high and in low dimensions, which is quadratic in memory and in runtime: on each gradient descent iteration you would have to handle all pairwise forces, attraction as well as repulsion between all pairs. That is clearly not going to work, so we need to do something about the attractive forces and about the repulsive forces — and I already did, in those MNIST animations. Let's discuss them separately, attractive forces first.

The attractive forces are the easier part. As I keep saying, only a small number of similarities are large: even if you computed the entire distance matrix, all n² values, and put them through the Gaussian kernel, most of them would be around zero. So we simply do not compute those near-zero values. For each point we find a small set of nearest neighbors — this is called constructing a k-nearest-neighbor (kNN) graph of the data: a graph where every sample is a node, and there is an edge between two nodes if one is among the nearest neighbors of the other. For example, if you want to use perplexity 30, the standard default choice for t-SNE, then we take k three times larger: we find the 90 nearest neighbors of each point and compute the Gaussian similarities only between them — by the time you get to the 90th nearest neighbor, the affinity is essentially zero anyway — and for everything else we just set it to exactly zero. This makes the optimization a lot faster, because you no longer have n² attractive forces but roughly n times 90.

An even larger gain — or maybe just another large gain, I am not sure which one is bigger — comes from using an approximate kNN graph. It turns out, and this is something I do not have time to cover today, that there are several different algorithms for constructing a kNN graph approximately. Approximate means there can be errors: maybe you find 90 neighbors for each point, but a few of them are not truly among the 90 nearest neighbors. It works well enough that it makes no difference for t-SNE — the result is just as good — and these algorithms are much, much faster than finding the exact kNN graph. So that is how we deal with the attractive forces.
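Here is a small sketch of this step. For clarity it uses scikit-learn's exact nearest-neighbor search; at scale one would swap in an approximate library such as Annoy or pynndescent. The choice k = 3 × perplexity follows the convention described above:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10000, 50)   # placeholder data matrix (samples x features)
perplexity = 30
k = 3 * perplexity               # 90 neighbors per point

# Exact kNN graph; approximate methods (Annoy, pynndescent, ...) replace this at scale.
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, indices = nn.kneighbors(X)
# Each row of `indices` lists the k nearest neighbors of one point
# (when querying the training set itself, the first neighbor is the point itself).

# Sparse attraction: affinities are computed only along these n*k edges,
# everything else is treated as exactly zero.
print(indices.shape)   # (10000, 90)
```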
For the repulsive forces, that is a whole large topic that I do not have time to discuss in detail, but different approaches have been suggested over the years for computing these repulsive forces approximately instead of computing all n² of them exactly, with recent ones actually having linear complexity. I do not have time to explain how those work; I will very briefly explain the Barnes–Hut method — even though you should not really use Barnes–Hut anymore, because there are methods that are much faster — simply because it is a bit easier to explain, and on this slide I will just try to give you the gist of it.

So imagine these are your points. I should say that the Barnes–Hut technique was developed in computational physics to solve many-body simulation problems, and it can be used directly here. You construct a partitioning of the space such that dense regions are partitioned more finely and sparse regions more coarsely. Then, once this is constructed, if you need to compute the sum of the repulsive forces that a given point feels — say this point down here — then instead of summing over all those points individually, you can coarse-grain: for all the points over there, the repulsion this point feels is roughly the repulsion it would feel from a single heavy point, so you compute just one term for that whole cell. The closer you get to the point in question, the more fine-grained the grouping becomes. This can be implemented efficiently and works pretty fast, and it allows one to actually embed things like MNIST — even though, as I said, there are even faster approximations now. But that is the general idea: you approximate the sum of the repulsive forces that each point feels.

Okay, great — with this I am done with the technical part: I have explained the loss function and how to optimize it. Now let's discuss the various parameters of t-SNE and what they do to the embeddings. Traditionally, I think, most people consider perplexity to be the main parameter one can adjust in t-SNE. Again, it essentially regulates the k in the kNN graph — how many neighbors each point feels attraction to. This is the same picture of MNIST with perplexity 30; let me show you what happens with a much lower and a much higher perplexity. With a lower perplexity there are fewer attractive forces, so everything inflates and looks a bit like soap bubbles, I think. With a higher perplexity there are attractive forces between more distant points, so larger clusters, so to say, come closer together in the embedding. But the thing is that very small values are rarely useful — it usually looks like this, and it is not very useful, so using a perplexity an order of magnitude smaller than 30 is almost never helpful — and using a perplexity much larger than 30 is almost always impractical or even completely prohibitive computationally, because the larger the number of neighbors you want to keep track of, the more attractive forces you have to deal with. If you have a dataset of a million points and you want a perplexity of a hundred thousand, it will just not work: it will not fit in memory and it will never converge.
So if you have a large enough dataset, you are basically stuck with using a perplexity of 30, or 50, or 100 — which makes little difference for a large dataset — and you cannot increase it enough to start seeing something qualitatively different. In most practical cases, perplexity is therefore not a parameter you can meaningfully vary, at least in my experience.

Ah — and here is the thing I promised to show you about the affinities. This is again the default t-SNE with perplexity 30, and this is a t-SNE I made with uniform affinities in the high-dimensional space with 15 neighbors: for each point I find its 15 nearest neighbors and give it the same affinity value to all 15. You have to look very closely to spot any difference — even just to make sure I am not showing you the same picture twice by mistake. Look, for example, at this yellow cluster: it looks a little different, so it is not the same picture, but it is very similar. You can also take perplexity 300 versus uniform similarities with k = 150, and they will again look very similar. This does not have to hold mathematically, but in practice, for most datasets, it holds very well. So this whole business of Gaussian affinities in the high-dimensional space is actually not that important: you can just take uniform affinities over 15 nearest neighbors and the rest works the same way.

Another important ingredient is what the similarity kernel does in the low-dimensional embedding. I mentioned before that the original SNE paper used a Gaussian kernel and that t-SNE uses a Cauchy kernel. The t-SNE paper made a big deal out of this: they said they were addressing something they call the crowding problem of SNE by replacing the kernel. The intuition they present in the 2008 paper is that sometimes the loss function wants to preserve, say, the 15 nearest neighbors of a particular point, but there is not enough room in a two-dimensional space to keep all of these points close, so some sacrifice has to be made and some of the nearest neighbors must be placed a bit further away. If the kernel is Gaussian, you pay a high price for that. So the argument is: let's take a kernel that decays more slowly, a heavier-tailed kernel than the Gaussian, and then some of the nearest neighbors can be allowed to move a little further away without the q value dropping by much. The heavy-tailed kernel is more permissive, so to say, and the crowding problem is thereby addressed. The funny thing is that the original paper does not actually show, for example, the SNE embedding of the entire MNIST, and it was not until recently that we implemented SNE in a modern t-SNE implementation to see what happens. So here it is: this is the SNE result on the entire MNIST — the Gaussian kernel — and this is the t-SNE result, the Cauchy kernel.
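To make "heavier tails" concrete, here is a tiny sketch comparing the Gaussian kernel with the Cauchy kernel, plus an even heavier-tailed variant of the kind discussed on the next slide; the (1 + d²/α)^(−α) form is just one convenient way of parametrizing such a family (α → ∞ recovers the Gaussian, α = 1 the Cauchy kernel):

```python
import numpy as np
import matplotlib.pyplot as plt

d = np.linspace(0, 6, 400)                 # low-dimensional distance

gaussian = np.exp(-d**2)                   # SNE kernel
cauchy = 1.0 / (1.0 + d**2)                # t-SNE kernel (t-distribution with 1 dof)
heavier = (1.0 + d**2 / 0.5) ** (-0.5)     # alpha = 0.5: even heavier tails

plt.plot(d, gaussian, label="Gaussian (SNE)")
plt.plot(d, cauchy, label="Cauchy (t-SNE)")
plt.plot(d, heavier, label="heavier-tailed (alpha = 0.5)")
plt.yscale("log")                          # log scale makes the tail behavior visible
plt.xlabel("embedding distance d")
plt.ylabel("kernel value w(d)")
plt.legend()
plt.show()
```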
What you see in the SNE picture is, in a sense, the crowding problem: there is some overlap between the clusters and very little or no white space between them — the white space is something you get with t-SNE. An interesting way to think about this is as a family of kernels: the Gaussian is a t-distribution with infinite degrees of freedom, if you know that from your statistics classes, and the Cauchy kernel is a t-distribution with one degree of freedom. One can vary the shape of this kernel, making it more or less heavy-tailed, and see what happens to the embedding — imagine moving from this embedding over to this one. But one can make the kernel even heavier-tailed than the Cauchy, and if you do, you get a picture like this, where an interesting thing happens: each digit splits into finer clusters, so you see a finer cluster structure. And the interesting part is that if you look at which images form a given cluster, then at least in some cases these are meaningful islands. For example, a four can be written open on top, or closed on top the way it is printed; some people write it one way and some the other, and one of these islands corresponds to one kind of handwriting and another island to the other. So one can show that at least some of these islands are meaningful, and one can show the same on other datasets.

That is something I find even more remarkable. Here is, for example, an analysis of the same digital-library dataset I mentioned in the beginning, but a subset of it: the Russian-language books. That is about 400,000 points, colored by the year the book was printed. Russian orthography changed after the 1917 revolution, which is why all pre-1917 books are over there and the later books over here. The interesting thing is that if you decrease this parameter, that is, make the kernel more heavy-tailed, you get all these islands. Are they meaningful or not? Well, we can find, for example, all the poetry books, and it turns out the poetry books are all over here — this island is the poetry island, which makes a lot of sense. In the default t-SNE the poetry books are also grouped together, but it does not separate them from the rest visually. And all the math books, which you can also find simply by keywords, concentrate in these islands — this is the math island — whereas in the default t-SNE they are all in this corner, but if you did not know they were math books, you would not suspect that this is something separate. So what happens is that the heavier-tailed kernel emphasizes this fine cluster structure.

Something else I said before: the t-SNE loss function is explicitly constructed to preserve the local structure of the data — if points are nearest neighbors, t-SNE tries to keep them nearest neighbors in the embedding. At the same time, t-SNE will often struggle to preserve the global structure of the data. Another way to formulate this: you run gradient descent from some initialization, but the loss function has many local minima, and sometimes they are bad local minima. I mentioned this before — remember the first time I showed the MNIST animation.
It was stuck in a local minimum. That one we could solve with the early exaggeration trick, but in some cases the initialization plays a large role as well. To show this I can use a very simple toy dataset, where the data is just a two-dimensional circle with some noise added. If we run t-SNE on this data, we are taking two-dimensional data and embedding it into two dimensions — of course this does not make a lot of sense, but for the sake of the toy example I can do it. Here is the default t-SNE with random initialization on this data. It is actually a nice picture, I think; it looks like a knot. But clearly the global structure is messed up. If you instead use PCA — the first two principal components, which in this case just coincide with the data itself — to initialize t-SNE, then it converges to something like this. So these are two embeddings that gradient descent converged to: this one is a local minimum of the loss function, and this one is also a local minimum, but that one is a bad local minimum and this one is a better one. I think the take-home message is that it absolutely makes sense to use an informative initialization, for example PCA — there are other choices, but PCA is one — to initialize the embedding. Why not? You will converge to a better local minimum. And that is true not only for t-SNE but for any kind of neighbor embedding algorithm.

I have an animation that shows what happens if you optimize the circle data with six different random initializations. I think it is fun to watch, because you can see, very slowly, how during the early exaggeration phase the knots are unwrapping — an interesting phenomenon — but very, very slowly; and then, once the early exaggeration is turned off (it will happen in a second), you will see that they stop unwrapping, and the overlapping parts make gaps because of the repulsion they feel, and in the end you get these six random knots. The interesting thing happening during early exaggeration, this slow unwrapping, is because one can show a mathematical relationship between strong early exaggeration and a technique called Laplacian eigenmaps, which I do not have time to introduce properly today. That explains why it slowly unwraps, but it takes a long time to unwrap fully, so in this case you end up with knots like that.
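As a practical note, most implementations let you request an informative initialization directly, or pass your own coordinates. A minimal sketch (scikit-learn shown here; openTSNE and other libraries have analogous options — treat the exact argument names as implementation-specific, and the rescaling of the PCA coordinates below is a common convention rather than a requirement):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(5000, 50)          # placeholder data matrix

# Option 1: let TSNE compute the PCA initialization itself.
Y = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)

# Option 2: compute the initialization explicitly and pass the coordinates.
init = PCA(n_components=2).fit_transform(X)
init = init / np.std(init[:, 0]) * 1e-4   # rescale to a small spread (common convention)
Y2 = TSNE(n_components=2, init=init, random_state=0).fit_transform(X)
```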
And this does not only happen in this funny two-dimensional toy example; let me show a real-world application where the same thing happens. This is a single-cell transcriptomic dataset with a lot of different clusters — that is the paper I took the dataset from. Even though there are many, many different clusters, there are three very broad groups: these are inhibitory neurons, these are excitatory neurons, and these are non-neural cells in the mouse cortex, but that does not matter for now. What matters is that if you do PCA of the data, you immediately see that there are three very distinct groups. But if you do t-SNE, you see maybe a hundred small islands, clusters in the data, and all these cells — for example the gray-brown ones — form a bunch of different clusters, and those clusters end up in different parts of the embedding. I am using early exaggeration and everything else as I should here, and it still ends up like that, because of the random initialization: if I rerun with a different random initialization, I see the same islands, but positioned differently. So what one can do, as I already suggested, is simply initialize t-SNE with PCA — why not — so that the global structure is already present in the initialization, and then let it converge to whatever it converges to. Let me show what happens: we start with PCA, here is the early exaggeration phase, now it is over, and this is the normal t-SNE optimization phase. All the non-neural cells started in this corner of the initialization, and of course they end up in that part of the embedding, together; all the excitatory cells start on the right and still occupy the bottom-right part of the embedding in the end. So again, the general recommendation: always use an informative initialization — in fact, modern t-SNE implementations do this by default.

The last part I want to cover today is the effect of exaggeration. I mentioned before that early exaggeration means multiplying the attractive forces by some factor during the early iterations, a very useful trick for getting better convergence. But what if we kept the exaggeration throughout the optimization — what if we did not switch it off at the end, and what if we used values other than the default 12? Maybe you already noticed in my animations that the early exaggeration phase itself produces some interesting embeddings, before I switch exaggeration off and end up with the t-SNE result. We can study what one gets with different exaggeration values, and it turns out one gets a very interesting family of embeddings. So let's discuss that briefly. This, again, is default t-SNE, with no exaggeration at the end. Now imagine I run the early exaggeration phase and then, after it is done, I keep an exaggeration of 4 until the end: I end up with this embedding over here. And if I use exaggeration 30 throughout, I get this embedding on the left. That is the spectrum of embeddings: the attractive forces are stronger on this side and weaker on that side, so the repulsion is relatively stronger on the right — which is why I call this the attraction–repulsion balance.
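If you want to play with this yourself: to my knowledge, the openTSNE library exposes exaggeration directly. The parameter names below are my best recollection of its API and should be checked against the documentation:

```python
import numpy as np
from openTSNE import TSNE  # https://github.com/pavlin-policar/openTSNE

X = np.random.rand(5000, 50)   # placeholder data

# Keep an exaggeration factor of 4 after the early exaggeration phase,
# instead of dropping back to 1 as standard t-SNE does.
tsne = TSNE(
    perplexity=30,
    initialization="pca",
    exaggeration=4,       # exaggeration used for the main optimization phase
    random_state=0,
)
Y = tsne.fit(X)            # returns the embedding (array-like of shape (n, 2))
```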
It turns out one gets very interesting embeddings along the way. For example, look at this one. The first feature is that there is a lot of white space, which can be useful, or aesthetically pleasing, or simply emphasize that these are distinct clusters. Another interesting feature is that, for example, these three clusters sit together, and these three as well; and if we look at what they are, this is 8, 5 and 3, which overlap in pixel space, and the same is true for 7, 9 and 4 over here. So we get larger cluster separation, but at the same time larger groups of clusters — groups connected by nearest-neighbor edges — attract each other more strongly and collect into these bigger groups. This makes a lot of sense: we increased the attraction, so everything becomes denser, points get closer to each other, these two digits also feel increased attraction and basically glue together, and the rest somehow balances out against the repulsion. If we increase the attractive forces even more, then all digits feel at least some attraction and glue together into something like this, which no longer has white space in between.

We analyzed a bunch of different datasets — you can look it up in this preprint from last year — and showed that this always happens. We interpret it as more continuous structures being emphasized on the left end of this spectrum and more discrete structures on the right. MNIST does not have many continuous structures, so on the next slide I will show another dataset where this is clearer; but already in MNIST it is clear that larger-scale structures get emphasized when we increase the attraction, whereas the default end of the spectrum emphasizes very local structure: for example, the threes seem to split into two different groups in this embedding here, and that is completely lost over there, where the attraction is stronger.

So one point is that this is an interesting hyperparameter: empirically, it often produces useful embeddings. A separate point is that, perhaps surprisingly, several other algorithms developed in the last few years produce embeddings that land somewhere on this spectrum. UMAP, for example, is a method closely related to t-SNE but working quite differently — it involves stochastic optimization and so on; I am not going to present it in detail. It appeared a few years ago and became very popular in some fields; the single-cell community, for example, uses UMAP a lot now. But once you run it and look at the embedding, the UMAP of MNIST looks almost identical to this picture — that is basically UMAP of MNIST, give or take — and this happens not only for MNIST but across a range of very different datasets. One can also analyze the UMAP loss function mathematically and show that it looks very similar to t-SNE, but with stronger attraction. And this is true not only for UMAP, but for several other methods as well.
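For completeness, running UMAP looks like this (using the umap-learn package; the parameter values below are just its common defaults):

```python
import numpy as np
import umap  # pip install umap-learn

X = np.random.rand(5000, 50)   # placeholder data

# n_neighbors plays a role loosely analogous to perplexity / k in the kNN graph.
Y = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)
print(Y.shape)   # (5000, 2)
```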
These methods produce outcomes that lie somewhere on this spectrum, always to the left of t-SNE, at least for the methods I know: some correspond to an exaggeration of about 4, some other methods maybe to an exaggeration of 30, but they live on this spectrum. That is pretty interesting: the methods may be quite different, but in the end this seems to be the meaningful family of embeddings one gets with different algorithms.

And here is my last example for today. This is a very large single-cell transcriptomic study with two million cells; the cells come from mouse embryos during development. That is the original citation, and we used this dataset in our paper to play around with t-SNE. Here I am showing the t-SNE plot taken from the original paper — the original authors ran t-SNE themselves — and they also clustered the data into a bunch of clusters, which are the colors I am showing. If one runs t-SNE with a high enough learning rate and PCA initialization — a high enough learning rate, as I briefly mentioned before, is needed for the early exaggeration to work properly — then one gets a picture like this, and if you look closely, it is a much better result than the one on the left. For example, there is this pink cluster, number 15 I think, which on the left is split into three parts — one, two, three — that belong to the same cluster but appear in three different places. That does not make sense, and it is exactly a bad local minimum: the early exaggeration, or the learning rate during that phase, was not strong enough for these pieces to collect together. So if we set the optimization parameters right, we get this much better result.

But that is not the most interesting part. The most interesting part, I think, is what happens if you increase the exaggeration — not the early exaggeration, but the exaggeration kept throughout. This is, let's say, the default t-SNE with no exaggeration at the end but with the optimization parameters set correctly, and this is what you get if you use exaggeration 4. Incidentally, if you run UMAP on this data, you get a very similar result to that. What happens is that larger structures appear: over here you would have no idea that there are maybe two very large continents in this data, whereas here it is immediately apparent. If we look at this slightly larger continent, these are actually cells corresponding to neural development in the mouse embryo. Moreover, if I show you the labels of all the clusters that have something to do with neurons, you see that there is a progression from very early, so-called glial cells, which give rise to neural progenitors, which later develop into mature neurons. So there is neural development going from the bottom up to the top here — just the progression of the cells during mouse embryogenesis. It is pretty cool that we see this one-dimensional time axis in the embedding. And the interesting thing is that you cannot really see it over here: if you know where to look, you can see that the progression starts somewhere here, goes like that to this orange cluster and up to the top — this is the time axis of neural development — but if you do not know that, you will never see it here.
So this is this neural This is this time access of neural development, but if you don't know that you will never see this here on The other hand Some things can be seen on the right that cannot be seen on the left For example, there are all these small islands here, which I don't know if they are biologically meaningful or not But they the data Suggest that there are some small fine clusters in the data that one sees in in in the actual default T-SNE over here, but here would increase the traction. This is just gets collapsed together So you don't see finest clusters anymore, but You see actually larger scale structures that is interesting and and in this case we know from biological knowledge that a priori knowledge that there should be this continuous sub manifold in the original data So I think the the way to think about that is the continuity discreteness trade-off that one gets with higher traction You emphasize continuous Continuous sub manifolds with higher repulsion You get more cluster structure emphasized And I'm going to end here on this slide, but I will say in the end that actually I think This this field of two-dimensional embeddings and visualization of complicated high-dimensional data By complicated I mean data that have some maybe continuous sub manifolds in it, but also cluster structures and maybe cells During development like you know split into several evolutionary Branches So there can be not only one-dimensional Manifolds, but some kind of tree manifolds in the In the original data, but there can also be clusters. So that's what I mean by complicated data And embedding this complicated data into two-dimensions faithfully is a very complicated problem that T-SNE solves Great, but there definitely can be further improvements For example a very I think Interesting question that we just see on this slide is that this embedding tells us something important about the data This embedding also tells us something important about the data. Can we somehow have the two things combined into one embedding? That would show us both the continuous structure and the cluster structure and have some wide space in between large Distinct areas And so on and so forth. So I think this field is definitely will see More exciting developments in the in the years to come as more and more data sets also become available Like this one Where where this can be applied Thank you