So, I was told that you are a final year engineering graduate and you picked up this project. Would you like to say a couple of words about it before we start?

So currently I'm working as an intern at Algo Asylum. My colleague Abhishek Deshpande and I are both interns, both final year students, and Shriram Karandekar is our mentor there, along with Varad Deshmukh, who is from the University of Colorado Boulder. We all love working on mathematics and data science, and we have been working on this project together. Varad and Shriram have been guiding us on it for the past few months. UMAP is a very interesting tool, and we thought, why not give a PyCon talk about it, because it's a powerful tool and every one of us could use it in our day-to-day work.

Yeah, sure. Looking forward to your talk; let me just put you on, and you can go right ahead. Okay, should I start? Yes, go ahead. Okay.

Hi, my name is Som Joshi and I'm here to present a talk about understanding UMAP. It's a state-of-the-art clustering algorithm, and in this talk we will explore how it can be used efficiently. I'd like to take a moment to mention my fellow colleagues, Abhishek Deshpande, Varad Deshmukh and Shriram. We have been working on this project because of our shared interest in mathematics and data science. So let's get started.

So what exactly is UMAP? It stands for Uniform Manifold Approximation and Projection. It's a topological data analysis and manifold learning technique. It's based on Riemannian geometry and developed as a well-defined black box in the form of a Python library. It's usually used for dimension reduction and unsupervised clustering, but also for supervised dimension reduction and metric learning. Now, this makes it seem like UMAP is a powerhouse of advanced and complex mathematics, and it is, but this talk is not aimed at delving into the mathematical realm. The main focus today is to take a practitioner's perspective on this tool and learn how to use it effectively. Since the algorithm itself is a black box, the major changes in the output come from changing a few input parameters. An uninformed approach would be to tweak these parameters and hope for a better result. But rather than relying on chance, we can put a basic knowledge of these parameters to good use. We need not understand the whole theory; just a sneak peek inside this black box helps a lot in understanding and applying this tool.

As a visual aid to see what UMAP does, let us consider this example of a UMAP application: the MNIST handwritten digits dataset. It's a standard dataset of 28 by 28 pixel images, and each image is flattened out to a vector with 784 dimensions. Ideally, when we run a clustering algorithm, we should get 10 clusters, one for each digit. The colors in these plots represent the classes, and we expect each color to be grouped in its own cluster. Running the K-means algorithm on this gives us the output on the left-hand side. But this is not exactly what we were looking for: while it gets some cases correct, for example the one on the right side, most of the other clusters are a mixed bag. UMAP, however, does a great job at separating these clusters. If you look at the image on the right, you can see that the clusters are clean, appropriately separated from each other, and there's no real overlap.
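As a rough illustration of this comparison (my own sketch, not part of the slides), the snippet below runs K-means and UMAP on scikit-learn's small 8x8 digits dataset, used here as a convenient stand-in for the 28x28 MNIST data; the full MNIST set can be fetched with fetch_openml("mnist_784") if you want something closer to the slide.

```python
# A minimal sketch (assumes umap-learn, scikit-learn and matplotlib are installed).
import matplotlib.pyplot as plt
import umap
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)        # 1797 images, 64 dimensions each

# K-means clusters the raw 64-dimensional vectors into 10 groups.
kmeans_labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X)

# UMAP projects the same vectors down to 2 dimensions.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

# Left: embedding colored by K-means cluster; right: colored by true digit class.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(embedding[:, 0], embedding[:, 1], c=kmeans_labels, cmap="tab10", s=4)
axes[0].set_title("Colored by K-means cluster")
axes[1].scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=4)
axes[1].set_title("Colored by true digit")
plt.show()
```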
In terms of performance, UMAP dominates the field, setting very high standards with faster run times and better results. If we consider common clustering algorithms, PCA and K-means are what standard introductions to clustering use, but they are fairly straightforward algorithms. t-SNE was considered the cutting-edge algorithm for clustering right until UMAP came along. We need not know much about t-SNE, only that it was consistently better than PCA and K-means in terms of both accuracy and performance. In the table on the right, we can clearly see how well UMAP does in comparison to t-SNE. For MNIST and Fashion-MNIST, it reduces the 784 dimensions in a little over a minute, whereas t-SNE takes around 15 to 20 minutes. And for an ultra-high number of dimensions, such as 200,000, UMAP performs nearly 19 times faster than t-SNE. As I said earlier, we can easily harness the power of this amazing algorithm, so let's now build a visual and functional understanding of the effects of the main parameters.

Let's consider a dataset which I think most of us would know: the Iris dataset. This representation is roughly how the algorithm sees it, just four dimensions and 150 data points. But we know that these data points belong to one of three clusters, color-coded according to the flower species. Let us adopt this color coding for the rest of the presentation for a better understanding of how UMAP performs. Since UMAP will be projecting the data into two dimensions, let us also look at the 2D projection of our dataset, but do remember that UMAP will be working with the full four-dimensional data. The 2D representation is quite straightforward: as we can see, all three groupings are clearly visible. Let's now see how UMAP works on it.

In general, we can describe UMAP as a four-step process. This is a broad overview of the full UMAP algorithm, and each step has a complex mathematical background to it, but we don't need to delve into the nitty-gritty details of the equations to understand what the algorithm does. For example, the first step in the process is constructing fuzzy simplicial sets. Now, this term sounds very mathematical, but it is quite straightforward to understand. What we do is build a set around each point such that it encompasses a fixed number of points. For simplicity, let's assume these sets are circles. And this is where our first parameter plays its part: n_neighbors. After constructing a set, we create something called a simplicial set, which in a very broad sense means that we create an edge between two points. Now we can meaningfully measure the distance between two points, and we can also assign weights to these edges depending on how far apart the points are. That is exactly what the fuzzy property means.

This is a visual representation of the fuzzy sets. As you can see, each set is formed as a circle that covers the n nearest neighbors; in this case, let's take that to be five. Consider these two cases: we can clearly see that the yellow point on the right-hand side has a much larger set, as it has to assume a larger radius to reach five neighbors, whereas the purple point has a much smaller circle, as there are more points in proximity due to the density of points there. Now, it may look like the yellow set contains more than five points, but remember that the data is four-dimensional and we are only looking at a two-dimensional projection, so this is something of an illusion: the set actually contains only five points.
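Before going step by step, here is a minimal sketch (my own illustration, assuming umap-learn is installed) of the end-to-end call on the Iris data: four dimensions in, two dimensions out, colored by species as on the slides.

```python
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)          # 150 points, 4 dimensions, 3 species

# Default parameters: n_neighbors=15, min_dist=0.1; fixed seed for repeatability.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X)       # shape (150, 2)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="viridis", s=15)
plt.title("UMAP projection of the Iris dataset")
plt.show()
```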
The next step in the process is the generation of a graph. With the data points as nodes and the newly formed edges from the fuzzy sets, we can fully define a graph in which each point is connected to its n nearest neighbors; in effect this is just a weighted adjacency matrix. The edges between two nodes are weighted thanks to our fuzzy sets, which assign weights going from 1 down towards 0 as the distance increases. These edges play an important role in clustering data points that are alike, since the weight of an edge directly signifies the importance of that connection. So this is the kind of graph formation we would expect: we identify which nodes are connected to each other in the higher-dimensional space, and the weight of an edge is proportional to the density near a point. For example, if you look at the center of the purple or green clusters, the edges there carry more weight compared to the isolated yellow points that we saw.

Note that all of these steps are still taking place in the higher-dimensional space, and the algorithm now has to plot the data into a representative two-dimensional space. The next step is to create a force-directed layout. This is where the data actually gets projected into two dimensions and the dimensionality reduction works its magic; after this, we will have a low-dimensional representation. In our graph, there are no physical locations associated with the data points; it's just a matrix, which has been plotted here for illustration. So when we want to project it onto the 2D space, the main question is: where do we place these points? The point of creating the force-directed layout is to get a correct and stable position for each point, and the edge weights tell us what forces are acting on a particular point. As a simpler analogy, consider that every edge coming out of a node is a kind of string pulling it in one direction. Applying forces from all directions, we try to find the balanced configuration in which, for every node, the pulls cancel out, giving us a stable configuration. So not only do we have the connections between the nodes, we can also define forces of attraction and repulsion based on these connections. This can be used on a force-directed graph to get a projection from the higher dimension to two dimensions, and we can observe the mappings made from the higher dimension to the 2D plane.

Going ahead, the final output of the plot is hence derived from the force-directed layout, plotted in the Euclidean or coordinate space. UMAP uses the force-directed layout to convert the higher-dimensional data into the coordinate plane, and this is the final output. But this is what the algorithm presents us with, with no understanding of the color coding that we initially used, and we see that it has formed some clusters. Now, if we assign the appropriate colors with respect to the original points, this is how it looks. And thus UMAP is successful in separating these clusters all by itself.
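As a small peek inside the black box (my own sketch, not from the talk), the fitted umap-learn model exposes this weighted graph, the fuzzy simplicial set, as a sparse matrix through its graph_ attribute, so you can inspect the edges and weights that the force-directed layout is later run on.

```python
import umap
from sklearn.datasets import load_iris

X = load_iris().data                       # 150 points, 4 dimensions

reducer = umap.UMAP(n_neighbors=5, random_state=42).fit(X)
graph = reducer.graph_.tocsr()             # sparse 150 x 150 weighted adjacency

print(graph.shape)                         # (150, 150)
print(graph.nnz)                           # number of weighted edges kept

# Edge weights are fuzzy membership strengths in (0, 1]: near neighbors get
# weights close to 1, farther neighbors fade towards 0.
row = graph.getrow(0)
print(row.indices)                         # neighbors of point 0
print(row.data)                            # weights of those edges
```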
So now we have a fair idea of how the algorithm works, but how do we control it? How do we optimize it so that it fits our dataset? Let's have a look at the variation of the input parameters. The parameters affect each step of this process in some way or the other, so let's rewind back to the first step and have a look at it.

The parameter which affects this step is n_neighbors. The construction of the fuzzy sets depends on its value, so let's see how increasing the number of neighbors changes our sets. A larger n_neighbors has led to the formation of a larger set for every point, and the sets are significantly bigger. If we compare the same point that we saw last time, we can see the difference. The choice of n determines how locally we wish to estimate our data. A small n means that we want a local interpretation, which more accurately captures fine detail in the structure and its variations. A large n means that we will miss this fine detail but gain a broad understanding of the global structure of the data. Consider what happens if we take n too high compared to the size of our dataset: we have 150 data points, and if we take 130 neighbors, we are effectively declaring that 130 points are alike, which makes no sense. So we must find a good balance for n_neighbors between the lowest and the highest possible values.

The graph formation directly depends on the fuzzy sets, so this step also relies heavily on our choice of n, and the resulting graph changes visibly with n_neighbors. On the right, we have the fuzzy sets which led to these graphs. For n equal to 5, the nodes are sparsely connected, as the lower number of neighbors dictates local connectivity. But as we increase n, the graph becomes denser: the number of edges increases, and the edges between closer nodes are drawn thicker because of their higher weights. As we increase n to 30, we can see connections now being made between the green and yellow clusters. The closeness of these clusters has induced the creation of edges between them and leads UMAP to believe that these clusters are similar in some way. Even the purple cluster has become denser, because more strongly weighted edges have been added to it.

When we come to the force-directed layout, we project from a higher dimension onto the two-dimensional plane. But the meaning of distance in the higher dimension and in two dimensions changes, and it does not match the Euclidean distance; actually, in the higher dimension the distances don't even match each other, but that's a piece of math trivia we don't want to go into. What we want is for the distance to be a standard Euclidean, or coordinate, distance. Here we introduce our second parameter, the minimum distance, or min_dist. Based on this distance, we take the Euclidean distance as our standard and use the steps I previously mentioned to strategically place the points in the 2D space. That is why it's called a force-directed layout: we actually force points into the final layout. And this is also influenced by n_neighbors; let us see how.

In a force-directed graph with n equal to 5, notice some details about the placement of the clusters: the black cluster and the white cluster seemingly intersect each other, and the red cluster is separated from them. Increasing the number of neighbors gives us a surprising insight, though. If we compare this with the force-directed layout for n_neighbors equal to 30, the difference is clearly visible in that the clusters are more tightly packed. But if you look closely, there are also a number of connections formed between the red and the white clusters. This went unnoticed in the previous case: the focus on local connectivity with n equal to 5 made us blind to the global structure, and the significant similarities between these points went unnoticed. So the number of neighbors plays an important role in preserving the significant structure of the data.
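Here is a minimal sketch (my own, not from the slides) of such an n_neighbors sweep on the Iris data, comparing a very local value, the default, and a more global one:

```python
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, n in zip(axes, (5, 15, 30)):
    # Only n_neighbors changes; min_dist stays at its default of 0.1.
    emb = umap.UMAP(n_neighbors=n, min_dist=0.1, random_state=42).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="viridis", s=10)
    ax.set_title(f"n_neighbors = {n}")
plt.show()
```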
Obviously, the final output that we get depends on both input parameters, so let's have a quick overview of these variations as well. This parameter overview chart varies n_neighbors along the x-axis and the minimum distance along the y-axis, and based on these different combinations we get different results out of UMAP.

Let's look at n_neighbors first, keeping the minimum distance at 0.1 and cycling through n_neighbors values of 5, 15 and 20. For a low number of neighbors, the algorithm over-clusters: the yellow and green clusters are split into many smaller clusters which are falsely recognized. This is mainly due to our low input, which forces UMAP to focus on the local structure and divide the major clusters because it misses the larger perspective. This over-localization is aggravated by the force-directed placement, which puts one of the purple clusters far away from the others. Now, if we increase n to 15, we notice that this fragmentation in the force-directed layout has been averted and the other clusters are well placed too. This is the default input for UMAP, and it generally works for most datasets, provided you have a moderate amount of data. But for a large dataset of, let's say, 500,000 or 1 million points, we may need to change it to make sure the value is significant in comparison to the total number of points. However, we can still notice that the purple clusters are somewhat apart, and since the algorithm cannot see these colors, it may misclassify them as separate clusters. If we further increase n to 20, two of the three small clusters in the purple patch get combined, and the larger neighborhood also brings the third cluster closer. But this poses another problem: the yellow and green clusters are now merging into each other. And this is where the minimum distance comes in.

So let's look at the minimum distance parameter and the effect it has. Let's keep n_neighbors constant at 20 and cycle through minimum distance values; since the values range from 0 to 1, we can take 0.1, 0.5 and 0.9 as our trial values. If we look at the previous plot, where min_dist was 0.1, the packing is very tight, and if we try to separate these clusters we can see that a significant amount of error has been introduced: many of the yellow points are crossing over to the green cluster, and many green points are also overlapped by yellow points. We reduce the density of the packing by increasing the minimum distance to 0.5, and we can see visible changes. If we observe the purple cluster, the separated part has loosened up and now appears as one single cluster. Still, we must take a balanced approach, because while increasing the minimum distance gives us a good separation between the two clusters, it is also still introducing some error. If we increase it further, this happens: even though we achieve a very fine separation between the clusters, with only one or two points passing over to the other side, the weak connection on the separated purple patch gets worse. By increasing the minimum distance, we force the weaker cluster connections farther apart, and while we do achieve a fine separation between distinct clusters, we may end up separating points within the same cluster. That is why it's very important to find a good balance for the minimum distance.
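A rough sketch (my own) of such a parameter overview grid on the Iris data, with n_neighbors varying across the columns and min_dist down the rows:

```python
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
neighbor_values = (5, 15, 20)
dist_values = (0.1, 0.5, 0.9)

fig, axes = plt.subplots(len(dist_values), len(neighbor_values), figsize=(12, 10))
for i, d in enumerate(dist_values):
    for j, n in enumerate(neighbor_values):
        emb = umap.UMAP(n_neighbors=n, min_dist=d, random_state=42).fit_transform(X)
        axes[i, j].scatter(emb[:, 0], emb[:, 1], c=y, cmap="viridis", s=8)
        axes[i, j].set_title(f"n_neighbors={n}, min_dist={d}", fontsize=9)
plt.tight_layout()
plt.show()
```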
For example, let's see something that we should not do. A low number of neighbors and a very high minimum distance is not a combination we want; we are basically ruining our output. The extreme focus on local structure makes us lose sight of how a single cluster should be placed, so the purple cluster gets pulled apart from the rest of the purple points, and the high value of min_dist makes the data sparser, forming a kind of meaningless spread of points. Note that these sorts of values could make sense for a particular dataset where, say, there are only 50 points, so that 5 neighbors is a good chunk, 10% of the total data. But as the total number of points increases, n_neighbors should remain representative of a significant similarity between points, and in this case the output goes completely against what we are trying to do.

Now, if we look at the major applications of UMAP in research, there is a study that applies it to population genetics to visualize population structure. The input data is from the UK Biobank, which holds genotype data. Genotype data from humans can have more than 10,000 dimensions, and moreover there are 500,000 data points, so this is a typical example of a large dataset. The first run on this data used PCA, which, as we know, also reduces it to two dimensions. But PCA is a linear algorithm, and while it does roughly get the corners right, where you can see the African, European and Asian ancestries, it finds it difficult to place certain mixed ancestries into the correct clusters; it forms a kind of trail which doesn't lead to separate clusters at all. UMAP does a great job here: it forms distinct clusters, and similar clusters are placed near each other. The prominent cluster here is the British population, and the Irish population can be seen as a subset just below it, which correctly reflects the geographical and ethnic variation in that area. Another example is the India and Pakistan clusters: since these populations are geographically nearby, they share a similar genome, and they sit towards the Asian ancestries. On the left side, we also see that the African populations, shown in red, and the Caribbean populations, shown in orange, share common origins. So although the force-directed placement puts points arbitrarily in the Euclidean space, it does a great job of maintaining the relative positions of these points, and we can get meaningful insights from our data. The PCA output, as we can see, does not provide such deep insight into what this data means, but in the UMAP output, the placement of the clusters, the clusters themselves and the trails connecting them all give us great insight into what the data suggests in terms of genomic similarity. If we consider the runtimes, t-SNE was also run on this, and even though it's a cutting-edge algorithm, it took two hours and 15 minutes on this dataset, so UMAP is about two hours faster than its best competitor. That comes from the hardcore mathematics and great optimizations underneath, and as I said, it's not really difficult to harness this power: with the right amount of data and the right knowledge of the parameters, you can do it.
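The UK Biobank genotype data is not something we can reproduce here, but as a small stand-in (my own sketch), the same kind of linear-versus-manifold comparison can be made on the digits data, with PCA on the left and UMAP on the right:

```python
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)

pca_emb = PCA(n_components=2).fit_transform(X)            # linear projection
umap_emb = umap.UMAP(random_state=42).fit_transform(X)    # manifold-based projection

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pca_emb[:, 0], pca_emb[:, 1], c=y, cmap="tab10", s=4)
axes[0].set_title("PCA")
axes[1].scatter(umap_emb[:, 0], umap_emb[:, 1], c=y, cmap="tab10", s=4)
axes[1].set_title("UMAP")
plt.show()
```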
So as a summary, the key takeaway from this talk is understanding the UMAP parameters. If an inexperienced programmer looks at this black box and decides to use it without any knowledge of the parameters, they'll usually resort to random trials of different values, and then the chances of getting fruitful results are basically down to luck. But if we understand the UMAP procedure and understand that it can be controlled by a few parameters, despite its mathematical complexity, then we can make the appropriate changes. The right perspective for clustering varies for each dataset: depending on the data, we might need to keep an eye out for finer detail in the local structure, or look at the global structure to see how things interact with each other. UMAP gives us control over whether we want to focus on local or global structure, and that is the power of the UMAP parameters. Even though it is a black box built on complex mathematics, we can know what effect these parameters have in the higher dimensions and control the result much better. This helps us tackle most high-dimensional datasets, and since it's a very fast algorithm, we get efficient usage and better results.

So that's it for the talk. You now know what UMAP is, and I hope you have fun trying out this amazing algorithm; I hope you now know which parameters to change. Thanks for attending. Our GitHub IDs are provided on the top right, and the links for the references and slides are on the bottom left; my colleague will also be posting them in the Hopin chat. And we are open to any questions, if anyone has any.

Yeah, sure. Thanks, Soham. That was a great talk and I personally learnt a thing or two from it. I think one of the things a lot of people asked was specifically how it is different from any other supervised learning technique. I think you clarified it already in your talk, but I'd like to re-emphasise that point, for example how you are using fuzzy simplicial sets for clustering. I think you already covered it, but would you like to touch upon it again?

In terms of the fuzzy clustering: the clustering itself doesn't happen in a fuzzy manner; UMAP uses a smoothed k-nearest-neighbor distance technique to find the nearest neighbors. The fuzzy open sets are formed using these smoothed k-NN distances, and we consider them only for making connections. As for supervised clustering, UMAP can be used for that as well, and it works great. If you head over to the Read the Docs documentation for UMAP, there is a lot of material on supervised as well as unsupervised learning with it, with excellent performance, and as we know, supervised learning will obviously give us better performance. It's also used for metric learning. I think Abhishek has also put the links out in the chat, so you can check those out.
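As a small illustration of the supervised mode mentioned above (my own sketch, following the umap-learn documentation on supervised dimension reduction), the target labels are simply passed as a second argument to fit or fit_transform:

```python
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

unsup = umap.UMAP(random_state=42).fit_transform(X)      # unsupervised embedding
sup = umap.UMAP(random_state=42).fit_transform(X, y)     # supervised: labels guide the layout

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(unsup[:, 0], unsup[:, 1], c=y, cmap="tab10", s=4)
axes[0].set_title("Unsupervised UMAP")
axes[1].scatter(sup[:, 0], sup[:, 1], c=y, cmap="tab10", s=4)
axes[1].set_title("Supervised UMAP")
plt.show()
```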
So this one is: can we use LIME or SHAP to get a gist of how exactly UMAP is performing internally, in case we need to understand why it does what it does?

If you want to go beyond just using it and actually dig into the internals, there is a great SciPy talk on UMAP by Leland McInnes, who created it, where he explains more details about the performance and the techniques. There is also the paper, which explains it in detail, and the UMAP Read the Docs, which quite clearly explains how things happen and why they happen. Those weren't really the content of today's talk, because it takes a practitioner's approach and going into the mathematics is out of scope here, but you can check those out.

Thank you for that. There is one more question, actually: how is the performance on sparse datasets?

UMAP actually performs amazingly on sparse datasets as well. It's obviously slower than on dense data, but on sparse datasets it performs better than other algorithms. Sometimes PCA does better than UMAP, because some straightforward datasets have a kind of linear logic to them and PCA may give results very quickly, but UMAP is good on most datasets, considering that a manifold is something we can stretch out into any form. The reason we construct the fuzzy simplicial sets is to create this manifold in such a way that points which are similar get closer to each other. A 2D plane is flat, but the manifold accommodates the variations, so even if the data points are far apart, the manifold folds in such a way that these points come closer to each other, and that's why it's so fast and so efficient.

That's quite interesting. I think there are many more questions; folks can probably reach out to you on Zulip. Sure, I'll be on Zulip, answering anything. Thanks a lot for the great talk. Thank you.
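As a closing footnote to the sparse-data question above (my own sketch, not from the Q&A): umap-learn accepts scipy sparse matrices directly, so a high-dimensional sparse matrix, here a bag-of-words matrix from scikit-learn used purely for illustration, can be embedded without densifying it first.

```python
import umap
from sklearn.datasets import fetch_20newsgroups_vectorized

data = fetch_20newsgroups_vectorized(subset="train")
X = data.data                                  # scipy.sparse CSR matrix, ~130k features
print(type(X), X.shape)

# Cosine distance is a common choice for sparse count/TF-IDF style features.
# Note: this is a fairly large fit and may take several minutes to run.
embedding = umap.UMAP(metric="cosine", random_state=42).fit_transform(X)
print(embedding.shape)                         # (n_documents, 2)
```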