So, I currently work as a data scientist in the server security space, and I'm also an architect, so I make sure the data science we build also works in production and actually solves a real-world problem. Talking about the real world: all of us working in data science in an enterprise are, one way or another, solving these problems. People come to us and say, you're the expert, solve this machine learning problem and give us some answers. They do come to us, but as everybody knows, we almost never get labeled data. You don't get labels; life is pretty hard for a data scientist. So you turn to unsupervised techniques, and most often they don't work straight out of the box, so you then have to develop new things to make them work in the real world. That's what this talk is about: going beyond the usual, solving the real-world problems, working through those unsupervised machine learning issues, and finally getting the thing to work.

I'll quickly go through the agenda. There are three main themes I want to cover. First, the motivation: data geometry. What do I mean by data geometry, and how can we understand the geometry of data? I'll start with an analytical part; that's mostly theory, and it will be short, so don't worry, I won't bore you with theory for long. Then comes the applied machine learning part, where we take a few examples, understand how a manifold works, what spaces are, and work through the examples. Last is the technology part: how to do the same things on big data.

Let me start with the motivation. In big data, the main concern is that we have to discover interesting patterns and uncover unknown knowledge from the data, because people just throw the data at us and say: find some patterns, find something. The typical way is clustering: find homogeneous groups, clusters. Homogeneity can be defined mathematically in terms of similarity or distance; I'll talk about them in more detail later. In short, if you have some notion of similarity, of how close or far apart two data objects are, then you can use it to do two things. One, create clusters that are homogeneous: collect the points that are very similar to each other under some notion of similarity. Similarity could be anything: some of their properties, the fact that they refer to the same thing, anything; I'll come to the details in the following slides. The other part is to separate the different clusters from each other, the heterogeneity. Those are the two main principles we want to get out of a cluster analysis.

Now, talking about geometry: where does geometry come into the data? I'll pick a few points from research people have done worldwide. One of the most important findings is the curse of dimensionality. We live in a three-dimensional world; we see the world as a Euclidean space with three dimensions, and that's where our senses and intuition apply. But for high-dimensional data the same intuition doesn't work. Why? Because in high dimensions, most of the mass is actually not near the mean, unlike what a typical Gaussian bell curve suggests.
We expect the majority of the mass to be at the center, at the mean. But in high dimensions that stops being true: most of the mass sits in a shell around it. People illustrate this with an example: imagine a high-dimensional orange. Most of the mass of that orange would be near the skin, not in the pulp. That's what a high-dimensional orange looks like; it may not be the one we imagine. That's where our intuitions break down, and what we typically apply does not hold in high dimensions. But high dimensions also give us a benefit, what I call the blessing of non-uniformity. Data that is non-uniform tends to concentrate in small local regions; in the orange picture, there would be patches of data around the skin, local groups. That's the good part we can capture; it's the natural structure the data has. Our job is to understand the natural shape, the geometry, of the data in that domain. Whatever the data is, let it be that way: we don't force it to map onto a Euclidean flat space or some other space, we discover its real geometry. That's the key idea.

Let me give a small example to illustrate the point, a simple one everybody knows: we need to create clusters from two circles, an inner circle and an outer circle. Visually it's a completely intuitive problem and we can all see which points belong together, but if you have to solve it with a clustering algorithm like K-means, it's hard: you can't draw a line between the two circles and separate them. It isn't separable by any simple algorithm. If you try a classifier, say an SVM with a linear boundary, that doesn't work either. Some more complex algorithm could pull them apart, but the point stands: no simple algorithm works directly. If instead we focus on the geometry, the metric space we're in, and simply transform into a different geometry, the job becomes easier. Say we convert to an (r, theta) space, the radius and the angle. Now the circles become simple straight lines, and a plain K-means can split them; it becomes trivial (see the small sketch below). The point is that geometry can tell us what the inherent nature of the data is. If you capture that, the job becomes easy. This was a simple case, but in more complex cases the undoable becomes doable. That's the main idea.
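To make the two-circles point concrete, here is a minimal scikit-learn sketch (not the exact demo from the talk): K-means on the raw coordinates mixes the rings, but after moving to polar coordinates the radius alone separates them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Two concentric rings of points.
X, y_true = make_circles(n_samples=500, factor=0.4, noise=0.03, random_state=0)

# K-means directly on the Cartesian (x, y) coordinates cannot separate the rings.
labels_xy = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Change the geometry: describe each point by its radius (and angle if needed).
r = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)
labels_r = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(r.reshape(-1, 1))

# labels_r now matches the inner/outer ring almost perfectly, while labels_xy does not.
```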
So, now I'll start with the main part. I'll cover two things: first the theoretical part, and then the applied machine learning part, where you can use tools to understand the geometry of the data. I'll quickly go through the analytical part and then come to the big data part, how to do the same things at scale. First, the analytical part: topological spaces. In mathematics this is the most generic notion of a space that we have; in general, any space can be described as a topological space, a topology. In simple words, it's a set of points, a set-theoretic notion. Those points have neighborhoods; in this image the yellow, blue and other colors are defining neighborhoods for the data. Then there is a set of axioms relating points to neighborhoods, and if we take all these groups, their intersections and unions, collectively they make a space: a topological space, the most generic notion of space we can define. What can we study from such spaces? Connectedness: how points are connected, which is where clusters will come from. Compactness: is the data dense or too sparse. Dimensionality, which we talked about earlier. Continuity. And there is the possibility that in the real world you can't define distances everywhere; there could be holes in the data, so points A and B may not have a distance between them. In a Euclidean space every pair has a distance, so your assumptions fail there. How do we tackle cases where continuity breaks or there are holes? That's the general idea.

Next, a sub-case called metric spaces. Topology was the bigger notion; now we focus on the sub-area of metric spaces. A metric space is a topological space with one more constraint: a distance function that every pair of points obeys. What does a distance mean? Four properties define one. Non-negativity: it has to be positive or zero. Identity: if two points are equal, the distance has to be zero. Symmetry: d(x, y) has to equal d(y, x). And the last, most important one, the triangle inequality: d(x, z) has to be at most d(x, y) + d(y, z), like the sides of a triangle. You'll find lots of measures called similarities and dissimilarities, but they may not be distances, because the triangle inequality can fail. Only if all four properties hold do we have a real distance that we can use as a geometry and actually work with to find clusters (see the small check below).

The most intuitive example is the Euclidean distance, the formula we learn in school: the square root of the sum of the squared coordinate differences, d(x, y) = sqrt(sum_i (x_i - y_i)^2). It's the default in the majority of clustering algorithms. It's intuitive: we imagine the points mapped into a coordinate space. But it has drawbacks. It may work well in many scenarios, but we need to ask whether the assumption of a Euclidean metric is valid. What if the geometry is not Euclidean, not a flat space, and we are simply forcing it to be one? We could be distorting the actual geometry and adding a notion that is not present in the data, adding new information. These are the nuances of the Euclidean space that everybody uses by default with all the major algorithms.
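As a small aside on those four properties, here is a minimal numeric check, using cosine distance (one minus cosine similarity) as the example dissimilarity; it is popular, but it can violate the triangle inequality, whereas Euclidean distance does not.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    return np.linalg.norm(u - v)

def breaks_triangle_inequality(d, a, b, c):
    """True if d(a, c) > d(a, b) + d(b, c), i.e. the triangle inequality fails."""
    return d(a, c) > d(a, b) + d(b, c) + 1e-12

# Three unit vectors in the plane at 0, 60 and 120 degrees.
a = np.array([1.0, 0.0])
b = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])
c = np.array([np.cos(2 * np.pi / 3), np.sin(2 * np.pi / 3)])

print(breaks_triangle_inequality(cosine_distance, a, b, c))     # True: 1.5 > 0.5 + 0.5
print(breaks_triangle_inequality(euclidean_distance, a, b, c))  # False
```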
I'll take the most obvious example: the Earth. On the Earth, if we draw a triangle, the sum of the angles of that triangle on the curved surface is more than 180 degrees; on a flat surface it is exactly 180. A curved space is the most obvious non-Euclidean space we know about. So if we use the red dotted straight lines to measure distances here, we are distorting the real space: the distance along the surface is not a straight-line distance, it's a curved one. If we use the curved distances instead, that is the real geometry of the data. To measure those, we use something called a geodesic distance. The curved lines through A, B and C here run along the surface of the sphere; that's a geodesic. It's measured along great circles, which are the circles whose center coincides with the center of the sphere. If you move along great circles then, leaving aside points that are exactly opposite each other, for every other pair of points there is exactly one great circle through them, so you get a single path. That's one way of looking at a geodesic.

To answer the question from the audience: yes, for a great circle the center of the circle and the center of the sphere have to coincide, so the circle's diameter equals the sphere's diameter. If you measure along the great circle, you'll always find exactly one great circle for two points that aren't opposite, so you get a unique, well-defined distance with a single answer. You could define other distances, but having a defined concept, the great circle, removes any ambiguity. That's the idea.

That was the simpler case; here's an example of a more complex geodesic. People must be hungry; this is not a doughnut, it's called a torus. It's a complex shape. If you place an insect on top of it and it walks in a straight line, just watch this small animation: the insect is walking in a straight line, nothing else, and yet it takes a very complex path just to complete one loop. That path is also a geodesic. The earlier one, the Earth, the sphere, was the simple case; things can get complex, and we need to figure out how to handle these different geometries. That's just an illustration.

Coming to the point: how do distances relate to geometries? The key idea is that whenever we use a metric between two points, Euclidean or geodesic, we define the geometry. By using a geodesic we assume a sphere; by using Euclidean we assume a plane. We impose it. A metric has a one-to-one correspondence with a geometry, and that's why we need to think about what distance function we are using: every distance function imposes a geometry on the data, whether it's cosine distance, Euclidean, or something else like Dice.
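To make the sphere case concrete, here is a minimal sketch of the geodesic, the great-circle distance, treating the two points as vectors from the center of the sphere:

```python
import numpy as np

def great_circle_distance(p, q, radius=1.0):
    """Geodesic (great-circle) distance between two points on a sphere."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / np.linalg.norm(p)
    q = q / np.linalg.norm(q)
    # Angle between the two unit vectors, clipped for numerical safety.
    angle = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    return radius * angle

# A quarter of the way around a unit sphere: expect pi / 2, about 1.5708.
print(great_circle_distance([1, 0, 0], [0, 1, 0]))
```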
Every one of them imposes a geometry. So we need to understand whether it's the right geometry or whether we are enforcing it and distorting the real one. As you can see, even in the sphere case we could impose multiple geometries between a set of points: a plane, a sphere, anything. Which is the right one? We don't know. Since infinitely many combinations of geometries are possible, the problem is choosing the right one. One useful measure for choosing the right geometry is the Gaussian curvature. It can help us choose, among the available geometries for the given data, the most appropriate one. Quickly, what does Gaussian curvature mean? It's a number assigned to every surface. I'm showing three surfaces here. The K value for the first one is negative: look at the surface, it curves inwards. For the middle one it's zero, flat, and for the last one it's positive, bulging out. People sometimes get confused and ask: how does the middle one look flat, it's a cylinder, a cylinder isn't flat? But look at the surface itself. The surface of the cylinder is actually flat in this sense, so a plane and a cylinder have the same kind of geometry. We can even unroll a cylinder onto a plane using the radius and the angle, and we get back the same Euclidean space; it's the same as a flat plane.

Now, each of these curvature values gives a different distance. For a flat surface, it's the Euclidean distance we saw earlier. If the surface curves outwards, we get the spherical distance: a dot product and an arc cosine. If we have a hyperboloid it gets more complex: an inverse hyperbolic cosine over the different dimensions. The formula is on the slide and I won't go through it, but it shows that as we go to more complex surfaces, the geometry gets more complex and the distance gets more complex. Now imagine the torus I showed earlier. It has two curvatures: the outside bulges out and the inside bends in, so it has both positive and negative curvature, and we'd have to combine the two. The distance there would be more complex still. The complexity can keep increasing.
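The slide formulas are not reproduced above, but in a standard form, and assuming unit radius and unit curvature, the three cases look roughly like this (a hedged reconstruction, not the exact slide content):

```latex
% Flat surface (K = 0): Euclidean distance
d(x, y) = \sqrt{\textstyle\sum_i (x_i - y_i)^2}

% Sphere (K > 0), unit radius, x and y unit vectors: dot product and arc cosine
d(x, y) = \arccos(x \cdot y)

% Hyperboloid (K < 0), unit curvature, with the Minkowski-style product
%   \langle x, y \rangle = -x_0 y_0 + \sum_{i \ge 1} x_i y_i
d(x, y) = \operatorname{arccosh}\bigl(-\langle x, y \rangle\bigr)
```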
One way to handle this is to get back to a space that we know. That's where the concept of a manifold comes in: map the topological space to something locally Euclidean. The whole space may be complex, but in parts it may be flat. Like the Earth: a long time ago we assumed it was flat, and we still did pretty well measuring distances locally, even though it isn't flat. Similarly, the majority of complex surfaces can be mapped to locally flat pieces. That's what we call constructing a manifold. And we all know a very simple example without realizing it's a manifold: the Earth again. When we create an atlas, that's a manifold; everybody worked with an atlas in school. It's a mapping of a three-dimensional surface onto a two-dimensional space. That's a simple example of how a manifold can be constructed.

I'll go slightly into how a manifold is constructed, just to give an idea, because it will be useful. In general, take any curved space, possibly of varying curvature. If we have a function that maps a point on the curved space to a point on a flat space, its inverse exists, and the mapping is always unique, then a manifold exists; that criterion alone is enough. These maps are called charts, and a collection of charts is an atlas, just like the atlas we got back from the globe. If such functions exist, we have a manifold. There are more complex scenarios where the curvature changes and multiple functions overlap; a transition function then looks at the intersections of the different charts so you can combine them and still get a manifold. A transition map together with an atlas is the complete kit to build a manifold over any curved surface. That's the general picture.

Now, the main point: a manifold can be learned in two different ways. One is called topology: it looks at the global structure of the data. The second is called geometry: it looks at the local structure of the data. We need to choose what we care about, the global structure or the local structure; I'll come to an example. As the last slide of my theory, I'll just mention one technique: Riemannian geometry. It's a generic way of discovering manifolds on arbitrary curved surfaces. This picture is projecting a sphere onto a plane based on tangents: at every point you can draw a tangent plane and understand the curvature. It constructs what is called a smooth manifold, a Riemannian manifold, based on a Riemannian metric; the metric is essentially an inner product on the tangent space at each point, varying over the surface. The mathematics is involved, so I'll leave it at that, but it's very powerful: Einstein even used it in general relativity to describe spacetime curvature. It's probably the most general way of defining geometry, and I'll leave it as an area of research people can study on their own.

In short, that was the theoretical way of understanding the geometry of data: by analogy or by theory, figure out which structure to apply to the data and test it. In this approach the theory comes first: you imagine a space and then get the metric. That's one way. Now let's come to the easier part: how do we do it with applied machine learning? I'll pick some examples. Here's the first one: we need to find the clusters in this. No prizes for this one, it's easy; there are four clusters in the data and we can separate them easily. Now let me change the example slightly: if I rotate the same O-D-S-C points, does it make a difference? Can K-means still catch it? Yes, it can, because the metrics are invariant to rotation and reflection; if things work in the original space, they'll also work after a rotation or reflection.
That's the easy case. Let's make it slightly more interesting: take the plane and slant it. Will it still work? Yes, because it's still a flat plane. Let's make it more interesting still: the same O, D, S, C, but now laid out on an S-shaped surface. What do you think, will it work? How many of you think it will work with K-means? And how many think it won't? This first one actually does work with K-means, because locally a plane can still be drawn: the curvature is small, not severe enough, so the surface can still be approximated by a plane and the points can still be captured. A plane can be drawn that gets the O out, the D out, the S and the C out, so the clustering still works.

Now let's make it harder: the same example, but I've bent the S further down. Will this one work? How many think it will? This time it probably won't, because the curvature is now too sharp. Look at this light-blue color: it's being affected by all four clusters, it's spread across all of them, so K-means can't figure it out. I've bent the S so far that it's really more like an hourglass; that's a better way to describe it. The curvature is so high that the two sheets come very close at the center. But, as someone pointed out, they're not actually overlapping. That's a good question: are they overlapping or just close? They're not, and the way to see that is the other approach: use a manifold. If we use a global manifold, we can use MDS. In this image we can clearly see the layout of the three-dimensional S; it's a clear S, no overlap at all. And if we take the same data where clustering failed and look at its position in a two-dimensional embedding, you can see the sheets were never really overlapping: they were too close, but there was physical separation. So it's an issue with the clustering technique, not with the data.

And yes, to the question: this example is three-dimensional so we can visualize it, but the same thing works from n dimensions down to any lower dimension. Which do you use first, the higher or the lower dimension? Good point; let me reverse the same technique. Take the same data and do the projection first, before clustering. If I go back a slide, the clustering there was done on the three-dimensional data, and we couldn't find good clusters in the three-dimensional world. Now do the reverse: project first, then cluster. It's still not the best case, but it's better: only one cluster is polluted, the other three are fine. That gives an important principle: if we can map to a locally Euclidean manifold, clustering will probably work better. If that's possible, get the data into the lower-dimensional space and cluster on the manifold rather than on the original data.
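Here is a minimal scikit-learn sketch of this project-first-then-cluster idea, using the built-in S-curve as a stand-in for the talk's data and MDS as the global manifold technique mentioned above (the local counterpart, LocallyLinearEmbedding, is discussed next):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_s_curve
from sklearn.manifold import MDS

# A 3-D S-shaped point cloud as a stand-in for the curved data in the talk.
X, t = make_s_curve(n_samples=800, noise=0.05, random_state=0)

# Clustering on the raw, curved 3-D points.
labels_raw = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Global manifold: metric MDS tries to preserve all pairwise distances,
# then K-means runs on the 2-D embedding instead of the original space.
embedding = MDS(n_components=2, random_state=0).fit_transform(X)
labels_embedded = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embedding)

# In the talk's example, the clusters found on the embedding were the cleaner ones.
```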
If the data had some clusters and a manifold was possible, you will recover them. And yes, there are pros and cons; in some cases you can judge whether it's viable or worth doing. But the general case is n-dimensional, and I'll come to that; doing it this way is the better idea in the general case. But how?

There's one more thing. We were able to find the structure at a global level, but a lot of information is lost: I can't see the O-D-S-C any more. Can you? There's no O-D-S-C here; we got the structure, but we lost the letters. So there is another idea called a local manifold. If we use a local manifold instead, look what we get: we got the O-D-S-C back out of that curved sheet. That's the power of a local manifold: it preserves the information that is more relevant. The data contained more than just the curved S shape; it also had the inherent structure, the O, the D, the S, the C, and that's what the local manifold captured.

Let me compare the two. The same S looks like this under a global manifold and like that under a local manifold. They look at different things and work in different ways. What happened in each? The global manifold took all the pairs and computed the distance for every pair; it tries to account for every possible distance and then makes a projection. That's the more expensive computation. The local manifold, if you look closely (I don't know if the lines are visible), only highlights the lines between neighboring points; only the local distances are considered. It looks at the neighborhood and finds relations only within a small region, and that's why it recovered the inner structure. For clustering, that is probably the more powerful and more efficient behavior: it gives us the natural clusters.

Now, to the question: with K-means I have to supply K. I knew there were four clusters, so I told it K is 4 and it did the job. If I don't know K, the problem gets more complex. So let's use the property we just saw: the local neighborhood can give us those structures if we can find the demarcation. That's where the next idea comes in, an algorithm you must have heard of: DBSCAN. It's simple. And yes, to the other question, that earlier example was using Euclidean distance.

So that covers the theoretical route: if we know the geometry, like when we know the space is curved, we simply apply a geodesic and we get the answer. That was the first part: if you know the geometry, it's pretty easy. If we don't know it, we want to do it empirically, by testing: apply an algorithm, test it, and compute a measure. We can also define evaluation metrics,
such as the number of clusters and the intra-cluster versus inter-cluster distance. Those tell us how good the clusters are; we can measure cluster quality. So empirically, when we don't know the geometry, this is the second approach: trial and error, test and measure. By the way, the techniques I was referring to: the local one was LLE, locally linear embedding, and the global one was MDS, multi-dimensional scaling. LLE is really powerful: it only looks at the neighborhood around each point, and that captures the real essence of the data, the real information that we had.

We can generalize this idea, and we may not even need to construct a manifold, because that is computationally intensive; it's not always possible and not easy either. We can just take the core idea, look at the neighborhood, and use an algorithm that works on neighborhoods. That's DBSCAN. DBSCAN is a simple algorithm that looks at the neighborhood of each point, effectively constructs a graph, and labels what are called core points. Quickly, for those who don't know it: the points in red are core points. We draw a circle of some radius around each point and say, for example, that if there are at least three points within the circle, it's a core point. For every red point you'll find at least three points inside its red circle; that's what makes it a core point. Then there are points that fall inside some core point's circle but don't themselves have three neighbors; those are the non-core points, like B and C here in yellow. They belong to some core point's neighborhood, so they're still connected, but they're not core points themselves. And then there are the unclustered points, connected to no one. The whole thing is modeled as a graph, so the same local-neighborhood problem can be expressed here.

But we need two things. First, a distance matrix: if we have the distances between all pairs of points, that's the starting point, and we can work from there. Second, the epsilon, the radius: if we can somehow define the radius of the circle, the minimum distance to cut on when building and pruning the graph, then we get the clusters. Those are the two main ingredients. If we can get both, we have a totally automated clustering approach: no K to define, nothing to tune; just push the data through and get the answers. The problem is: how do we get this epsilon automatically? That's where we're proposing a method for how to get it. (A small sketch of the two ingredients wired together is below.)
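Here is a minimal sketch of those two ingredients handed to DBSCAN: a precomputed distance matrix plus an epsilon. The data, the metric, and the epsilon value here are only placeholders.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

# Stand-in feature vectors; in practice these come from your data (or a probability space).
X = np.random.RandomState(0).rand(1000, 20)

# Ingredient 1: the full pairwise distance matrix, with whatever metric fits the geometry.
D = squareform(pdist(X, metric="euclidean"))

# Ingredient 2: the epsilon; a placeholder here, later derived from the distance histogram.
eps = 0.8

labels = DBSCAN(eps=eps, min_samples=3, metric="precomputed").fit_predict(D)
# labels == -1 marks unclustered (noise) points; every other value is a cluster id.
```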
But before I get to solving that, I'll show one more case where the same technique applies. The second area is what I call different spaces. Earlier I was talking about curved spaces: say one of the axes is curved, with peaks rising, and you project it onto one dimension. On a normal ruler the markings are equally spaced: one centimeter, two centimeters, three centimeters. But when you project a curve onto a straight line, the projections are non-uniform, not equidistant: points near the peaks land close together, points near the flat parts land far apart, close and far, close and far. That makes it harder to define the geometry. One option is the geodesic route, doing the curved geometry. But there is another solution: use a reference space, something like a probability space. If you can map each point in this space to a point in another space where you can define the probability of the point belonging to some label, some class, some cluster, that's an easier route.

Let me show an example. Say we have users and movie categories. I don't know much about the users, their properties, or which groups they belong to, but I want to cluster them. Even knowing nothing about the users, I can still cluster them based on their choices: I know whether a user watches action movies or thriller movies. We can define probabilities in a simple way, by counting. If user one watches three action movies out of ten, that probability is 0.3, as simple as that; if user two watches five comedy movies out of ten, that's 0.5. So we can build these conditional probabilities and create a bipartite graph. I only know each user's choices and the counts of those choices, and that gives me a vector: user one chose action 0.1 of the time, comedy 0.2, drama 0.3, thriller 0.4. This way we can easily create a probability vector for every user: build a bipartite graph, count, and turn the counts into probabilities. We are actually mapping each user into a probability space. We had no user attributes, yet we can still use those vectors. Now we have a new space, a four-dimensional probability space, and we'll cluster in that; how, I'll cover later. The idea is that we got to a new space without looking at the attributes of the input data. You can also add attributes if you have them: say you group users by age, sex, language and location; call those tuples. A group of users can again have a probability vector. In general, any set of n-dimensional attributes can be mapped to a p-dimensional probability vector. That's the generic version, and that's what I call a different space: we are not looking at the original data in its original space, we are referring it to something else, its classes or categories. The condition is that a reference, a label, has to be available; that label is what we exploit.

In short, once we're in a probability space, we have lots of measures available. Some of the function names: KL divergence, Bhattacharyya, total variation, and there are many more. You could also define a similarity measure such as SimRank on a graph. But all of these are not distances; they're divergences or similarities. The Hellinger distance, though, is a true metric: it satisfies all four properties I showed at the beginning for a metric space.
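Here is a minimal sketch of that mapping, with hypothetical watch counts: turn counts per genre into probability vectors and compare users with the Hellinger distance.

```python
import numpy as np

# Hypothetical counts: rows are users, columns are genres (action, comedy, drama, thriller).
counts = np.array([
    [3, 1, 4, 2],
    [0, 5, 3, 2],
    [4, 0, 1, 5],
])
probs = counts / counts.sum(axis=1, keepdims=True)  # each row is now a probability vector

def hellinger(p, q):
    """Hellinger distance between two probability vectors (a true metric)."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

n = probs.shape[0]
D = np.array([[hellinger(probs[i], probs[j]) for j in range(n)] for i in range(n)])
# D is a distance matrix we can hand to a clustering algorithm such as DBSCAN.
```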
So if we use this, we can actually compute the right geometry of the data. It's a simple idea: we just need to build a distance matrix from this function and the vectors. And remember, a distance matrix is exactly what we needed for DBSCAN, so one of the two ingredients is already fulfilled. Generalizing, what do we have so far? For clustering we need a distance matrix, and we can get one in both scenarios: directly for high-dimensional data, and via the Hellinger distance in the probability space. That's the simple part.

Now the challenges. First, how do we build that matrix at scale? You could have a million rows, 10 million, 100 million. The distance matrix is n by n, n squared: 100 million by 100 million is 10 to the power 16 entries. That's huge, so how do we do it? That's one part. Second, given the distance matrix, how do we find the epsilon, how do we get the clusters automatically? That's the key idea, and I'll talk about it first because it's more important, assuming for now that we have the distance matrix.

Take the earlier example again: the less curved S and the more curved S, the hourglass. Do a simple histogram or density plot of the distance matrix: just count the distances. In this plot the biggest peak is at distance 2, because that's the maximum distance in the data; similarly for the other case. But I can play with the same histogram or density plot and do finer graining. If I plot at a different resolution, I get a different curve: notice there is now also a peak near the beginning, which wasn't there earlier. That is controlled by the smoothness of the density kernel you fit, so you get different shapes; with a more refined kernel, the early peaks appear. And that's the major idea: if the data has clusters, then the majority of points within a cluster are very close to each other, so those small distances lie in a small range and their frequency is very high, which is what you're seeing here. A small range of distances with very high frequency; if you can capture that, that is your epsilon.

Put into words: the clustering epsilon rests simply on the assumption that we have clusters. Assuming the data is high-dimensional and organized into clusters, we can compute the distance matrix, look at the frequency of distances, and the intra-cluster distances should lie in a narrow range because cluster members are very close together. The assumption translates into something very simple: if we can find the peak of the first mode, we probably have the clustering epsilon. But while finding a peak may be easy, finding the right curve is not; the problem becomes which is the right frequency curve for the data.
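Here is a minimal sketch of the first-mode idea, assuming random stand-in data and a fixed, hand-picked bandwidth; choosing the bandwidth properly is exactly the problem discussed next.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.neighbors import KernelDensity

# Stand-in feature vectors; in practice, use your real data or a sample of it.
X = np.random.RandomState(0).rand(1000, 10)
distances = pdist(X)  # condensed vector of all pairwise distances

# Kernel density estimate of the distance distribution at an assumed bandwidth.
kde = KernelDensity(kernel="gaussian", bandwidth=0.05).fit(distances.reshape(-1, 1))
grid = np.linspace(0.0, distances.max(), 500).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))

# Epsilon = location of the first local maximum (the peak of the first mode).
is_peak = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
peaks = np.where(is_peak)[0] + 1
eps = float(grid[peaks[0], 0]) if len(peaks) else float(grid[np.argmax(density), 0])
```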
That's where the problem of finding the right Gaussian kernel comes in. Say we fit a Gaussian kernel density over the distance plot, several candidate curves. The first mode is the one we need: it's the intra-cluster density, and its peak, its mean, probably gives us the epsilon for clustering. There is a simple technique: do a grid search over multiple bandwidth parameters, which control the smoothness. Give it a list of bandwidths and run a Gaussian kernel density estimation for each (an FFT-based estimator works: transform, invert, get the curve back). Then score each candidate by the likelihood of the data under the estimated curve, a standard log-loss style score over the distribution; the best score gives you the optimal bandwidth and grid size, and from that curve you read off the first peak.

To the question: yes, one route is the theoretical one; the other is to play empirically, try different distance functions, and then measure parameters like intra-cluster and inter-cluster distance, looking for the most separated and most dense clusters. Exactly: an empirical route, a grid search over those settings, optimizing those clustering measures. And to the other question about HDBSCAN: sure, but that's hierarchical; it effectively works at multiple radii, larger and larger. Here it's more about finding the single minimal cutoff distance. In this example, the bigger structure for HDBSCAN would be the S and the smaller ones would be the O-D-S-C letters, so yes, HDBSCAN is one way of doing it if your data really does have a hierarchy of clusters. But at times the structure doesn't come as hierarchies of clusters, so this is another way: find the epsilon separately and feed it to DBSCAN. It's the same DBSCAN, unmodified; just plug in the epsilon and you get your answer.

Now let me quickly summarize how to get the distance matrix at scale and the challenges there. You take a data frame and compute the pairwise distance between every two feature vectors. The assumptions: the data is large; dimensions could be a thousand, ten thousand, or, in the current deep learning world, even millions; and the number of rows is easily more than a million. Even in the simple one-million-row case, the computation is n squared, and as n increases, n squared blows up. So you have a compute problem and a storage problem. There are a few quick things we can do; I'll summarize the ideas. The most obvious way to compute the distance matrix is a cross join in Spark, a simple piece of Spark code, and then apply the distance function to each pair. But as you know, an n-squared cross join is pretty bad: it has a huge shuffle cost and it can run on for days.
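The slide's Spark code is not reproduced here, but a minimal PySpark sketch of that naive cross-join approach might look like the following; the DataFrame df with an id column and a features array column is hypothetical.

```python
import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

left = df.select(F.col("id").alias("id_a"), F.col("features").alias("f_a"))
right = df.select(F.col("id").alias("id_b"), F.col("features").alias("f_b"))

# The n x n pairing: this is the shuffle-heavy step that blows up for large n.
pairs = left.crossJoin(right).where(F.col("id_a") < F.col("id_b"))
distances = pairs.withColumn("dist", euclidean(F.col("f_a"), F.col("f_b")))
```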
In our case, with data of more than a thousand dimensions and well over a million rows, it ran into days. You can distribute it and make it horizontally scalable, but there's a limit; you can't keep adding executors forever. It's a challenging problem. A quick way to reduce the amount of work is a standard Spark technique: instead of computing pair by pair, compute row by row, a whole row of the distance matrix at a time. Convert the feature vectors into a single array, and compute, for each element, its product against the whole array at once; that's the row computation. The key idea is to broadcast the array to all the executors. So you have the n-length collection of vectors partitioned as data, and the same n-length array broadcast to every executor; each task takes one element, multiplies it against the broadcast array, and produces one full row. It's sample code and I'll share it later on GitHub, but the shape of it is: collect the vectors, do a simple broadcast, then apply a row-level distance function that takes one element and returns the whole row for it (a sketch follows below).

What happened here is that we do the entire row computation in a single task. But this also has an issue: the broadcasting itself. Broadcasting needs memory, RAM on every executor, and for high-dimensional vectors that causes problems. So the next idea is to switch the vectors to a sparse representation, sparse vectors. In our data we had a thousand dimensions, but most of the time only 70 or 80 were populated, less than 10%; the majority were empty. Sparse vectors gave us the space optimization we needed.

Here's the trade-off between the two techniques. The first one, the cross join, does n-squared computations with a cost c per pair, where c includes the shuffle and the distance function; very expensive in time, but space-optimal, since you just pick a pair of records and compute one distance. The new technique: let k be the sparsity of the data, around 7% in our case. Now the time cost also scales with the sparsity k (roughly a factor of k instead of the full dense cost), and the memory cost goes up because of the broadcast. So at the cost of memory we save time, and for n over a million with sparsity well under 10%, that works out to an enormous reduction in the number of operations.
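Here is a minimal PySpark sketch of that broadcast-based row computation, again with a hypothetical df holding id and features columns and spark as the SparkSession; dense NumPy arrays are used for brevity where the talk used sparse vectors.

```python
import numpy as np

# Collect the feature vectors once and broadcast them to every executor.
vectors = np.array(df.select("features").rdd.map(lambda row: row[0]).collect())
bc_vectors = spark.sparkContext.broadcast(vectors)

def row_distances(record):
    """One full row of the distance matrix, computed inside a single task."""
    point_id, features = record
    row = np.linalg.norm(bc_vectors.value - np.asarray(features, dtype=float), axis=1)
    return point_id, row

distance_rows = (
    df.select("id", "features")
      .rdd
      .map(lambda row: (row["id"], row["features"]))
      .map(row_distances)
)
```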
Then a few more quick things; you can read the code later. We can find the epsilon the same way we did earlier, but cheaply: take a random sample of the data frame, say 0.1% or 0.01% of the data, compute the distances only on that sample, and, assuming the sample captures the essence of the data, build a histogram and find the first peak. That again gives you a broad, rough epsilon to cut the distance matrix with, which also means you can keep only a small fraction of the distance matrix instead of storing the whole thing, saving storage cost as well. It's two simple functions: take a sample from the data frame, compute the sample's distance matrix, and build a histogram; Spark's RDD of doubles has a histogram function that gives you that directly. Once you have the histogram, take the first peak and you have a rough cut-off epsilon.

Then there are some typical Scala things I'll mention for your benefit. In any of the hot paths, like the distance function or any loops, don't use iterators, foreach or map. You'll ask, what's left then? We still have loops, so the best thing is a primitive while loop. The functional constructs do a lot of extra work per element; Scala does a lot of fancy things to make the code look smarter, and you lose performance there. I won't go into the details, but here's a hint on how to find these things yourself: take stack traces. Send kill -3 to your Spark JVMs, collect the thread dumps, aggregate and analyze them, and you'll see why those functions are slow; compare them against a dump from a plain while loop, which has none of that machinery. So, quick tip: use a primitive while loop; it's around ten times faster, I've measured it, and it buys you a lot.

After doing all of these things, in our case the whole computation came down from days to around 30 to 40 minutes, roughly a 20 to 30x improvement, from the broadcast-based join plus the other optimizations, the cut-off epsilon and so on. You can easily get that kind of improvement by doing these basic things.

Finally, here's a simple recipe for distributed DBSCAN; you can pick the code up later from the slides. If you want a distributed version, use a GraphFrame, a distributed graph. Find the core points by the same analogy as before: a point is a core point if its neighborhood degree is at least some threshold, call it numPoints. Find the core edges, rebuild the graph from them, and then just call connected components; the connected components are your clusters. That simple technique gives you distributed clustering on Spark. So you have a distributed computation and a clustering technique for big data; combine the two and you can get at the real geometry of the data and find the real clusters you have.
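Here is a minimal GraphFrames sketch of that recipe. It assumes (hypothetically) a vertices DataFrame with an id column and an edges DataFrame with src and dst columns containing only the pairs whose distance is already below epsilon.

```python
from graphframes import GraphFrame
from pyspark.sql import functions as F

min_points = 3
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # required by connectedComponents

g = GraphFrame(vertices, edges)

# Core points: vertices with at least min_points neighbours within epsilon.
core = g.degrees.where(F.col("degree") >= min_points).select("id")

# Core edges: keep edges that start at a core point, then rebuild the graph.
core_edges = edges.join(core.withColumnRenamed("id", "src"), on="src")
core_graph = GraphFrame(vertices, core_edges)

# Connected components of the pruned graph are the clusters.
clusters = core_graph.connectedComponents()  # adds a "component" column per vertex
```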
So these are the findings. Understand the geometry of the data: that way you capture the natural information in it, you don't modify it, you don't distort it. One route is the analytical approach: first figure out the geometry, then apply the corresponding distance function; that's more of a theoretical study, where you need to understand how to estimate the distance function and how to test it. Another is: if you have a reference space where you can map each point to a label, use that reference space, build the probability vector, and cluster on it. The third is the manifold route, which gives you a good sense of how the data is organized, whether the global structure or the local structure is more important, and therefore where to focus. And finally, for clustering, the local neighborhood matters most: as we saw with LLE, we captured the real information in the data by looking at local structure. So focus on the local part, and for big data, apply the optimizations we discussed. With all of that, we can do unsupervised machine learning fairly automatically without knowing much about the data.

So I think that's it. One final quiz question: can you tell the difference between the two images? If the answer is yes, then you're not a topologist, because topology sees them as the same thing: the coffee mug and the torus are homeomorphic. If you can't tell them apart, you're thinking like a topologist. It's a simple mathematical joke.