So we want to start with the first set of machine learning techniques, now that we have some preparation, and it's natural to start with clustering, because historically clustering methods were among the first smart techniques that were introduced. The idea behind clustering is that intelligence is the capability of grouping similar objects. The thesis is straightforward: if you can group similar objects, you must have a certain level of intelligence. If you know different types of squares and rectangles, different sizes of circles, flowers, and cars, and you can group them together in a game, that means you have a certain level of smartness. So basically, clustering groups unlabeled data into clusters of similar inputs. From the perspective of learning, the adjective unlabeled is extremely important, because it means that with any clustering technique we are dealing with unsupervised learning. Unlabeled means unsupervised. And at the moment there are critical voices saying that as long as we rely on supervised learning we will be stuck with weak AI, because actual strong AI, or artificial general intelligence, has to be unsupervised. From that perspective, that is another reason for me to start with clustering: these techniques are being somewhat neglected, let's say. So look at the example of cars I mentioned. If I have the horsepower of cars and the maximum speed of cars as features, and I want to group cars, I may get something like this: some cars here, some cars here, and some cars here. There are cars that have a lot of horsepower but whose maximum speed is not very high; there are cars that don't have a lot of horsepower but have more or less the same maximum speed, a little bit higher; and there are cars that don't have that much horsepower but are really fast. Well, if you give them to any...
If you remember, we implied that AI is function approximation, and mostly we work with separating stuff. Clustering goes and finds groups of objects that are very similar. This group could be labeled trucks, this could be labeled sports cars, and this could be labeled SUVs. Of course, no technique can tell you "this is a sports car, this is a truck"; it will just say "this is group 1," and you can put the label on it and say group 1 is SUVs. The label doesn't matter. When we say unlabeled, it means you only get X, in contrast to supervised learning, where you get X and the desired output. So we don't start with the supervised way. Every point is represented, in this simplified example, by two features, horsepower and maximum speed, and then some clustering algorithm comes, groups the points together, and says: this is one group, this is one group, this is one group. I don't know what their names are, I have no idea, but they belong together. That's a certain level of intelligence. Of course, when we do that, we can ask whether our clusters are well separated. In this kindergarten example, you see that SUVs, sports cars, and trucks are nicely separated; there is an ocean of distance between them. I can give this to the most stupid algorithm and it should manage; this is an easy problem. They are well separated. Notice that I'm not drawing lines, because we are not doing classification; there is no line here. Clustering is different from classification: classification is supervised, clustering is unsupervised. And more importantly: are the clusters linearly separable? Although I'm not drawing any lines, if I did, would I be able to separate these groups by drawing some lines? If I draw a line here and a line here, I would separate them. So this is a linearly separable problem, which has nothing to do with whether you want to classify the data or cluster the data.
Linear separability tells you how difficult it is to recognize stuff. If the data is linearly separable, it's easy; if it is not linearly separable, it's difficult. An example: if I have something like this, and now I have to draw something that doesn't make sense to you, you see a chunk of data and there is no separation. Visually you could say this is one group and this is another, but there is no way I can draw a line. So if I group this with clustering and get my three classes this way, they are not linearly separable; the previous example was linearly separable, although I am not separating anything by drawing a line, because this is not classification. Still, whether the data is easy or difficult will affect me, no matter what I want to do with it. Difficulties can be overlaps, where clusters go inside each other's territory, and complicated shapes: the shape of the clusters may not be nice and almost circular, it may be some weird shape, making it hard to draw a boundary or contour around each group. Clustering algorithms can be divided into basically two types with respect to what they expect from us. One type needs to know the number of clusters. Who told me that I have to look for three clusters here, trucks, SUVs, and sports cars? I knew it, so I told the algorithm to look for three clusters. What would happen if I asked the clustering algorithm to look for four clusters, or five? It will find something, but most likely it will not be meaningful. So there are techniques where you have to tell them what K, the number of clusters, is, and there are some that don't need it: they will figure out how many clusters there are. They will just jump into the pool: okay, let's figure it out. Of course, that's a very desirable attribute of a clustering technique, because if I don't have to tell the algorithm to look for three groups of cars, that helps a lot; things can very quickly become very
complicated, because most of the time we don't work with two features; most of the time we work with several hundred features, and most of the time we have no clue how many different patterns are in the data. So then we may ask: how do we handle a technique that needs to know how many clusters to extract? Well, there are ways to do it; there is a trick for every problem. We start with the K-means algorithm. K-means was invented in the early 70s, based on some previous observations, and since then many versions of K-means have been introduced, but the core has stayed the same, and the core is really simple. The main idea of K-means is to find the centroids, which are basically the prototypes, the means or averages, of K clusters. So K-means basically refers to how many means, how many class averages, you have. This is a mean, this is a mean, this is a mean: I have three means here. However many clusters you have, K-means will go and try to find the prototype for each one of them, and the prototype happens to be the average of the class. And when I say average of the class, that means clustering techniques like this have to be applied to numbers; you cannot apply them to categorical data unless you somehow come up with some quantification for it. How it works is actually quite simple. First, we randomly place K centroids. If somebody has told you to find three clusters here, that means you have three centroids. If you come up with three random points, you cannot really call them centroids yet, but I place them, and over the iterations they become centroids. When I first place them, they are just three random points. If this axis goes from 1 to 100 and this one goes from 1 to 1000, whatever, just get some random numbers in those ranges. Your initial guess could be here; it doesn't matter. So you start with three random points, which may be completely outside of your data.
Why do we do that? Because we have no idea; we don't know where to start, and we have to start somewhere. We will do the same when we get to neural networks: we don't know where to start, because the solution space is very difficult, so we start with random guesses, and the entire network is configured with random weights. So you randomly place K centroids, and then, second step, we assign each data point to its closest centroid. If I come up with random guesses, each of them has some distance from every data point, so I start and say: okay, this point and this point are close to this centroid, so I assign these two points to it; and these two points are very close to that centroid, so I assign them to that one. At the beginning this doesn't make sense, of course, but we have to be patient; after we go through the iterations, hopefully things start to take shape. And there would be no learning if I made some random guesses and stayed there. I have to move. I have to move means: this centroid, across some sort of trajectory, has to come here; this one has a much longer trajectory and has to come here; and this one has to somehow get here. If I manage to do that, I have learned the groups and clusters of the cars with respect to those two features. That's the learning part: the trajectory of these three points, starting here and getting to the actual mean of every meaningful cluster. So of course you then have to update the centroids, which is what I just indicated. This updating happens in every iteration, and it takes many, many iterations: I'm here, and then here, and then here, and so on. It may take several thousand iterations to get there. Yes? Which data points? They're all over the place, I see. The mean of what? Some people take all of them and get one mean? We're not sure. What you could do is randomly grab the data, divide it into three groups randomly, and
make those your starting averages. You can do that, yes. But we don't see this structure when we start; the algorithm doesn't see it, and we don't know where the means are. And again, I did not draw this as a circle, this as a square, and this as a triangle to say that they are different things; the software does not see any difference, they are all just numbers. The structure is not visible. But going back to the question: there are many different ways to start. Just starting with randomness is the most obvious way; it's a cheap way and a safe way in most cases, but there are other ways, and there is no rule saying you should not try to start differently. Which means we basically have two types of distance calculation. Say I have three points here, three points here, and I get two candidates, so I'm looking for two clusters. I have to calculate the distance of the first candidate centroid, which is not a centroid yet, to all data points, and then the distance of the second one to all data points. These are the distances for class assignment, because I want to know, for example, if this is c1 and this is c2, my centroids, that when I start, this point belongs to this one and that point belongs to that one. But of course, once things start moving, for example when this centroid gets close to here, maybe this point belongs to the other one. Things will move; data points will change membership between clusters all the time. If you visualize that, it's as if it's a competition: this point jumps from this group to that group, that point jumps from that group to this group, because it's still not clear who the prototype, the representative, of each cluster is. For that you need to converge.
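The two steps so far, placing K random guesses and then assigning each data point to its closest candidate centroid, can be sketched in a few lines of Python. This is a minimal illustrative sketch, not a full implementation; the (horsepower, maximum speed) values are made up:

```python
import math
import random

def init_centroids(points, k, seed=0):
    """Step 1: k random guesses within the range of each feature."""
    random.seed(seed)
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    return [[random.uniform(lo[d], hi[d]) for d in range(dims)]
            for _ in range(k)]

def assign(points, centroids):
    """Step 2: each point joins the cluster of its closest centroid."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

# toy cars: (horsepower, maximum speed)
cars = [(100, 120), (110, 125), (400, 260), (390, 250)]
centroids = init_centroids(cars, k=2)
labels = assign(cars, centroids)
```

The first assignment will usually look nonsensical, exactly as described above; it only becomes meaningful once the centroids start moving.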
If you converge to the real averages of the clusters, then it's clear, and we stop. But we also do something else. If I try to redraw this, and again I have my centers, there is another type of distance calculation, where you only calculate distances inside a class. These are distance calculations for updates, basically for the error. At any moment I assume these centroids are okay, this is the best I've got. So I calculate the distance of every data point from the current position of its centroid, and I add them up. If they are good centroids, what? The sum of distances should be minimal, shouldn't it? If the centroid is really in the center, look at this one: it is very, very close to this point, to this point, to this point; the distances are really small, and everything else is close compared to these. So if this is a really good centroid, then the sum of distances to the elements inside the class, a temporary class, I know, until I converge, should be minimal. And if I do one more step and the distance becomes even smaller: oh, I had not yet found the optimal class centroids, so go further. I move, I move, I move, and I see that the error is going down. The error is what? The sum of all distances. Clearly here, if I look at this one, the average could easily be here, and then I have my best centroid. As long as I'm not there, I have to move, ideally in a straight line, because I don't want to waste time; I want to go the fastest way toward the centroid. A trajectory like this is not good, because I'm taking a detour, which means it takes a lot of time to get there. And we will see that whatever we do, we want to get there fast, because computations are expensive. Okay, good.
There's not much more to K-means, but we still have to mention some attributes, and then we move on and look at another version of K-means. The grouping by similarity here, grouping data points based on similarity, happens via distance measurement. What? Surprise, surprise: we don't have anything else. Similarity, dissimilarity, proximity, closeness, whatever: distance. You have to calculate distance. Didn't we do that for PCA, too? Wasn't that also about distances? Yes, of course. We don't have any other tool; everybody cooks with water. If you cook with oil, you're going to die of high cholesterol; there is only one liquid. Now, if I have a point xi, another point xj, and I'm looking at ck, where ck is the k-th cluster centroid, then we are talking about the distance between i and j. We calculate that distance as the square root of (xi1 minus xj1) squared plus (xi2 minus xj2) squared and so on, which is the Euclidean norm of xi minus xj, the good old Euclidean distance. If the distance is small, you're similar to me; if the distance is large, you are not similar to me, and you should not be in the same cluster as me. So we trust the distance metrics a lot; they measure everything. That's the driving force, yes. Here x is a vector of features; it has many features, say horsepower, color, maximum speed, insurance premium: those are features one, two, three, four, and the index i denotes the measurement. This is the Euclidean distance, the L2 norm, the same thing we know from high school. So what is the objective? The objective is fundamentally the error, and the error for us is a sum over the K clusters. If I write capital K here, that means you have to go over all clusters.
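That distance formula in code is a one-liner, assuming a data point is just a list of feature values (an illustrative sketch):

```python
import math

def euclidean(xi, xj):
    """L2 norm of the difference: the good old Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

d = euclidean([1.0, 2.0], [4.0, 6.0])  # a 3-4-5 triangle, so d = 5.0
```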
This here is the distance calculation. You sum over all clusters, and then you sum over all those x's that belong to a certain cluster Ck. So I do this, and then I do this, and then I sum it up, and that is the total distance. The distances should be small. I look at the sum of distances as the error; technically it is not an error, but "you are not similar to me," so it's an error. The sum of distances across all clusters and all instances is my error, which will drive the learning: the Euclidean distance of x minus mk, with mk being the centroid of the k-th cluster. So that's basically it: we have to minimize this, because we want to have similar stuff in the same group. Yes. If you use other norms, you will get different behavior. For example, you can use the L1 norm. Sorry, nobody screams when I make mistakes: this is the L2 norm, power two; the L1 norm, power one, is the simple absolute difference. You can do L1, you can do L2, you can do cosine similarity; people try all different kinds of things. By default, most implementations use the Euclidean distance; it's one of the parameters you can play with. How can I measure in high dimensions? We have mentioned this and will mention it again: if you go super high dimensional, your Euclidean distance sucks; it collapses. Then we have to play with other measures and see whether maybe cosine similarity gives us something. But if you are super high dimensional, K-means may not be the best choice; maybe we have to reduce the dimensionality with PCA before we give the data to the poor, simple K-means algorithm. Yes? Sorry, again? No, it is not deterministic. One of the problems of K-means is that every time you run it, you may get a slightly different result, depending on many factors: data, parameters, approximations. But the variation in the result is not huge.
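The objective just described, summing over all K clusters and then over the members of each cluster, can be sketched directly. A minimal illustrative version with made-up numbers, using the squared Euclidean distance to each cluster's centroid m_k:

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances of each point to the
    centroid m_k of the cluster it is currently assigned to."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
               for p, l in zip(points, labels))

# two points, both in cluster 0, centroid sitting between them
j = sse([(0.0, 0.0), (2.0, 0.0)], [0, 0], [(1.0, 0.0)])  # 1 + 1 = 2.0
```

Driving this number down over the iterations is the learning.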
It depends on whether you have well-separated data or not. So here we minimize the sum of squared errors of each point to its prototype, in each cluster. Why do we call it an error? Because the distance tells me how far you are deviating from the cluster prototype, the representative of the class, which is the average of the class. Ideally everything would collapse onto the average, but then the data would not be useful; we need a little bit of diversity. Otherwise we get only one SUV manufacturer, one color, one truck, and one sports car, and we don't want that. We want a little bit of dispersion, but not so much that I can no longer distinguish between an SUV and a sports car; they may get close, but not too close. Okay. Talking about the update, which is the core of K-means: it is unbelievable what you can do with a simple algorithm like this. When we do the update, each centroid becomes the average of all x belonging to Ck, for k being 1, 2, 3, up to capital K, however many clusters you have: 3, 4, 10, 20, whatever. So you update like that. The only question that remains is stopping. You start with some initialization, you calculate the distances, you establish your objective function, you do the update, and okay, how long should I keep doing this? Stopping is always a problem; the same question comes back when you have a neural network: when do you stop? How many iterations are enough? First, you can stop after a set number of iterations. You can say: you know what, after 5,000 iterations, stop; give me whatever you have. Second, we can stop when the centers, the centroids, don't change anymore, maybe not in a hard way, but when they don't change significantly.
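The update step and the "centroids stopped moving" criterion might look like this. An illustrative sketch; keeping an empty cluster's centroid where it is happens to be one common choice, not the only one:

```python
def update_centroids(points, labels, centroids):
    """Move each centroid to the average of its current members.
    A centroid that lost all its members simply stays put here."""
    dims = len(points[0])
    new = []
    for j, c in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            new.append([sum(p[d] for p in members) / len(members)
                        for d in range(dims)])
        else:
            new.append(list(c))
    return new

def converged(old, new, tol=1e-6):
    """Stopping criterion 2: no centroid moved more than tol."""
    return all(abs(a - b) <= tol
               for co, cn in zip(old, new)
               for a, b in zip(co, cn))

old = [[1.0, 1.0], [11.0, 11.0]]
new = update_centroids([(0, 0), (2, 0), (10, 10), (12, 10)],
                       [0, 0, 1, 1], old)
```

Here the two centroids move to (1, 0) and (11, 10), the means of their members, so `converged(old, new)` is still false and the loop continues.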
If you visualize it: I create a centroid here, and in every iteration I see it moving, moving, moving toward the actual center, and then it gets here and it's not moving anymore, just jittering around a little bit without making any significant progress. So I stop: the centroids got home, they are not changing location anymore. Or third, which could well amount to the same thing, stop when few or no data points change cluster. This is quite easy to observe: when you create your centroids randomly and start updating them, you see data points changing cluster. The truck becomes a sports car, the SUV becomes a truck. Yes, you hear how stupid that sounds, but we make those kinds of mistakes. And then after a while you see that no car is changing cluster anymore: we converged. You can plot the number of data points that switched cluster: it starts large, comes down, and gets to zero, ideally. Sometimes points still go back and forth for really tough cases; then you have to say, okay, it's oscillating. If things are oscillating, stop; you are just wasting time. It goes back and forth because the glass is half empty or half full, it's 50-50. The algorithm tries: okay, I give it to SUV; the sports car cluster complains. I give it to sports car; the SUV complains. Back and forth, back and forth. Stop. When you oscillate, and trust me, we oscillate a lot, and we will also oscillate when we get to neural networks, you have to recognize it and you have to stop. So what are the problems of K-means? You can find implementations of K-means everywhere: Python, Matlab, Java, JavaScript, whatever, so you can easily use it. The big problem is that it needs K: somebody has to tell K-means how many clusters to look for, and maybe I don't have that knowledge. Isn't that part of the supervision?
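Stopping criterion three, counting how many points switched cluster between two iterations, is trivial to track (illustrative sketch):

```python
def cluster_switches(old_labels, new_labels):
    """How many data points changed cluster in this iteration."""
    return sum(o != n for o, n in zip(old_labels, new_labels))

# one "truck" became a "sports car" between these two iterations
s = cluster_switches([0, 0, 1, 1], [0, 1, 1, 1])  # -> 1
```

Plotting this count per iteration gives exactly the curve described above: large at first, then falling, ideally to zero; a count that bounces between small nonzero values signals oscillation.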
It is. So we are not that intelligent? No, we are not. We can recognize the clusters nicely if you tell me how many there are; if you cannot tell me, then okay, we have to do some other stuff. Second, K-means is outlier sensitive. Outlier sensitive means: you have a cluster here, you have a cluster here, and you have a data point over there. That's an outlier, and it drives the conventional, traditional K-means crazy. So if you know you have outliers, you have to be careful about using K-means. And third, K-means does hard clustering. Hard clustering means: take any data point, say this one, and say we have two clusters; K-means has to assign it to exactly one of them, which can be nonsensical. Say we have class one and class two. It would be nice if every data point came with a vector saying 0.9 and 0.1, meaning this data point belongs to class one with 90% probability and to class two with 10%. But that's not what K-means does; it says 1 or 0. It's hard clustering. Any time you only have Boolean, one-or-zero memberships, you're basically losing information. So K-means, as a hard clustering algorithm, in some situations does not follow the principle of graceful degradation, and that is behind the oscillation I mentioned. When you get to cases where a point is 50% here and 50% there, the literature has an example for that. Now I'm drawing chaotic things; let me come back to this example. There is the so-called butterfly example, a well-known example in the literature: you have two classes, and for these points and those points everything is clear, but this point is exactly in the middle. So what do you do?
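What a soft membership vector would look like can be illustrated with a simple inverse-distance weighting. To be clear, this is not what K-means does, and it is only a stand-in for how fuzzy clustering techniques formalize the idea properly; the centroid positions are made up:

```python
import math

def soft_membership(point, centroids):
    """Illustrative soft assignment via inverse distances.
    NOT K-means: it shows what a 0.9/0.1 or 0.5/0.5 answer
    looks like instead of a hard 1/0 assignment."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(point, c)))
             for c in centroids]
    if any(d == 0.0 for d in dists):      # sitting exactly on a centroid
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    inv = [1.0 / d for d in dists]
    total = sum(inv)
    return [w / total for w in inv]

# the butterfly point, exactly midway between two centroids
m = soft_membership([5.0, 0.0], [[0.0, 0.0], [10.0, 0.0]])  # -> [0.5, 0.5]
```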
Glass half empty or half full? Take a pick, flip a coin. If you only have 1-or-0 assignments, that's difficult, because K-means has to commit this point fully to one side; you have to make a decision. That's why we got Łukasiewicz logic, which said: sometimes you have to say "I don't know," so 0.5. This is hard clustering; if I do soft clustering, it will tell me 0.5, 0.5, it's 50-50, I don't know. That's an honest answer. K-means cannot give it; we will come to talk about some techniques that can. Okay, good. K-means, in spite of its performance on many different data sets, may appear too simple to qualify as an AI or machine learning technique, but it is one; nobody says you have to be complicated to qualify as a capable machine learning technique. So let's go back and think about the fact that clustering is unsupervised learning, and the emphasis is on that prefix: unsupervised. There are many techniques, yes, from the data science field, and when I say data science, I mean conventional techniques that don't necessarily have explicit learning, but do have iterations. Fundamentally, we don't talk about those classification or clustering techniques here. So, if this is about learning, why not use some processing units? Some people call them neurons, but we're not prepared yet to talk about neurons, so let's call them processing units, and use them to place centroids on an adjustable map, which some people call a self-organizing map. That's 1985, if you remember: more or less at the same time as backpropagation, and independent of it. It is a neural-network type of approach, but people didn't perceive it that way; the terminology was a bit different, and nobody realized, as we will discuss, that this is the case for this approach. So what was the hypothesis?
What was the idea, the claim, of self-organizing maps, or SOM, one of the most neglected AI techniques at the moment? I don't know why; maybe people don't understand it, but it's so simple, like K-means. The hypothesis is that the model, your solution, self-organizes based on learning rules and interactions. The model, the hypothesis, the solution, the agent, the piece of software, the algorithm, whatever you want to call it: it organizes itself. Oh, that's a fancy phrase to use. I would say K-means organized itself, too: you give it some crappy random centers, and it turns them into really nice averages. That's self-organization. But okay, can we put it on a map? Why a map? The human brain has a lot of maps: retinal maps, a somatosensory map, a motor map, and so on. So it's an evocative word, and if you can imitate some of those functionalities, you may have a case. Then, processing units: if you have no idea what that means, it's the smallest piece of software that can process some numbers, which in the human brain is the neuron. The processing units maintain proximity relationships as they grow. Apparently they have to grow. Well, there are several things that are not clear at the moment, but maybe we'll learn. So first the fancy phrase of organizing yourself, then some learning rules and interactions, and then there are processing units that maintain proximity as they grow. Okay, that's a different terminology that came in, but you need good terminology when you're establishing something new. If you want to establish something new, you have to have your own words; you cannot use other people's words and present a new algorithm. People call their algorithm an algorithm; you call yours a machine. Okay: support vector machines.
As long as you can back it up, you're fine; everybody will try it and say these machines are fantastic. But if you cannot back it up, people make fun of you: machines? What do you mean? You have two vectors; what is the machine part? Okay, so that was 1985, and this came through the so-called Kohonen map. The Kohonen map is quite simple. There were predecessors before the Kohonen map, as there were predecessors before K-means; that's the rule. Nothing you can point to is completely new. Nobody has ever said anything entirely new; whatever you say has existed before in some form. So what is the innovation part? Well, you have to find something, rethink it, and adjust it to today; that's the innovation part. Last century we had two people with genuinely new ideas, Norbert Wiener and Albert Einstein; everybody else was just rethinking, pretty much. So, okay: I put some processing units on a map, which some people may call a lattice. Let me put, I don't know, nine, and then I have some inputs. I'll just draw one, because otherwise it gets really messy. So I have one input, say, and this input is connected to all processing units. It's really messy, I don't think I can do it justice, but that's the idea: the input gets connected to all processing units. I'm hesitant to use the word neuron, because we don't know what a neuron is yet, so let's just call them processing units. And these connections all have some numbers that we call weights, and the weights can change. Okay, so what do you want to do with it? This is what we call a synaptic connection, again loosely inspired by the human brain, or the nervous system of any other animal. We think we are so special; most likely we are not. There are other animals, they are smart, they have central nervous systems, they learn. And these units are an array of post-synaptic neurons.
Okay, so I'm using that word neuron for the first time, just to throw the terminology at you. I have some circles, and I have an input, and the input is connected to the circles. Yes. I was trying, with my childish drawing techniques, to show that this is a map, and the input comes from below and gets connected to every unit. So this sits in space: the map is like this, and the connections come up like this. Apparently I failed, okay, good. The input is connected. It's good that we do this exercise, because this was before we, as a research community, got backpropagation, and nobody called this a neural network; therefore I'm hesitant to use the terminology of neural networks. But it is one. So the input is connected with each unit, which we can call a neuron, of a lattice, to make it mathematical, which we can also call a map. You have a lattice of processing units. A processing unit is a function: you give me a number, I put your number through my function, and I give you another number at the output. Like a processor, except that processors crunch millions of numbers, while a processing unit usually crunches a very small number of inputs. Okay. Before we go on, for the concept of a lattice we also have to mention the concept of neighborhoods. If I have a one-dimensional problem, a row of processing units, and I'm looking at this processing unit, I may define a neighborhood saying these two are the neighbors of this one. Why is that important? I used fancy words like post-synaptic neurons and synaptic connections, but it's just a number that we adjust; the number itself is not magic, the way we adjust it is where the magic happens. So why is the neighborhood important?
Because we assume, as in the human brain, that the circuits responsible for doing one thing grow together, and they don't show discontinuity. If you visualize the synaptic strengths of neurons, you don't see a neuron like this and then a neuron like that right beside it. Not going to happen, because things grow together; there is a smoothness in the learning. So if you define a neighborhood, you are bringing in the constraint that units have to grow together. You cannot have a neuron here that is the best while the neurons beside it are terrible; that won't work, because you would have discontinuity in your learning procedure. It's like jumping from the tenth stair to the first: you cannot do that, you will break your bones; you have to go down step by step, smoothly. That's what the concept of neighborhood is about. And of course we can do that on a lattice. Now I'm drawing an arbitrary lattice and connecting the units, because that's a lattice; I may have some cross-connections in some cases, but not here. Let's say this is my neuron in focus; now I have a 2D lattice, and I can say: you know what, this is my neighborhood. I can define the neighborhood. If I don't define a neighborhood, the units cannot grow with each other. Okay. Possible neighborhoods: this is 1D, this is 2D; go 3D, go to whatever dimension you like. So what is the goal? The goal is to find weight values such that adjacent units have similar values. Again, we want a smooth transition: if this is the peak neuron, the weights come down on this side, and come down the same way on the other side. If a neuron has a high synaptic value, the surrounding ones have progressively lower ones; I cannot get a spike like this. That cannot happen in learning; it would just kill the iterations. You need smooth transitions.
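The neighborhood on a 2D lattice can be made concrete. Here is an illustrative sketch using a radius-1 square neighborhood (Chebyshev distance); this is one common choice among several, and practical SOMs often use a smooth neighborhood function that shrinks over training:

```python
def lattice_neighbors(pos, rows, cols, radius=1):
    """Units within `radius` grid steps of `pos` on a rows x cols
    lattice (Chebyshev distance), excluding the unit itself."""
    r, c = pos
    return [(i, j)
            for i in range(max(0, r - radius), min(rows, r + radius + 1))
            for j in range(max(0, c - radius), min(cols, c + radius + 1))
            if (i, j) != (r, c)]

center = lattice_neighbors((1, 1), rows=3, cols=3)  # 8 neighbors
corner = lattice_neighbors((0, 0), rows=3, cols=3)  # only 3 neighbors
```

When a unit wins, its neighbors get updated too, which is exactly the "growing together" constraint described above.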
So units — neurons — have to grow together. We can formulate that mathematically: you need piecewise continuity. If your error function is not piecewise continuous, you just fall into a black hole and cannot recover. Also, inputs are assigned to units that are similar to them. What does that mean? An input comes in, and I want to find a neuron whose weights have the same values as this input. So you want to copy it? Yes, but I'm not copying just one: as a processing unit, I want to copy as many inputs as possible, and I am competing with every other processing unit for them. Maybe I'm a little bit gracious in my own neighborhood: if I win the lottery, I may go to my neighbors and say, okay guys, come on, be around me, I won. But I will not go to the other street — I don't know the people on the other street, why should I share my winnings with them? So what this means: when an input comes in, and this is the neuron that best represents it, the weights going into this neuron are really similar to the feature values of this specific input. You are trying to become similar to your input. You are becoming your data. Well, that's exactly what recognition of structure in the data means: if you can become similar to the data, you will recognize the boundaries. And each unit becomes the center of a cluster. What? So if you tell me you need two clusters, this may be one cluster and this one the other. This would be a big neuron with high values, and this another big neuron with high values. Just imagine a three-dimensional Gaussian centered on this neuron: high values that then fall off. That means there are many, many inputs that this neuron is representing, and many, many inputs that the other one is representing, if I have only two clusters. But wait a minute.
Isn't that what K-means did? Yes. What's your point? While we created centroids and let them move around in K-means, now you are just nailing them down. Yes. What's your point? Isn't that the same thing? Yes, it is. You come from that door, I come from this door; maybe we identify different structures. So K-means is basically constrained SOM. We didn't know that until fairly recently, when some people sat down and mathematically showed: oh, actually, this is the same thing as K-means. So it's constrained: these are the same averages we talked about, the same centroids we talked about. But if you tell me two, I will find two. You need a bigger lattice to represent them, though: if you are looking for two clusters, you cannot have just two neurons here, you need maybe ten. If you are looking for ten clusters, you may need a hundred neurons. So the size of the map becomes an issue. But fundamentally, when you nail down your averages in K-means, you get SOM. And SOM is much cooler than K-means: you can visualize it, it's fantastic. On Thursday, Siobhan will show some stuff with SOM. In K-means there is nothing to see — you watch the averages wander around and then they go home. It's the same thing, but when you do it with SOM, it becomes cool. It's one of those things. But we have to know that this is the same stuff, even though it doesn't look that way. Okay. So now I want to formulate that. Given input x — x could be a vector, could be a matrix, I don't want to go into detail, it depends on what the data is — find the i-th neuron or unit with the closest weight vector, by competition. That's what SOM tries to do: given any input vector x, find the neuron or unit that is closest to it. So calculate the distance — again, you have to calculate the distance — and say: this neuron has weights that are very similar to my feature values.
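One way to see the claimed connection, sketched under an assumption the lecture only implies (online, one-sample-at-a-time updates; all function and variable names are mine): if the SOM neighborhood is shrunk to the winner alone, one SOM weight update is exactly one online K-means centroid update.

```python
import numpy as np

def som_step(weights, x, eta, neighbours):
    """Move every unit in the neighborhood toward the input x."""
    w = weights.copy()
    for j in neighbours:                 # only units inside the neighborhood move
        w[j] += eta * (x - w[j])
    return w

def online_kmeans_step(centroids, x, winner, eta):
    """Online K-means: move only the closest centroid toward x."""
    c = centroids.copy()
    c[winner] += eta * (x - c[winner])
    return c

w = np.array([[0.0, 0.0], [1.0, 1.0]])
x = np.array([0.2, 0.0])                 # unit 0 is the winner for this input
# neighborhood = {winner} only -> the two updates coincide
a = som_step(w, x, eta=0.5, neighbours=[0])
b = online_kmeans_step(w, x, winner=0, eta=0.5)
print(np.allclose(a, b))  # True
```

This is only the degenerate case; with a wider neighborhood, the SOM step also drags the winner's lattice neighbors along, which plain K-means never does.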
So I will assign this input to that neuron. We have fixed the centroids, and the data is moving. That's what I said: you come from that door, I come from this door, but we are attacking the same problem, so chances are we meet in the middle of the room and say: hi, hi — why are you here? So that means wᵢᵀx will be maximal, if that's the case. Because when we say that my input x is similar to the weights of the neuron, the dot product has to become maximal, because ideally it's the same vector. It will not become the same vector, but it will become very similar. If it is very similar, that means I have two vectors pointing almost the same way — not two orthogonal vectors, where the dot product disappears. The more aligned they are, the more the cosine similarity increases, so wᵀx increases. So I have to maximize that. In K-means we minimized the distances; here we maximize wᵀx. Same problem. Look, I don't care about the details of K-means and SOM — do we get this, how we attack the same problem with different ideas? Same concept, looked at differently; same tools, distance measurement, applied here or there. That's the design process we should learn. For the details, go to Wikipedia; maybe Wikipedia does a better job than I do. Yes? Sorry, again: no, we don't design it, but when we say this, we are basically pushing the weights toward the data, while the centers are sort of fixed, in contrast to K-means. Therefore we call it constrained — and again, nobody screamed. Earlier I said K-means is constrained SOM. What is wrong with that? It's the other way around: SOM is constrained K-means. Nobody screamed. You have to learn to scream. I hate rewards, but whenever I make a mistake and somebody screams, that's a reward of some sort. So: if you take K-means and constrain it, you get SOM. Good that I caught that.
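The point about maximizing wᵀx versus minimizing distance can be checked numerically. A small sketch (assuming unit-length vectors, where ‖x − w‖² = 2 − 2·wᵀx, so both criteria pick the same winner; data here is random for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # normalize each weight vector
x = rng.normal(size=3)
x /= np.linalg.norm(x)                         # normalize the input

winner_by_dot = np.argmax(W @ x)                           # maximize similarity
winner_by_dist = np.argmin(np.linalg.norm(W - x, axis=1))  # minimize distance
print(winner_by_dot == winner_by_dist)  # True
```

Without the normalization the two criteria can disagree, since a long weight vector can have a large dot product with x while still lying far from it.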
Thank you for the question. So we are fixing the centers, which are the post-synaptic neurons, and then we push the weights toward the data. We are not pushing the averages, basically. Yes? No, no. That means K-means is actually more generic: if you put some conditions on K-means — fixing the positions of the centroids — you get SOM. Okay. For each unit j in the neighborhood N(i): i is a neuron, right? I just numbered them — one, two, three, four, five, six, seven, and so on. So I'm looking at neuron number i, which has a neighborhood of the kind we talked about, and now I'm looking at the other units around it. So for each unit j in the neighborhood N(i) of the winning neuron i — there is one neuron that wins, and that winning neuron becomes the centroid of the class, but in its proximity there are other units that are close to the class centroid, not exactly on it — we update the weights of j, which is wⱼ. So if this is my winning neuron i, I have a neighborhood of some sort, and this may be j, and I want to adjust the weights, because that is the learning. Learning is adjusting the weights; learning is playing with the synapses. We will come back to that — this is the first time we mention them. Weights outside of N(i) are not updated. That's a difference to K-means: in K-means, in every iteration, we updated everything. Here I only update inside the neighborhood. And you can easily have a map of 100 by 100 — 10,000 neurons — while you are looking for five clusters. So I have 10,000 units, 10,000 potential centroids, but I'm looking for just five. Why does that work? Because I have smooth proximity around them. Why is that important? Because K-means was hard clustering; SOM is soft clustering.
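A minimal sketch of the update rule just described, with illustrative names: only units j inside the neighborhood N(i) of the winner are pulled toward the input, and units outside N(i) keep their weights untouched.

```python
import numpy as np

def update_neighbourhood(W, x, neighbours, eta, h):
    """Update only units in the winner's neighborhood.

    h maps a neighbor index j to its neighborhood strength h_ij."""
    W = W.copy()
    for j in neighbours:
        W[j] += eta * h(j) * (x - W[j])   # pull w_j toward the input x
    return W

W = np.zeros((4, 2))
x = np.array([1.0, 1.0])
# Say the winner is i = 1 and N(i) = {0, 1, 2}; unit 3 must not change.
W2 = update_neighbourhood(W, x, neighbours=[0, 1, 2], eta=0.5, h=lambda j: 1.0)
print(W2[3])  # [0. 0.]  -> outside N(i), not updated
print(W2[1])  # [0.5 0.5] -> winner moved halfway toward x
```

With h constant at 1 this is the crudest possible neighborhood; the Gaussian version discussed later makes h fall off with lattice distance from the winner.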
So you have a transition from being something to not being something. Okay, so SOM has three stages, basically. One: competition. Such an ugly word, but here it makes sense. At the beginning, every neuron competes with every other neuron to represent the input — I want to represent the input; no, I want to represent the input. That's the competition, and whoever is closer will win. Then comes adjustment: we will learn, hopefully, that we made a mistake. Second — yeah, we know, competition is really ugly, so let's also collaborate a little bit: collaboration. Where is the collaboration? In the concept of the neighborhood. If I win the lottery, I share with my neighbors. Yes, I competed to win, but now that I'm winning, I'm not a jerk; I will share some of it. That interplay between competition and collaboration is the fingerprint of self-organizing maps, and it has been reused elsewhere — adversarial learning, generative models, and other things build on the underlying concept. And of course, last, you have to adjust the weights: the weight update. You compete, you collaborate, you adjust the weights; you compete, you collaborate, you adjust the weights. I doubt that our brain works that way at the synaptic level, but nobody knows how it works, so that's a guess, a model. Yes? Sorry, again: do we end up with units that are not part of any neighborhood? Usually we have way more units than clusters, so many units may end up with really small weights, and if I visualize the ones with big amplitudes, those small units contribute nothing. Yes, that happens, because we want to figure out the boundaries between the groups, so we need the excess of neurons. Okay, how does the competition work? In the competition, we have to find the most similar unit. That means i(x) is the arg max over j of the similarity between x and the weights wⱼ — or, equivalently, the arg min of the distance between x and the weights.
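The three stages per input — compete, collaborate, update the weights — can be put together into a tiny end-to-end sketch on a 1-D lattice. All sizes, rates, and the synthetic two-cluster data below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(42)
n_units = 10
W = rng.random((n_units, 2))                   # random initial weights
positions = np.arange(n_units)                 # lattice positions of the units
data = np.vstack([rng.normal(0.2, 0.05, (50, 2)),
                  rng.normal(0.8, 0.05, (50, 2))])   # two synthetic clusters

eta, sigma0, tau = 0.5, 2.0, 100.0
for n in range(200):
    x = data[rng.integers(len(data))]
    # 1. competition: the unit with the closest weight vector wins
    i = int(np.argmin(np.linalg.norm(W - x, axis=1)))
    # 2. collaboration: Gaussian neighborhood around the winner's lattice position,
    #    shrinking over the iterations
    sigma = sigma0 * np.exp(-n / tau)
    h = np.exp(-(positions - positions[i]) ** 2 / (2 * sigma ** 2))
    # 3. weight update: pull each unit toward x, scaled by its neighborhood strength
    W += eta * h[:, None] * (x - W)

# after training, the two clusters should be won by different units
i_a = int(np.argmin(np.linalg.norm(W - np.array([0.2, 0.2]), axis=1)))
i_b = int(np.argmin(np.linalg.norm(W - np.array([0.8, 0.8]), axis=1)))
print(i_a != i_b)
```

Note the excess of units the questioner asked about: ten units for two clusters, with the leftover units settling near the boundary or onto low-traffic regions.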
So look at your input, look at the distance between the input and the weights for each neuron j, and see whether you can find the minimum — I want to be as similar as possible. You can go with the Euclidean distance, but if you have a matrix, we really implement it as a dot operation, because that is fast. And then we know again: if things are similar, their dot product will be high; if they are orthogonal, they are very different, and we don't want that — we want similar stuff. Yes? At the beginning, probably, but depending on how you define the neighborhood, we will actually prevent that, because we don't want too much overlap. Some overlap, but not much, because we want the distinction. So this j goes from one up to M, M being the number of units, which is the size of your map: how many neurons do you have? That is usually one of my biggest headaches when we experiment with a self-organizing map. You say: okay, I have 120 classes or clusters — but how big should the map be? I don't know. Start with a thousand by thousand, so you have a million neurons. A million neurons for 120 clusters? I have no clue; start and see what happens. Then we have collaboration. The winning means you want to be as similar as possible to the input. At the very beginning your weights are random, but then we start updating, and hopefully at some point we come up with a smart weight update. Competition, collaboration through neighborhoods, and then the update. So we use the lateral distance d_ij between the winner unit i and the neighbor unit j. We have a function that looks at the distance d_ij — which you can again calculate with the Euclidean distance — and it is a Gaussian: h_ij = exp(−d_ij² / (2σ²)).
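The remark about implementing the distance as a dot operation can be sketched like this: since ‖x − wⱼ‖² = ‖x‖² − 2·wⱼᵀx + ‖wⱼ‖², and ‖x‖² is the same for every j, the winner search reduces to one matrix-vector product instead of an explicit distance loop (sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(1000, 16))     # M = 1000 units, 16 features each
x = rng.normal(size=16)

# naive: loop over all units and compute each squared distance
winner_loop = np.argmin([np.sum((x - w) ** 2) for w in W])

# fast: drop the constant ||x||^2 and use one matrix-vector product W @ x
winner_dot = np.argmin(np.sum(W ** 2, axis=1) - 2 * (W @ x))

print(winner_loop == winner_dot)  # True
```

The squared norms of the weight rows can additionally be cached between inputs, so per input the search really is just one dot operation plus an argmin.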
So if I choose something like this, I'm choosing a Gaussian neighborhood, which is a wise thing to do, because again we need a smooth transition: perfect weight, a little bit less perfect, still acceptable — okay, now you are out of the neighborhood. So you have the distance d_ij, you work with a zero mean, and you have a Gaussian that peaks at one: this is my h_ij(d_ij), and this width here is two times the standard deviation. We use a Gaussian, of course, because we know all its characteristics; it's nice, we can design it, we can center it anywhere we want. With that sigma, the standard deviation, you define how generous you are: how big is the neighborhood? Do you want to collaborate more, or do you want to compete more? So this is a Gaussian neighborhood. Is that the only possibility? No. Pretty much everything I give you here is just the standard configuration, and you can play with everything: with the size, with the function, with the configuration. So what is sigma here? That's a parameter you have to set. How do I do that? Well, usually sigma is a function of the number of iterations: you start with an initial value σ₀ and put it through an exponential decay, σ(n) = σ₀ · exp(−n/τ), where n is the number of iterations and τ is a constant. At the beginning the job is tough: I need the help of my neighbors, so I share. Toward the end, I don't need the help of my neighbors, and I don't want them in. So maybe at the beginning I start with a wide Gaussian and then shrink it, shrink it, shrink it over time. At the end I will not share credit with anybody, because toward the end we are getting the cluster centers and I want to be the winner. So the collaboration at the beginning is a lot, but it decreases over time.
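A small sketch of the Gaussian neighborhood with the shrinking width; the formulas are the ones on the board, while σ₀ and τ are illustrative values you would tune:

```python
import numpy as np

def sigma(n, sigma0=3.0, tau=100.0):
    """Neighborhood width after n iterations: sigma(n) = sigma0 * exp(-n / tau)."""
    return sigma0 * np.exp(-n / tau)

def h(d_ij, n):
    """Gaussian neighborhood strength: h_ij = exp(-d_ij^2 / (2 * sigma(n)^2))."""
    s = sigma(n)
    return np.exp(-d_ij ** 2 / (2 * s ** 2))

# The winner itself (d_ij = 0) always gets full strength ...
print(h(0.0, n=0))          # 1.0
# ... a neighbor at lattice distance 2 gets less, and the same neighbor
# gets even less later, because sigma shrinks: lots of collaboration at
# the beginning, almost none toward the end.
print(h(2.0, n=0) > h(2.0, n=200))  # True
```

Swapping in a different decay schedule, or a non-Gaussian neighborhood, is exactly the kind of "play with the configuration" the lecture mentions.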
So any type of exponential decay works: collaboration is high and comes down rapidly as a function of the number of iterations — a lot at the beginning, not much later. So more collaboration at the beginning. You are already at 654? How is that? Okay, I was slow today, wasn't I? We needed five more minutes to finish, but we will do that on Thursday, wrap it up, and then move into classification. On Thursday we also have the tutorial on self-organizing maps.