Okay, good afternoon everybody. So I thought we were done with clustering and classification and we would start with neural networks, but I want to cover one more topic in clustering and classification before we actually move to the foundations of neural networks and talk about the perceptron and regression and things like that. Mainly because we talked about clusters, the question of validity will arise, which is a sort of Turing test. If I apply K-means, and K-means does its magic with distances, or SOM does the same thing in a different way, how do I know that the groups they find are really good? How do I test that? Cluster validation is a big part of machine learning for any type of technique: you get a group of clusters and the question is, are they good enough? We usually don't ask this question of classifiers like SVM, because they give us a guarantee: "I'm giving you the optimal result, nobody can do any better." So this is not for the supervised ones, because if the supervised ones are doing their job, they match what the supervisor said the output should be. If they can guarantee that they converge, they have already passed the test. But clustering techniques are a bit different. So the main question is: how do we know the clusters are valid? How do I know it's not a bogus algorithm? How do I know these are really the best clusters I can get — or at least good enough? If you cannot tell me they are valid, can I at least know they are good enough for me to work with? Yes, K-means has been around for 50 years and SOM for 40, but my data may be complicated. I may not have enough data. My data may be noisy. I may have missing points in my table. There can be many, many problems. How do I know that the clustering technique is working properly?
So what is desirable from our perspective? Generally we want high inter-class separation. The distance between the companies that collapse tomorrow and the companies that will be very successful tomorrow should be really large, and the distance between the class of dogs and the class of cats, when I'm doing image classification, should be really large. They should not be close to each other; if things are close to each other, I can mistake them for each other. So, high inter-class separation: we want clusters to be far apart from each other. That's desirable; that's a general thing to ask for. And of course we want high intra-class homogeneity. Within a class there should not be many different things: if I have a class of dogs, they should be very similar to each other. If they are not similar to each other, then my attributes are not good, my features are not good. If I have a group of successful companies, they should be very similar to each other. If I have a group of people who survived chemotherapy, they must have similar attributes. Why? That's the structure we are looking for; if there is no such structure, what are we doing? So what we want is high inter-class separation and high intra-class homogeneity. If we get this, this is a sort of objective Turing test. Because don't forget, every time you run K-means or SOM you get different results. It will not be gigantically different, but this group may be here one time and over there the next, and so on. So how do I know which one is the best? We need to define a sort of validity index that uses two things. First, I want some sort of sum of squares within the cluster, and we will call it SSW. Within the cluster, my items should be similar to each other. I calculate squared distances — when we say squares, we mean distances; distance is our business.
Sum of squares means sum of squared distances, which means Euclidean distance. People don't mention it because we are used to it: you say sum of squares, it means Euclidean distance. If I calculate the distances between all the dogs in my class, it should not be a big number. So the sum of squares within the class is one measure, SSW, and we will also create something like the sum of squares between clusters, SSB. The statistics people know these; they have been in use for quite some time. You can apply them when you group things conventionally, or when you are using machine learning — it doesn't matter who grouped the items together, the way we validate them is the same. So let's define them. SSW is simply SSW = sum over i = 1 to n of ||x_i − c_p(i)||², where you have n data points, x_i is your data instance — the current measurement — and c_p(i) is the class prototype (center, mean) for the data instance x_i. I'm building the differences between all the data points I have and their corresponding centers. So if this is a cluster, this is my c_p(i), and I have many, many data points in this cluster, what I'm doing is measuring everything from the center, from the prototype, and adding it all up: the sum of squares within the cluster. How different is every dog from the most prototypical dog? Of course, if the cluster center is not really a good center, that means the clustering did not do its job and things will be messed up. You need a point of reference to measure how homogeneous the group is; the group has to be homogeneous.
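The SSW formula above can be sketched in a few lines. This is a minimal illustration, not a library implementation; the names `ssw`, `X`, `labels`, and `centers` are my own, and the toy data is made up so the result is easy to check by hand.

```python
import numpy as np

def ssw(X, labels, centers):
    # Sum of squared Euclidean distances of each point x_i
    # to the prototype c_p(i) of the cluster it belongs to.
    return float(sum(np.sum((X[labels == k] - c) ** 2)
                     for k, c in enumerate(centers)))

# Toy data: two tight clusters around (0, 0.5) and (10, 10.5).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 0.5], [10.0, 10.5]])
print(ssw(X, labels, centers))  # each point contributes 0.25 -> 1.0
```

Each of the four points sits 0.5 away from its center, so SSW = 4 · 0.25 = 1.0 — small, as we want it.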
This is not the place for diversity; here we don't want diversity, we want homogeneity — ideally everybody the same. But if everybody were the same, everybody would converge toward the center and collapse, and you'd have one big egoistic megalomaniac in the center; we don't want that either. You want a little bit of diversity, because we have different instances of dogs. So this is our SSW, the sum of squares within the cluster, and it has to be small — if the clustering technique is up to the task. Assume somebody knows the perfect location of the mean, the prototype of each cluster; assume it's fixed. Can you find it? That's the point. You can calculate the distances, and you cannot get below a certain level — that's the nature of the data — but can you get there? Then we have SSB, the sum of squares between the clusters: SSB = sum over i = 1 to m of n_i · ||c_i − x̄||², where you have m clusters, n_i is the number of elements in cluster C_i, c_i is the current class mean (prototype, center, centroid), and x̄ is the mean of the means: every cluster has a mean, and you take the mean of all the cluster means. So now I'm looking at the sum of squares between clusters; I want to see how separate the clusters are. Say I have cluster A, cluster B, cluster C — which is very close to A — and cluster D over here. With A, B, and D I'm okay, but C, I don't know, it's too close to A. If I measure this, I get a number that gives me some idea of how far apart A, B, C, and D are, and we want them far apart. Ideally, perfect clustering is: you have m clusters, you take the Milky Way galaxy and place them on its outer circumference, and each one is densely packed. Wishful thinking — not going to happen — but can you at least go in that direction? These two, SSW and SSB, are part of ANOVA, the analysis of variance. Has anybody used ANOVA? It's right in Excel. People use it all the time — hospitals, finance people. It's embedded in tools like Excel and any other financial spreadsheet program; it's not something new, we have been doing this for 40, 50, 60 years. But suddenly these measures become more significant, because now I have autonomous clustering and I don't know whether it's doing the job or not. So I go back to the old guys and ask: what did you do back then? Back then the data was fixed; many of the examples in statistics come from biology, because biology is where you can get a lot of data — it's the nature of things: how many flowers do I have, how many leaves does this type of tree have, and so on. You put that in, you do your analysis, you classify, sometimes even manually. Now we can use the same wisdom for machine learning. So SSW and SSB are quite common in statistics; they are not new, people have been using them all along. What about other cluster validity measures? There are many. For example, we have the Calinski-Harabasz index — these are basically just formulas that I'm throwing at you, maybe one or two of them. The CH index is (SSB / (m − 1)) / (SSW / (n − m)). So now we are playing with the two fundamentals. This is very common in computer science and statistics: we establish really basic stuff, like mean and standard deviation and variance, and then we start playing with them.
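SSB and the CH index can be sketched the same way. A caveat: following the lecture, x̄ here is the mean of the cluster means; the classical Calinski-Harabasz definition uses the grand mean of the data instead, which differs when clusters have unequal sizes. Function and variable names are my own, and the toy numbers continue the SSW example above (where SSW = 1.0).

```python
import numpy as np

def ssb(labels, centers):
    # SSB = sum_k n_k * ||c_k - xbar||^2, with xbar the mean of the
    # cluster means ("mean of means"), per the lecture's definition.
    centers = np.asarray(centers)
    xbar = centers.mean(axis=0)
    counts = np.bincount(labels)                 # n_k for each cluster
    return float(sum(n * np.sum((c - xbar) ** 2)
                     for n, c in zip(counts, centers)))

def ch_index(ssw_val, ssb_val, n, m):
    # Calinski-Harabasz: (SSB/(m-1)) / (SSW/(n-m)); larger is better.
    return (ssb_val / (m - 1)) / (ssw_val / (n - m))

labels = np.array([0, 0, 1, 1])
centers = [[0.0, 0.5], [10.0, 10.5]]
print(ssb(labels, centers))             # 200.0
print(ch_index(1.0, 200.0, n=4, m=2))   # (200/1)/(1/2) = 400.0
```

A large CH value is exactly the combination we asked for: big SSB (separation) over small SSW (homogeneity).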
What happens if I put this here and that there and multiply them together? Then you have to justify it; you have to investigate the properties of the formula you are putting forward. We want both criteria — within-class and between-class — in one equation, and this is one way of doing it. Why this way? Well, I ran experiments and it gave me good numbers. Another one is the Hartigan index: H(m) = (SSW(m) / SSW(m + 1) − 1) · (n − m − 1). Now we are getting creative. m was the number of clusters, so what about comparing: what separation did you give me for m clusters, and what do I get for m + 1? Remember, for K-means we have to choose the number of clusters: k = 5? Okay, what about k = 6? So one thing we can do with SSW and SSB is get some sense of how many clusters make sense for my application. Sometimes you also see the other version of this, log₂(SSB / SSW). So there are different things we can do, and you see the pattern: we tend to combine both measures, because we want those two desirable properties in one — within the cluster, everything compact, dense, similar to each other; between the clusters, as much distance as possible, clusters very different from each other. Then there is what is maybe the most expensive one, the Dunn index, which tries to define things differently. The Dunn index says: you take the minimum over i = 1 to m, and then the minimum over j = i + 1 to m (yes, thank you — i plus one), of the distances d(C_i, C_j) between all pairs of clusters, and you divide that — now we are getting really intensive with calculation — by the maximum over k = 1 to m of the diameter of C_k. And what are those, d and diameter? The distance between C_i and C_j is the minimum Euclidean distance ||x − x′|| where x comes from C_i and x′ comes from C_j. So you take pairs of data points from two different clusters and find the minimum. In that class C that I just drew very close to class A, there will be members of the neighboring clusters that get really close to each other. I want to get a sense of that: how many cases like this do I have? These can be confusing cases where you misclassify things. If you create groups with many situations like this, the big problem is that two clusters are close — and maybe you cannot do anything about it, maybe not. But maybe, who knows, the cluster shape should actually be something different, and these two belong to the same class. We don't know; I can only know if I compare it with the other options I have. And the diameter: diam(C_k), for any given class, is the maximum Euclidean distance ||x − x′|| where x and x′ both belong to the same class C_k. And you are doing that over everything — a lot of calculations, a lot of distance measurements. Now, this is the built-in validation we talked about at the beginning, when we said: you take the data and break it into two parts, training and testing, but inside training you still take a small part for validation. After the training is done you have to give the solution to the world and you cannot change it anymore, so whatever you can do, do it within the training session to make sure what you have is good.
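Here is a sketch of the Dunn index exactly as defined above. The name `dunn_index` and the representation (a list of point arrays, one per cluster) are my own choices; note the nested pairwise loops — this is the "lot of distance measurements" the lecture warns about, O(n²) in the number of points.

```python
import numpy as np

def dunn_index(clusters):
    """Dunn index: min inter-cluster distance / max cluster diameter.
    `clusters` is a list of (n_k, d) arrays of points; larger is better."""
    def d_min(A, B):   # d(C_i, C_j): min distance over pairs from two clusters
        return min(np.linalg.norm(a - b) for a in A for b in B)
    def diam(A):       # diam(C_k): max distance over pairs within one cluster
        return max(np.linalg.norm(a - b) for a in A for b in A)
    m = len(clusters)
    num = min(d_min(clusters[i], clusters[j])
              for i in range(m) for j in range(i + 1, m))
    den = max(diam(C) for C in clusters)
    return num / den

clusters = [np.array([[0.0, 0.0], [0.0, 1.0]]),
            np.array([[10.0, 10.0], [10.0, 11.0]])]
print(dunn_index(clusters))  # sqrt(181) / 1: well-separated, tight clusters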
We will do a lot of calculation, and distance measurements are not cheap. Euclidean distance in high dimensions, when you have 500 features — wow, it will consume everything we have. But we do this during validation, as part of training, not after we have given the software to the user. The user does not see this; it happens in my lab, in my R&D department. This is quality control: when I'm working at a new company and I want to make sure what I put forward is good enough, I do this to check the classifier I'm deploying. This is my first big project after six months in the company, my first serious work, and I want to put my best foot forward. If it takes two days to calculate, so be it — and then I can document everything and say: we did this. There is also something much simpler that many people work with; we simply call it the WB index. The WB index as a function of m, the number of clusters, is WB(m) = m · SSW / SSB. Another example where everybody cooks with water: what you can play with is SSW and SSB, because they are the quantification of the real phenomenon, and there is no escape from the real phenomenon. So I go back and say: measure this, measure that, now divide, now add, now multiply. Of course SSW, the sum of squares within the cluster, should ideally be very, very small, and SSB should ideally be very large, so this fraction — even times the number of clusters — should be a small number. If I have two alternatives, I go with the one that gives me the smaller value. So I validate my clustering technique, pass the Turing test with SSW and SSB, and now I can use it: this is the best I can do, there is no other way. Cluster validation is a big field within machine learning; anything I cover here in ten minutes, you have to multiply by a thousand.
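The WB index is simple enough to write in one line, and the "pick the smaller value" rule follows directly. This is a sketch; the numbers below are made up purely to illustrate the comparison between two candidate clusterings.

```python
def wb_index(m, ssw_val, ssb_val):
    # WB(m) = m * SSW / SSB -- smaller is better:
    # small within-cluster scatter, large between-cluster scatter.
    return m * ssw_val / ssb_val

# Two alternative clusterings of the same data (illustrative numbers):
wb_a = wb_index(2, ssw_val=1.0, ssb_val=200.0)   # 0.01
wb_b = wb_index(3, ssw_val=0.8, ssb_val=150.0)   # 0.016
print("A" if wb_a < wb_b else "B")               # A wins: smaller WB
```

Note that m multiplying SSW penalizes adding clusters, so WB can also help choose the number of clusters, like the Hartigan index above.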
It is a subfield; with SSW and SSB you have just scratched the surface of cluster validation. Good to know, because if you want to go deeper, you start here and then discover, oh my god, there is so much other stuff — but this is a good place to start. So now we have some confidence. But we have other problems too — so many problems. We made a big assumption. The assumption is that x_i belongs to C_k and x_i does not belong to C_j for all j ≠ k. That is the assumption we just made for classification and clustering. What is that assumption telling us? That if a data instance belongs to a class, then that data instance does not belong to any other class — the class membership is true or false, zero or one. That was the reason we could pull off the SVM magic: we assumed it's binary, yes or no, positive or negative. This is called hard, or crisp, clustering — people call it different things. So the membership of x_i belongs to {0, 1} — that's a set. Or put the other way: the membership μ of x_i to any class — I can write C_k or C_j, it doesn't matter — is not in the closed interval [0, 1]. It is not 0.2, it is not 0.75; it is 0 or 1. That's a big assumption to make — a gigantic assumption — for data that is noisy, that is hyperdimensional, that has outliers. So what do we do with this? The problem is much deeper than that. The problem is that AI is supposed to deal with imperfect information. Okay, what is imperfection? If the data were perfect, you wouldn't need AI. But if the data is imperfect, what does that mean? Imperfect means hard, tough. Imperfection means either you are dealing with uncertainty or you are dealing
with vagueness — very different things. Take a natural event like rain: is raining an uncertain event or a vague event? Well, it depends. On what? I have an event horizon: before something happens and after something happens. Before it rains, after it rains; before the market crashes, after the market crashed. (Yes, this one — this means the membership of any data point you have is either 0 or 1, not a number between 0 and 1; that's hard clustering, crisp, true or false, black or white. And this μ — mu is the Greek letter we use for membership, so μ of something is the membership of something.) So: before it rains, after it rains. If you ask me, "will it rain tomorrow?", I look at the weather network and tell you: with a probability of 60% it will rain. This is uncertain. Who can say with 50% likelihood it will rain tomorrow? The probability that an airplane in which we are flying fatally crashes and I die is roughly one in 35 million passenger flights per year. One in 35 million — wow, that's safe. But if you're sitting in the one airplane that will crash, you will curse the hell out of probability theory, because it doesn't help you that you're sitting in the one that is crashing. So probability is responsible before things happen, when things are uncertain. When things are uncertain, we use probability theory — a very powerful tool; over 250 years it has brought us a lot of things, it has brought us to the moon — if the moon landing was not a hoax, I don't know, it was fantastic. But now it is tomorrow, and we see that it is raining. There is no uncertainty; we can look out the window, it is raining. So what is the problem there? We do not agree: is it a heavy rain? Is it a drizzle? A light rain? What type
of rain is it? So the problem you're dealing with is that you want to classify the rain with respect to its intensity; there is no uncertainty about that. Here you cannot use probability theory. After things happen, we generally use fuzzy logic, because things become a matter of degree. What is the likelihood that the next person who enters this room through that door has a red t-shirt? We wait and wait and wait; somebody enters the room, and he has a red t-shirt. And then we do not agree: what type of red is it? A dark red, a light red, vermilion? We need that information; we want to classify, to sub-classify. When I am looking at it there is no uncertainty anymore — I see it is red — but I cannot agree on what type of red. So the imperfect information that AI tries to handle is most of the time one of these two, and sometimes, if we're in luck, a combination of both. Either we are dealing with uncertainty — we don't know what will happen, we try to guess, to predict, to estimate — or, most of the time, we know; it's right in front of us, and we cannot agree on what it is, what its intensity is. Two different theories for two different things. Now let me go on a little, because I want to give you another example for clustering and classification, and what we will do is remove this assumption — I don't like this assumption — and talk about another clustering technique that is simple and really capable. But for that, let's talk a little bit — not too much — about set theory. And if that is boring for you, just look at your cell phones: without set theory, those wouldn't exist. We have all this because of Cantor, the mathematician who laid the foundation of set theory — and now we have everything: our entire computer system is based on 0 and 1, which is the set-theoretical framework. In set theory we work
with a universe of discourse, which has some instances in it. Universe of discourse is a fancy name for the set of all values your variable can take. If you're talking about temperatures measured on planet Earth, what is the universe of discourse? Minus 60 to plus 60? The universe of discourse has to contain everything, if your variable is temperature. Can I go from minus 70 to plus 70? Who has experienced plus 70? Minus 70 to 70, I'm fine; that's my universe of discourse, it contains everything for that variable. Then we have a set A, which is a subset of X. This is high school stuff, but okay, it's fun. How do we write a set? I can just list its elements: A = {a, b, c}. Or I can state the property of the elements in my set: A = {x | x ∈ ℕ}, the natural numbers. Or I can say: A is given by its characteristic function, f_A(x) = 1 if x belongs to A, and f_A(x) = 0 if x does not belong to A. This is called the characteristic function of A; every set has one. Notice I didn't write μ, I wrote f, because this characteristic function is a function, so I'd rather use f. There are other ways, but these are the most common ways you give a set to somebody else. What is the set of pleasant temperatures? I say: 18, 19, 20, 21, 22, 23, 24, 25, maybe 26 — you just list them. Or you say: the set of temperatures x such that x is larger than 18 and smaller than 26 — you give the property. Or: the characteristic function is 0 if the element is less than 18 or greater than 26, and 1 if it is in between. So we can describe sets in many different ways. Why is that important? This is our business; this is what we do. Because A could be a set, or it could be an event.
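The three ways of writing a set all come down to the characteristic function, which is trivial to sketch in code. The names `characteristic` and `pleasant` are my own; the temperature range follows the lecture's example.

```python
def characteristic(A):
    # Return the characteristic function f_A of a crisp set:
    # f_A(x) = 1 if x belongs to A, else 0.
    return lambda x: 1 if x in A else 0

pleasant = set(range(18, 27))   # pleasant temperatures 18..26, listed
f = characteristic(pleasant)
print(f(22), f(35), f(17))      # 1 0 0
```

Note the output is always exactly 0 or 1 — nothing in between. That is precisely the restriction fuzzy sets will drop.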
Probability theory works with events; set theory works with sets — they have different businesses. Some people still don't get the difference, even after 300 years of the one and 60 years of the other, but if we do get it, we can come up with some nice things for machine learning. Okay, let's go back to high school and talk about Venn diagrams. This is my universe of discourse, I have a set A and a set B, and whatever is in between is the intersection of A and B: A ∩ B. Of course all of these are subsets of X. When I classify, I don't want any intersection between classes — just a little indication of why this is relevant to us. Then, again with the universe of discourse X and my sets A and B, if I'm talking about everything together, this is the union A ∪ B, which again is a subset of X: all elements that belong to A or to B. And we have X and A, and you ask me: if this is A, what is not-A? Not-A is the negation of A; we write Ā, everything that is not in A. Now, in the Venn diagram, nobody told us in high school that we were making the big assumption that things are crisp, true or false. So you are either old or young. I don't like that logic, especially being 53 — I still want to have one foot in the young domain, but according to Boolean logic I'm old. You are about 50? You're old. Well, I want to be young to 20% or something. Not-A is important for us, because if these elements belong to this set, this class, this group, with what certainty can I say they do not belong to the other class? We have bigger problems: when you are dealing with AI, you may be fighting giants and have no idea that you are doing it. Now, we have some logical laws; we have had these laws since antiquity.
They are more than 2,000 years old; when you talk about these laws, it's about Aristotle and Plato, many people before them whom we don't know, and many people after them. One is the law of non-contradiction, which means: A ∩ ¬A = ∅. The set of rainy days and the set of not-rainy days have nothing in common. The set of successful companies and the set of bankrupt companies have nothing in common. The set of young people and the set of old people have nothing in common — if old is the negation of young. Without the law of non-contradiction we wouldn't have digital technology; this is not an exaggeration. Without it we could not have come up with circuits, transistors, nothing. If you violate this, we are hoping that quantum computers will help us — good luck with that; I'm not going to experience it, hopefully you will. The second logical law is the law of excluded middle, which means: A ∪ ¬A = X, the universe of discourse. The set of rainy days together with the set of not-rainy days is the set of all days; the group of young people and old people together is the group of everybody. There is no middle, there is nothing else. Okay, whatever makes you happy — but if you violate this, people will get upset. Some people violated it in the 60s, and the result is called fuzzy sets. Fuzzy sets violated this — and not only did they violate it, they violated it at the worst possible time: the early 60s, the beginning of the digital boom. Everyone was so proud of digital technology, and somebody comes along and calls his technology fuzzy — is this guy fuzzy-headed? So: a fuzzy set A is the set of ordered pairs (x, μ_A(x)), where x of course comes from the universe of discourse and the membership μ_A(x) belongs to the closed interval [0, 1]. Blasphemy! This was blasphemy, because it violates those two laws — we can't do this. Or sometimes we just write the fuzzy set as A = ∫_X μ_A(x) / x, the integral over the universe of discourse — which means we just
have an ordered listing; we don't really build the integral, we just show the concatenation of everything — every element comes with its membership. We will see what that means. Okay, let me give you a simple example; now we are down the rabbit hole. Really simple: the universe of discourse is X = {1, 2, 3, 4, 5, 6, 7}, seven numbers. We want to define A as the set of neighbors of 4. The classical Boolean, crisp, binary definition is straightforward: A = {3, 4, 5}. The neighbors of 4 are 4 itself, 3 to the left, and 5 to the right. Do you have a problem with that? If you want to be more generous you can define a bigger circle, like in SOM, and also include 2 and 6 — but you have to keep it consistent: either you are a neighbor of 4 or you are not; make up your mind, because this has to count. Now, if I define A as a fuzzy set, I write the elements 1, 2, 3, 4, 5, 6, 7 in the form of that integral, and then I have to say what their memberships are. For 3, 4, 5 I copy the Boolean guy: 1, 1, 1. The Boolean guy does not need to give me membership values, because everything listed has membership 100% and everything not listed has membership 0 — he has an easy life. But the fuzzy guy has to list everybody every time and say: you know what, this one is also sort of a neighbor of 4, and this one too, and this one — but not at 100%; at 70%, 50%, 60%. So a fuzzy set lists all the elements and also their membership values; that's why we say it is a set of ordered pairs, the element and its membership. Clearly, for this kindergarten example it is much easier to work with the binary technology. Look how much bigger the fuzzy version is; the fuzzy guy needs a GPU and a CPU and an APU and a DPU and every other technology that has not been invented yet.
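The neighbors-of-4 example side by side, crisp versus fuzzy. The particular fuzzy degrees below are illustrative choices of mine (the lecture only says "70%, 50%, 60%"), not canonical values; what matters is the shape — 1 at 4, falling off with distance.

```python
universe = [1, 2, 3, 4, 5, 6, 7]

# Crisp "neighbors of 4": listed -> membership 1, absent -> membership 0.
crisp_A = {3, 4, 5}

# Fuzzy "neighbors of 4": every element paired with a degree in [0, 1].
fuzzy_A = {1: 0.1, 2: 0.5, 3: 0.9, 4: 1.0, 5: 0.9, 6: 0.5, 7: 0.1}

for x in universe:
    # element, crisp membership, fuzzy membership
    print(x, int(x in crisp_A), fuzzy_A[x])
```

Notice the storage difference the lecture jokes about: the crisp set stores three elements, the fuzzy set must carry a pair for every element of the universe.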
So what is this membership? Membership can be similarity; membership can be intensity. Similarity is not a subject of uncertainty — why would you say "I'm not sure whether it's similar or not"? You can put a number on it. Okay, but then can I look at it as probability? You could — and that is the cause of some confusion: you have a number between 0 and 1, and people say, well, that's probability. It depends: was it before the event or after the event? Numbers between 0 and 1 — you just normalize, and who cares what you call it: probability, likelihood, intensity, similarity, fuzziness, vagueness, ambiguity. But if you call it probability you will pick one set of tools to solve the problem, and if you call it fuzziness you will use a very different set of tools — and one of them may not be able to get to the solution, depending on what the problem is. Membership is also approximation — and at the latest here I know this is AI business, because AI is function approximation. Logic is still a big part of it; nothing has changed, we are still doing weak AI, and logic is a big part of AI. And of course membership can be compatibility, and many other things that I'm not writing down. Membership can quantify many different things: intensity, similarity, approximation, probability, likelihood, fuzziness, vagueness, crispness — you name it. Okay, this was high school; this is valid for Boolean logic that conforms to those two laws. Aristotle sees this and he's the happy guy, because the decision boundaries — now we are talking classification — the decision boundaries are clear. There is an exact boundary between things and not-things; the boundary between A and not-A is well defined. That's Boolean logic: true or false, young or old. So what's the point? How can I come up with a fuzzy version of this? I cannot draw this diagram for fuzzy systems,
because then I would need Photoshop to put a gradient here: in the center absolutely black, and as you go away it becomes brighter and brighter, until here it is white — a circle with a color gradient. But we cannot work with that; that's not serious science, that's Photoshop. Okay, let's come up with something scientific. We cannot visualize fuzzy sets themselves; what can be visualized is the membership function. For Boolean logic it is 0 or 1 and we don't need to visualize it — it's right there in the listing; I can visualize the set, but visualizing the characteristic function does not add anything. So: this is my universe of discourse, and I have a membership function μ_A for my fuzzy set A. What is the fuzzy set A? The set of old men, the set of pleasant temperatures, the set of successful companies, the set of people who smoke and do not get cancer — whatever set we are interested in. The vertical axis goes from 0 to 1; the horizontal axis goes from min to max, whatever those are for the variable. And I have another membership function μ_B for the set B. Now I can show A ∩ B through the intersection of their membership values: the pointwise minimum. This would not mean anything in Venn diagrams, because Venn diagrams are for two-valued logic. If I again take μ_A and μ_B and build the union — the pointwise maximum — that is A ∪ B. And negation: if this is μ_A, what is μ of not-A? It is defined by μ_{¬A}(x) = 1 − μ_A(x). Why? In the Boolean case it's clear: if you are not a member you are 0, otherwise you are 1, and so on. If my membership in the young people is 0.2, what is my membership in the not-young people? 0.8, the remaining part. Which means, if I draw μ_{¬A} here — at the latest now, somebody should scream: A ∩ ¬A is not empty!
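The three pointwise operations just described — min for intersection, max for union, 1 − μ for negation — can be sketched directly on the fuzzy neighbors-of-4 set, and the code makes the scream concrete: both classical laws fail. The dictionary of degrees is my illustrative choice from earlier.

```python
# Pointwise fuzzy operations: intersection = min, union = max,
# complement mu_notA(x) = 1 - mu_A(x).
mu_A = {1: 0.1, 2: 0.5, 3: 0.9, 4: 1.0, 5: 0.9, 6: 0.5, 7: 0.1}

mu_notA    = {x: 1 - m for x, m in mu_A.items()}
A_and_notA = {x: min(mu_A[x], mu_notA[x]) for x in mu_A}
A_or_notA  = {x: max(mu_A[x], mu_notA[x]) for x in mu_A}

print(A_and_notA[2])  # 0.5 -> not empty: non-contradiction violated
print(A_or_notA[2])   # 0.5 -> below 1: excluded middle violated
print(A_and_notA[4], A_or_notA[4])  # 0.0 1.0 -> crisp points still behave
```

Only where μ_A is exactly 0 or 1 do the old laws survive; everywhere in between, A ∩ ¬A rises above zero and A ∪ ¬A falls below one.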
Blasphemy! A and ¬A intersect. No cell phones for us, no digital technology — and this is not an exaggeration. It gets worse: if this is μ_A and this is μ_¬A, that is 1 − μ_A, and I build the union A ∪ ¬A, it is not the universe of discourse, because this part is missing — this part is not in A ∪ ¬A. Why is that? If you were dealing with binary Boolean logic, A would be this — binary, 0 or 1 — and ¬A would be that. This is A, this is ¬A; what do A and ¬A have in common? Nothing. What happens when these concepts come into classification and clustering? They are there — we just don't see them. Does that mean we have to revise some of our machinery? Maybe. At least we need additional tools. In some cases our binary logic works perfectly and we don't need to worry; but in some cases maybe you should issue membership cards that say: you are a member of this library to a degree of 65% — instead of gold and platinum and silver and the rest of the nonsense, a degree. That is Łukasiewicz, because he was the first one who said: guys, we cannot do everything with true and false, we need something in between — at least a third value, 'I don't know', a 0.5. Will the overlap disappear if you go back to binary? Yes. Will there be a gap between A and ¬A? Not if you are working with normalized fuzzy subsets. Okay. So where does fuzzy logic itself come from? From the father of fuzzy logic, Lotfi Zadeh. Where does this guy come from? He came from Berkeley. He passed away in 2017 at the age of 96, and in one of the conferences I asked him how he came up with this idea — it doesn't happen that frequently that you meet one of the colleagues who has introduced a whole different theory. I was a student, I was enthusiastic, and sometimes indiscreet, so I asked: how did you come up with this idea? And he said it was the winter of 1960. He was sitting in John F. Kennedy Airport waiting for his flight to Los Angeles, and there was a
snowstorm, and the flights were delayed. He goes to the information desk and asks the lady there: what about my flight? And the nice lady says: don't worry, you don't have a long delay. He goes back and sits looking at the display, and since he had a mixed background of math and engineering, he tries to construct in his mind the set of all flights that do not have a long delay — and he immediately realizes you cannot do this, because you don't know what 'long' means. He said: this idea didn't let me go. I went home, sat down and wrote something, and I said the boundaries are sometimes fuzzy — so I called them fuzzy sets. After two years he sent the paper to many journals and they rejected it — of course, like SVM, like backpropagation, like evolutionary algorithms: if you have a good idea, it will definitely be rejected. After the fifth submission and the fifth rejection he got desperate and went to Richard Bellman — dynamic programming, one of the pioneers of computing — and said: Richard, what should I do? Bellman said: you are an editor of some journals yourself — Information and Control. Send it there; they know you, they cannot reject you. But will they want to publish it? Send it. He sent it to them, and they said: damn it, he is on the editorial board, we cannot reject him. So we got fuzzy logic. Okay, so how do we measure fuzziness? Let's be clear: fuzziness is not a good thing, just as uncertainty is not a good thing. Probability theory is there to remove uncertainty; fuzzy logic is there to remove fuzziness. So an index of fuzziness γ is defined as two over the number of supporting points of your function, times the sum over all i of the minimum of μ_A(x_i) and 1 − μ_A(x_i):

γ = (2 / n) · Σ_{i=1..n} min( μ_A(x_i), 1 − μ_A(x_i) )

For every point you take the minimum between the membership and its negation, you sum these up, multiply by two and divide by the number n of supporting points — because these functions are of
course discrete — one, two, three, four, five and so on — so I have n supporting points. Why times two? Where does that magical number two come from? Wasn't there a two somewhere in SVM as well? This has nothing to do with that, but two is a funny number, so why two here? What is the range of min(μ, 1 − μ)? If the membership is zero, its negation is one; if it is one, the negation is zero; if it is 0.5, the negation is 0.5. So min(0, 1) = 0, min(1, 0) = 0, min(0.5, 0.5) = 0.5: the minimum lives between 0 and 0.5 and can never go above 0.5. That 0.5 is in the definition — fifty-fifty, glass half empty or half full. This is a logical dilemma, a real thing, and the factor two has something to do with that half-full glass: multiplying by two normalizes the index, and you get a number between zero and one. So this is my γ, my index of fuzziness; it goes between zero and one, and in the middle sits 0.5.
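As a sanity check, the index γ just defined can be sketched in a couple of lines of Python; the sample membership vectors are invented for illustration.

```python
# Sketch of the index of fuzziness:
#   gamma = (2 / n) * sum_i min(mu_A(x_i), 1 - mu_A(x_i))

def fuzziness_index(mu):
    """Fuzziness of a fuzzy set given its n sampled membership values."""
    n = len(mu)
    return 2.0 / n * sum(min(m, 1.0 - m) for m in mu)

print(fuzziness_index([0.0, 1.0, 0.0, 1.0]))  # 0.0 -> a crisp Boolean set
print(fuzziness_index([0.5, 0.5, 0.5, 0.5]))  # 1.0 -> maximal fuzziness
print(fuzziness_index([0.1, 0.9, 0.2, 0.8]))  # somewhere in between (~0.3)
```

A crisp set scores 0, memberships stuck at fifty-fifty score 1, and everything else lands in between — exactly the behavior of the curve on the board.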
As Łukasiewicz said, sometimes we have to say 'I don't know.' The index has a shape like this: the fuzziness becomes maximal when the membership is 0.5. Of course. If your glass is 20% full you can still say it is almost empty; if it is 80% full you can say it is almost full — come on, take it, pay the full price. But if it is fifty-fifty you really have to flip a coin. From a logical perspective this is the point of maximum uncertainty; nobody can make a decision. Are there such cases in clustering and classification — objects sitting right on the decision boundary? If yes, then nothing can help us there. So γ is maximal for μ(x) = 0.5, and we don't want fifty-fifty. The entire point of fuzzy logic is to try to push you either in this direction or in that direction — those are the regions we like, because there I can easily make a decision. Don't get close to fifty-fifty, because then I cannot easily make a decision; well, nobody can. Okay. So based on all this, can you give me another clustering technique — one that deals with vagueness and gives up the big assumption that things are true or false, black or white, young or old? We had k-means, and k-means is fantastic, but maybe sometimes k-means cannot do a good job when vagueness is the dominant characteristic of my data set. How do I know that? I don't. So I run k-means and get some results with my SSW and SSB; I run something else and get some results; I do some statistical analysis, and if I go with the second one I say: okay, the nature of the data was rather vague, not uncertain. Good scientists make that distinction. So around the 70s — actually the early 80s — not so long after k-means was introduced, somebody introduced the fuzzy version of k-means and called it fuzzy c-means: to date one of the best clustering techniques we have. I'll jump right into it; I want to give you the pseudocode. First step of fuzzy c-means: you initialize. Why do you
initialize? As for any clustering technique, you need the number of clusters — no difference from k-means, you still need to tell the algorithm how many clusters you are looking for. Then we have a fuzzifier m: a parameter specific to the fuzzy side of things. And then we have a membership function μ. In k-means we didn't have a membership function, because everything was true or false: either you are a member or you are not, and I only list you if you are a member. Now, if you want the concept of fuzziness, you need a degree of membership. Second step: we find and update the cluster centers,

c_i = Σ_{k=1..n} μ_ik^m · x_k / Σ_{k=1..n} μ_ik^m

where n is the number of instances and m is my fuzzifier. So basically a modified version of the membership times the data. Before, the membership was either zero or one, so I would just add up the members. What is the m about? How much vagueness do you have, how difficult is the problem — you can play with it, it's a parameter. Nowadays we would call it a hyperparameter — and it is just this one number. And of course we divide by the sum of the μ_ik^m, the same thing, to normalize. So the center of a cluster is a weighted version of the k-means center: I'm not just building the average, I'm building the weighted average, and the weights are memberships that I defined somehow. Well, how do you define the memberships? What do you think — what is it that we can measure? Distance; there is nothing else. So step three of FCM — by the way, we call it FCM — is: update the memberships. There has to be a learning procedure involved, otherwise it's not AI. So where do the memberships come from? If this is the first step, then,
like everything else, we initialize them first — I didn't write it here... well, I did: you initialize the membership values. How? Randomly assign some numbers to them; in steps two, three and four we adjust them and make them more meaningful. The membership update — and this is one of those handcrafted things where you have to sit down and ask why it was done this way — is

μ_ik = 1 / Σ_{j=1..c} ( d_ik / d_jk )^{2/(m−1)}

where c is the number of clusters, the d's are distances to the cluster centers, and m is again my fuzzifier parameter. So I'm saying: the inverse of a sum of distance ratios is my membership, and for all of them you get a number between zero and one — you will never get exactly zero or one; that's the nature of the equation. But how do we come up with such an equation? Why this one? There is no deep philosophy behind it. You sit down, you put in your intuition and everything you know — the domain knowledge — you try to put it into an equation and justify it, and then hopefully it works; then you come up with some fancy adjustment, and again it has to work. If it works you say: well, that's my equation — and everybody buys it. This was somebody's PhD; this was a PhD thesis of 43 pages. Then step number four: we need a stopping criterion. How do I stop? You build the difference — the Euclidean distance — between your current partition and the previous one, U_current minus U_before, where U is your fuzzy partition. What is that? Every element in your data set has c membership values. If you grab an x, and I am looking for, say, four classes, x will be represented by 0.1, 0.2, 0.6 and — what comes next? — 0.1. So I have four numbers for this one x; put them together for all elements and you get your partition. Every measurement of that big file I get for training — columns being features, rows being measurements — if I'm looking
for four classes, gets four numbers. So this is FCM: if you put them all together for all measurements, you get your fuzzy partition — it is a matrix — and then you compare this matrix with the previous one. Did things change substantially? If not, stop; you have converged. If I did the same thing for k-means, I would get rows like 0 0 1 0 — there is no real partition per se. They are very different things, yet they do the same job. Interestingly, there are applications where, if you apply k-means and then apply FCM, you get more or less the same result. Of course: one comes through this door and the other comes through that door, but it is the same room, the same data — with a reasonable approach you get similar results. So that was one example; there are other techniques for clustering. When I start on a problem, I start with k-means; FCM is also available off the shelf. I run FCM, compute some SSW and SSB values, take a look at the numbers, see which one is better, and continue with that — there is no way to say ahead of time. That's the idea, and hopefully we have done justice to the concept of clustering and classification. Next lecture we continue with regression and jump into neural networks. See you later.
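Putting the four steps together, here is a minimal sketch of FCM in Python. It assumes one-dimensional data for simplicity; the toy data set and the choices c = 2, m = 2 and the tolerance eps are illustrative, not prescribed by the lecture.

```python
import random

def fcm(data, c=2, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means sketch for a list of 1-D points."""
    rng = random.Random(seed)
    n = len(data)
    # Step 1: initialize the membership matrix U randomly (rows sum to 1).
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    for _ in range(max_iter):
        # Step 2: cluster centers = membership-weighted averages of the data.
        centers = []
        for i in range(c):
            w = [U[k][i] ** m for k in range(n)]
            centers.append(sum(wk * data[k] for k, wk in enumerate(w)) / sum(w))
        # Step 3: update memberships from ratios of distances to the centers.
        U_new = []
        for k in range(n):
            d = [abs(data[k] - ci) or 1e-12 for ci in centers]  # guard /0
            U_new.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                    for j in range(c))
                          for i in range(c)])
        # Step 4: stop when the fuzzy partition barely changes.
        diff = max(abs(U_new[k][i] - U[k][i])
                   for k in range(n) for i in range(c))
        U = U_new
        if diff < eps:
            break
    return centers, U

data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]  # two obvious groups on a line
centers, U = fcm(data)
print(sorted(centers))  # the two centers end up near 1 and 8
```

Each row of U holds the memberships of one point in all clusters and sums to one — exactly the fuzzy partition that the stopping criterion compares between iterations; running k-means on the same data would turn each row into a crisp 0/1 indicator.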