Yes, I was talking about standardization and normalization. If you have a variable taking the values x1 to xn, a way of standardizing it is to write new values yi as yi = (xi - x̄) / sx, where x̄ is the mean of these n observations and sx is the standard deviation of x1 to xn. I hope all of you know the meaning of standard deviation; its square gives you the variance. This is a way of standardizing a variable.

Now, what is the specialty of this? Once you do it, the mean of the yi, that is ȳ = (1/n) Σ yi with i running from 1 to n, is equal to 0, and the variance of y is equal to 1. You are making the mean 0 and the variance 1. In the example I was telling you about, the variable ranging from 0 to 100, since the variation is larger you get larger distance values; once you standardize, its variance becomes 1. So if you have a variable whose variance is, say, 0.5, after you do this it becomes 1, and if you have another variable whose variance is, say, 100, that also becomes 1. One variable has variance less than 1 and the other greater than 1, but after standardization both have variance 1 and mean 0.

This is the generally accepted way of standardizing a variable. But whether you do standardization or not depends on the problem that you have. I am not saying that you should do it every time, nor am I saying that you should never do it; it depends on your problem whether you will
do it or not.

If you do standardization, then your variance-covariance matrix becomes the correlation matrix: all the diagonal elements will be 1, and each off-diagonal element will be the correlation between the corresponding two variables. I hope all of you know the meaning of the word correlation. We generally do principal component analysis on the variance-covariance matrix; after standardization you may try to do it on the correlation matrix.

That was standardization. Normalization is this: again you have x1 to xn. Find the maximum of the xi values and the minimum of the xi values, and write the new variable as yi = (xi - min) / (max - min). When xi is the minimum you get the value 0, and when xi is the maximum you get the value 1, so all the values lie in the interval 0 to 1. This is slightly different from standardization: there the mean is 0 and the variance is 1; here all the values lie in the interval [0, 1]. This is a way that many people follow for normalization.

One can think of a few more ways of doing these transformations; these are not the only ones. Whether you need to do it for your data set or not depends on you, and if you do it, which particular transformation you use, and why, again depends on the data set. With this I will
end the lectures on classification. Now I will go to clustering, that is, unsupervised classification.

Please. "One modification: can that range be changed?" Yes, you can do that. Instead of making it 0 to 1, you can make it -1 to 1. A way of doing it: this is from 0 to 1, so multiply by 2 and it becomes 0 to 2, then subtract 1 and it becomes -1 to 1. So you need not always make it 0 to 1; you can think of one particular interval and keep everything confined to that interval.

Please. "Sir, can you give an example where doing standardization and normalization will not be useful, or may not be useful?"

Yes. Basically, there are places where you need the information on the variances to be preserved. When you standardize, you are reducing or increasing the values of the variances; generally it is a reduction. In those places where you want the information on variances to be preserved, you should not use it. There are many examples where the variance provides a lot of information about the phenomenon under consideration, and those are the places where you should not standardize.

The places where you are forced to standardize are those where some variables have smaller variances but have the capability of distinguishing between the classes, and you do not want those variables to be submerged by a variable that has larger variance but no capability of distinguishing between the classes. Those are the places where you want the variable with larger variance to have a smaller one, and the one that
has smaller variance to have its voice heard. That is where people try to do it: if they feel that the problem under consideration is like that, they do the standardization.

One of the places where you want the variances to be preserved is the example about classifying Punjabis and South Indians, where we are taking the height. I am assuming that Punjabis generally have larger heights, but there are also many Punjabis who have smaller heights, so probably the variance is larger for the Punjabi class than for the South Indian class. There the variance information should be preserved; you should not change it. In the other example I gave, the one about length, you need to change it, because the variance of that variable is very, very small and you need that variable to be heard.

Many of these things I will try to discuss when I give the lectures on feature selection, and I will connect this with that. Ultimately these are connected to feature selection. Given a data set, whether you do standardization or normalization depends on the problem at hand, and you need to decide which features give you information about that problem; that is the problem of feature selection. So the problem of feature selection and the problem of standardization and normalization are very much connected.

Other questions, please. "Sir, when we are doing standardization, is the topology of the data preserved?" What is your meaning of the word topology? "How it is distributed, the structure." I am not getting it; please explain. "The scatter of the data, the distribution of the classes: is that preserved?" I am not understanding, but what I can tell you
is this: for two different variables, the reduction or increase is not in the same proportion. I think that is your doubt. For two different variables the increase or decrease is in different proportions. "So if you have different proportions, the distribution information will be lost when you do standardization? How the points lie in the feature space, that is not preserved? Suppose you are scaling down equally..."

You see, there are many words, for example the word topology, that in my opinion should be used very sparingly, because they contain a lot of things. Probably in tomorrow's lecture I will be talking about topology, and there I will tell you the meaning of open sets, closed sets, closures, compactness, connectivity; all these things are there in topology.
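The standardization and min-max formulas from this part of the lecture can be tried out on a small example. What follows is only a sketch under stated assumptions: the data values are made up, the function names (`standardize`, `min_max`, `covariance`) are hypothetical, and the standard deviation is taken with a divisor of n, which the lecture does not specify. It checks that standardized variables have mean 0 and variance 1, that their covariance equals the correlation of the raw variables, and that distances between observations shrink by different factors.

```python
import math
import statistics


def standardize(xs):
    """Return y_i = (x_i - mean) / s_x, with s_x the population standard deviation."""
    mean = statistics.fmean(xs)
    s_x = statistics.pstdev(xs)  # assumption: divide by n, not n - 1
    return [(x - mean) / s_x for x in xs]


def min_max(xs, lo=0.0, hi=1.0):
    """Map min(xs) to lo and max(xs) to hi; the defaults give the interval [0, 1]."""
    x_min, x_max = min(xs), max(xs)
    return [lo + (hi - lo) * (x - x_min) / (x_max - x_min) for x in xs]


def covariance(xs, ys):
    """Population covariance (divide by n) of two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Made-up observations: one variable with a small spread, one on a 0-100 scale.
small = [0.1, 0.4, 0.2, 0.9, 0.5]
large = [10.0, 80.0, 35.0, 60.0, 95.0]

zs, zl = standardize(small), standardize(large)
print("means:", statistics.fmean(zs), statistics.fmean(zl))              # both close to 0
print("variances:", statistics.pvariance(zs), statistics.pvariance(zl))  # both close to 1

# After standardization the covariance equals the correlation of the raw data.
corr = covariance(small, large) / (statistics.pstdev(small) * statistics.pstdev(large))
cov_std = covariance(zs, zl)
print("correlation:", corr, "covariance after standardizing:", cov_std)

# Min-max normalization to [0, 1], and the same values rescaled to [-1, 1].
print(min_max(large))
print(min_max(large, -1.0, 1.0))

# Distances between observations change by different factors, so they are
# not preserved, not even up to one common scale.
raw = list(zip(small, large))
std = list(zip(zs, zl))
ratio_01 = math.dist(std[0], std[1]) / math.dist(raw[0], raw[1])
ratio_03 = math.dist(std[0], std[3]) / math.dist(raw[0], raw[3])
print("distance ratios:", ratio_01, ratio_03)
```

The two distance ratios differ because each variable is divided by its own standard deviation, which is the "different proportions" point made in the answer above.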
So when you use the word topology, I do not know which of these you mean. The word topology is used by computer scientists in one way: if the distances are somehow preserved, then we say that it is topology preserving. Am I right? [The student's reply is partly inaudible; it concerns points at the same distance around the mean.] They are not going to preserve the distances; that is what I said. They change in different proportions.

If you have two variables, then for both of them the mean becomes zero and the variance becomes one. I was giving you an example yesterday where one variable has variance two and another has variance one. If you look at the scatter plot with the covariance zero, it looks like an ellipse. Now that two and one become one and one, and covariance zero means there will be a circle. So what has happened? The distances have changed.

I am not saying that in every situation you need to do the standardization, nor am I saying that in every situation you should not. You need to make a decision whether you will standardize or not, and in many problems that decision is imperative. Other questions? [noise] Okay.

I had been talking about classification methods; now I will discuss clustering. In clustering, the basic problem is that you are given a data set and you really do not know how many groups there are in it. I am using the word group loosely; it is not the group that you see in the theory of groups, rings and fields. I am using the word group the way an ordinary man uses it. For example, look at the two data sets that I have given here. In this data
set you would say that there are two clusters: this is one cluster and that is another; or there are two groups of points, this is one group and that is another group. In this data set, again, you would say that there are two clusters: this is one cluster and that is another cluster. Now, this is something we say when we look at the data set, and you want the same thing to be concluded by your algorithm. Your algorithm should automatically state, first, that in these two examples the number of clusters is two, and second, once it says that the number of clusters is two, which points form the first cluster and which form the other. The name of the subject is pattern recognition: you are given a data set, we are able to recognize the patterns, and you want your algorithm to recognize them automatically.
Now let us see what the difficulties are. If you want to state the problem, it looks like this: you are given a data set S, maybe in m-dimensional space; the number of clusters K may not be known; and then there is the choice of a similarity or dissimilarity measure, and of an algorithm.

What is this choice of similarity or dissimilarity measure? When you say this is one cluster, you somehow feel that these points are similar to each other, or in some way connected to each other, and the same is the case with that cluster. You have some sort of similarity relationship, and you need to state that relationship mathematically. In many books the problem of clustering is stated as: find groupings of the data so that within every group the points are similar to each other, and from one group to another you find a lot of dissimilarity. So the words similar and dissimilar need to be properly defined, and that is this point, the choice of similarity and dissimilarity measure.

Then the number of clusters. Is it the case that someone gives you the number of clusters and, once it is given, you find them? Or in some problems do you need to find the number of clusters on your own? It is not necessarily true that someone always gives you the number of clusters. Let me tell you an example where you may not know the number of clusters. The example I will state is again related to satellite
images and to land cover types. My examples are based on the Calcutta data, because that is the place where I reside. I always ask my students how many land cover types are present in Calcutta. They would say water is surely one land cover type; then you have concrete structures, that is, built-up areas; you have vegetation, where you have some paddy fields; and there is open space, meaning barren land, not in Calcutta itself but in the suburbs, and there is one place in Calcutta called the Maidan that is also just barren land. So these are the four main land cover types.

Then I start asking them questions: you have pond water and you have river water; do you think there is any difference? Initially they would say no, all water should be the same. What actually happens is that when the satellite takes photographs, it gives two different signatures for pond water and river water. It is true of Calcutta, and I assume it to be true even of this place. You have the river Cauvery, right? I have not seen the data set, so I have no way of judging it, but I presume that the pixel values of the Cauvery river would be different from the pixel values of the ponds.

This is a mistake that people usually make: that all water should be the same. That is not correct. Surely seawater is different from river water, and in most cases river water is different from pond water. Here in Madras you have the sea also. We all know that seawater tastes salty, whereas pond water or river water does not, and you would see that they have different signatures. So you cannot put the
water as just one cluster; you need to have it broken into two or three components. They may not possess the same signature: the source that is giving you the values may not give the same value for all these types of water. So this is one thing that you need to keep in mind. You may think that you know the number of clusters, but that may not be the case. You might have made some assumptions in saying that you know the number of clusters, and they may not be true; you need to verify those things. So this is an example where you may not know the number of clusters even though you know the place and the whole area.

So the number of clusters K may not be known. In fact I can tell you very many examples where the number of clusters is not known; that was just one. I can tell you one example relating to the Telugu language, my own mother tongue, concerning Telugu vowels. We say a, a, e, u, a, o; these are the vowels, right? But there is also one vowel that we always use that is not mentioned in the textbooks. There is a word called Tataku; there the vowel is neither one a nor the other, but something in between. When ISI did a project on Telugu vowel recognition, this particular sound came out as a separate vowel. So there again the number of groupings was not known, even though the mother tongue of most of the persons conducting the experiment was Telugu. You assume that you know these things, but you may not, and ultimately you may need to change.

So once you somehow get K, and once you somehow get the similarity or dissimilarity measure, the next step is that you need an algorithm to do the clustering. Before I proceed to all these things, let me ask you a
question: given a data set, do you think the clustering is unique? I will repeat: given a data set, is it possible to get two or three equally meaningful clusterings? The answer, unfortunately, is yes. Given a data set, you can get many meaningful clusterings.

I will give you an example. Probably all of you know what playing cards are: taash in Hindi, and the same in Bangla. You have 52 cards. If you ask a child to cluster these 52 cards, probably the child will divide them into two clusters: all black cards and all red cards. If you ask a slightly grown-up child, probably he will put them into four clusters: clubs, diamonds, hearts and spades. Some other child may put all the face cards on one side and the other cards on the other side. Some other person may put all the aces in one cluster, then kings, queens, jacks and so on, and have 13 clusters. So I have told you four different clusterings. I am asking you: which one is better, and why? Can you give me an answer?

So given a data set, you can have several equally meaningful clusterings. When someone gives you a data set and asks you to do the clustering, usually the person has some specific property in mind, and you need to cluster the data based on whatever he or she tells you. I hope you are getting the point: it is not necessarily true that you have a unique clustering of the data.

So the number of clusters you will somehow find. Then there is the choice of similarity or dissimilarity measure. In the example I gave you, the choice is very clear. If someone says all black cards in one cluster and all red cards in the other, the similarity measure is very clear to you. If someone says clubs, diamonds, hearts and spades, four different clusters,
there also the similarity or dissimilarity is very clear to you. Similarly with aces, kings, queens, jacks and so on, 13 clusters. Note that if the similarity or dissimilarity measure changes, then your whole clustering changes. Am I right? So depending on what you want to achieve, you need to define the similarity or dissimilarity measure properly; otherwise you may not achieve what you want. Then the last part is algorithms.

We all know what Euclidean distance is. Before I talk about dissimilarity measures, let me tell you a little bit about what a metric space is. You have a set X, which is not the null set, and a function d defined from X × X to [0, ∞). The symbol d is for distance. We use the word distance very loosely, so in mathematics they have given it a separate word and a definition. The word is metric, M-E-T-R-I-C, and this is the definition: d is said to be a metric on X if these three properties are satisfied.

What are these properties? First, d(x, y) = d(y, x) for all x and y: if you go from x to y, whatever the distance may be, it is the same as the distance from y to x. Second, the distance between x and itself is 0, and if the distance between x and y is 0 then x is equal to y. Third, d(x, y) + d(y, z) ≥ d(x, z) for all x, y, z belonging to X. This property is known as the triangle inequality. Take a triangle: the sum
of two sides is greater than or equal to the third side. Am I right? In fact it is strictly greater than the third side; only then can you have a triangle. If all three points x, y, z are on a straight line, with y between x and z, you will get equality; otherwise you will get strict inequality. This property should hold for all x, y, z belonging to X. A function d satisfying these properties is said to be a metric on X, and such a space is called a metric space.

The spaces that we see, R, R², R³, R⁴, have metrics defined on them, and we use those metrics probably without knowing that they are metrics. On the real line, if you have two values x and y, the distance is measured as |x - y|, and you can show that it is a metric: it satisfies all these properties.

If you are in m-dimensional space and a and b are two vectors, then for a value of p you take d_p(a, b) = (Σ |a_i - b_i|^p)^(1/p), the sum running over i = 1 to m. If you take p = 2, you get the Euclidean distance. If you take p = 1, what do you get? Say a = (a1, a2) and b = (b1, b2) are points in two dimensions; with p = 1 you get |a1 - b1| + |a2 - b2|, this plus this. There is a standard name for it: city block distance.

These are all metrics. With p = 1 you get a metric, with p = 2 you get a metric, and if you take p to be any fractional value greater than 1 you also get a metric. Any integer value greater than 1 still gives a metric: p = 3, 4, 5, 6. With p = √2 you get a metric, with √5, with p = e, and with p = π, which is an irrational number, even then you get a metric. So these are
uncountably many metrics, and even then you are in a position to define some more; it is not the case that I have exhausted all metrics here. A metric is a measure of dissimilarity between two points: distance is a measure of dissimilarity. So there are very many metrics, and the question is, given a data set, which particular metric you should use. In some way or other you need to choose one.

Now I will give you an example where, for the same data set, you may want to use two different metrics. Say there are two students and five subjects, and I give you their marks. The first student got 90, 90, 90, 90, 90 in the five subjects. The second student got 100, 100, 100, 100, 50 in the same five subjects. Your problem is to judge who is better. What you will do is calculate the distance of each student from the ideal one, and the ideal is 100, 100, 100, 100, 100; the student whose distance is minimum you would call better.

Now my question to you is: what distance measure would you like to use? City block? Okay, but let me change that 50 to 60, because with city block the two would come out the same. Then if you use city block, the second student is better: for the first student it is 100 - 90 = 10 five times, so 10 + 10 + 10 + 10 + 10 = 50, and for the second you get 40. But if you take p = 2, that is Euclidean distance, then for the first student it is 10 squared, 100, five times, 500, and here it is 40
squared, 1600. Taking square roots, that is about 22.4 against 40, so if you use Euclidean distance, the first student is better. Which distance would you like to use? It depends on whom you want to call better. We would like to call the second one better, because his total is 460 and the other's is 450, and 450 is less than 460.

But suppose this mark is 40; say he has just passed this subject. Can you think of some way of calling the second one better? In our usual subjects, mathematics, physics, chemistry, algorithms, data structures, networking, we probably do not want to call the second student better; we would call the first student better. But suppose your five subjects are learning five different ragas of Carnatic music. You know that mastering a raga is extremely difficult. This person has mastered four ragas, so probably you would call him the better candidate, because he has the potential of mastering the fifth, whereas the other person could not master any one of them. If the subject is about learning the ragas of Carnatic or Hindustani music, you would probably want to rate him the better student.

Your choice of distance measure depends on the problem at hand. How you say that someone is close to or far away from another is very much problem dependent. Is that understandable? Can we take a small break?

So the choice of distance measure depends on the problem at hand and on what the problem tells us; the problem actually forces us to calculate the distance between the quantities in a certain way. Since you are trying to model the problem at hand, you need to define your dissimilarity and similarity measures according to the situation and the constraints that you are facing. In the previous slide I was mentioning a dissimilarity measure, and in this slide I am
mentioning a similarity measure. The similarity between two m-dimensional vectors a and b is Σ a_i b_i / (√(Σ a_i²) √(Σ b_i²)), with the sums over i = 1 to m. Can you tell me what this similarity measure is? It is the cosine of the angle θ between the two vectors, where (a1, a2, ..., am) is one vector and (b1, b2, ..., bm) is the other. How do you calculate cos θ? This is the formula for it. In fact, your expression for correlation came from this. Cos θ has existed in mathematics for a long, long time; statisticians used it to get the expression for the correlation coefficient, and with that they said that it lies between -1 and 1. Cosine values lie between -1 and 1.

So this is a similarity measure. If θ is 0, cos 0 is 1 and you have maximum similarity. When θ is π/2, cos(π/2) is 0 and the vectors are perpendicular. When one vector points one way and the other is completely opposite, that is 180 degrees, you get -1, and there you have the maximum dissimilarity. That is the basic feeling when you use this similarity measure. This is just an example; there are many other similarity measures, and dissimilarity measures, in the literature.

Note also that the distances I gave on the slide are all known as L_p norms: for p = 1 it is called the L1 norm, for p = 2 the L2 norm. By the way, can you tell me the value of this as p goes to infinity? It is called the L-infinity norm, and it is also a metric: it is the maximum of |a_i - b_i| over i = 1 to m.

Now, if you look at one of my previous lectures, where I multiplied the
differences by some matrices, I called one of them a positive definite matrix. You attended those lectures? That is one way of looking at distances: you keep √((x1 - y1)² + (x2 - y2)²) as it is, and first you start adding some weights to each of these terms, and then ultimately you write a matrix. There also you are generating many metrics.

Then there is the Mahalanobis distance, which Mahalanobis defined way back in the 1930s; the paper was published by the Calcutta Mathematical Society. I am not going to talk much about it, because he will be discussing all these things. There also you get (x - y)′ Σ⁻¹ (x - y), or (μ1 - μ2)′ Σ⁻¹ (μ1 - μ2), which is again a quadratic form, and Σ⁻¹ is a positive definite matrix. He denoted it by Δ², with a square, so that we can talk about its square root: if Σ⁻¹ is the identity matrix, you get the square of the usual distance. So you can go on generating metrics in different ways, depending on the problem at hand.
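The distance and similarity measures from this part of the lecture can be compared on the two-student marks example. This is a minimal sketch, not the lecturer's code: the function names (`minkowski`, `chebyshev`, `cosine_similarity`, `quadratic_form`) are hypothetical, and in `quadratic_form` the matrix M, which stands in for Σ⁻¹, is assumed to be supplied directly so that no matrix inversion is needed.

```python
import math


def minkowski(a, b, p):
    """L_p distance (sum |a_i - b_i|^p)^(1/p); a metric for every p >= 1."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def chebyshev(a, b):
    """L-infinity distance max_i |a_i - b_i|, the limit of L_p as p grows."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; lies between -1 and 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def quadratic_form(x, y, m):
    """(x - y)' M (x - y) for a square matrix M given as a list of rows.

    With M equal to the inverse covariance matrix this is the squared
    Mahalanobis distance; with M the identity it is the squared Euclidean one.
    """
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    return sum(d[i] * m[i][j] * d[j] for i in range(n) for j in range(n))


# The marks example, with the last mark of the second student set to 60.
ideal = [100, 100, 100, 100, 100]
first = [90, 90, 90, 90, 90]
second = [100, 100, 100, 100, 60]

# City block (p = 1) ranks the second student closer to the ideal;
# Euclidean (p = 2) and L-infinity rank the first student closer.
for name, dist in [("city block", lambda a, b: minkowski(a, b, 1)),
                   ("Euclidean", lambda a, b: minkowski(a, b, 2)),
                   ("L-infinity", chebyshev)]:
    print(name, dist(first, ideal), dist(second, ideal))

# Cosine similarity for parallel, perpendicular and opposite vectors.
print(cosine_similarity([1, 0], [1, 0]),
      cosine_similarity([1, 0], [0, 1]),
      cosine_similarity([1, 0], [-1, 0]))

# With the identity matrix the quadratic form is the squared usual distance.
identity = [[1, 0], [0, 1]]
print(quadratic_form([1, 2], [4, 6], identity), math.dist([1, 2], [4, 6]) ** 2)
```

Which ranking is "right" is not something the code can decide; as the lecture stresses, that depends on the problem at hand.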