So, in this lecture I shall discuss a few topics which are very much related to pattern recognition techniques and which are, in some sense, becoming part of pattern recognition itself. The topics are data condensation, which is one of the topics, and feature clustering, which is again a topic in pattern recognition but these techniques are now being used vastly in related subjects like machine learning; and there is one more thing about which I would like to talk, data visualization, which is in some way necessary for understanding the behaviour of the data, although there are some problems and some troubles in visualizing data in higher dimensions, about which I would like to discuss a little.

Initially let me talk about data condensation. As you can see on the screen, suppose the data set is huge. Dealing with all the points in the data set might result in too many computations, so one may want to reduce the size of the data set; and there are also other reasons, for example data sets often contain redundant data. So what we would like to do is to replace a large data set by a small subset of representative patterns, so that the performance of pattern recognition algorithms trained on the reduced set is comparable to that obtained when trained on the entire data set. That part is understandable: if we have labelled data we can apply some classification technique, and if we have only unlabelled data we might do some clustering; instead of doing it on the entire data set we would like to do it on the reduced data set.

Replacing a large data set by a small subset of representative patterns is the main aim, and there are a few points which need to be discussed here. Number one, what is the size of the reduced set? If our total data set has, let us say, 10000 points, we can reduce it by half, that is, keep 5000 points in the reduced set; or we can keep 2000 points, 1000 points, 500 points, 100 points, 50 points, maybe even just two points. Number two, as we go on reducing the size, we naturally apply the usual pattern recognition techniques on the reduced set and we would like to have the same performance as we get with the original data set. When the reduction is small, the performance can be expected to be very close to the actual performance; when the reduction becomes more and more, say from 10000 down to 1000 or 500, the performance may not be as good as what we expect, because we have removed too many points. And when you go from 10000 points to just two points or so, that is simply a vast reduction.

Now, we have already done something like this earlier. Reducing 10000 points to two points is what we did in clustering: if you take K equal to 2 and divide the data set by doing two-means clustering, then ultimately, when the algorithm converges, you get two means, and these two means can be taken, in some sense, as representatives of the whole data set. So we have already done this; we only did not use the terminology data condensation at that time, we called it clustering.
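To make that concrete, here is a minimal sketch of K-means with K = 2 used as this extreme form of data condensation, assuming Euclidean distances and plain Lloyd-style iterations; the random seeding and the stopping rule are illustrative choices of the sketch, not something fixed by the lecture.

```python
import numpy as np

def two_means_condense(X, n_iter=100, seed=0):
    """Reduce a data set to two representative points via K-means (K = 2).

    This is the extreme end of data condensation mentioned above: the two
    converged means stand in for the whole data set.
    """
    rng = np.random.default_rng(seed)
    # start from two distinct points chosen at random as seeds
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign every point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(2):
            members = X[labels == k]
            if len(members):               # guard against an empty cluster
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                          # converged
        centers = new_centers
    return centers

# e.g. 10000 points condensed to just two representatives
X = np.random.default_rng(1).normal(size=(10000, 2))
print(two_means_condense(X))
```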
So actually, on one hand, clustering is one end of this data condensation, and the other end is the entire data set, whatever data set you have; and in between you have several different stages where we reduce the size of the data set.

Now here another term comes to mind, and that is scale. If 10000 points are reduced to two points, then the scale is the ratio 5000 to 1, because 10000 to 2 is the same as 5000 to 1; whereas if from 10000 points you come down to 1000, it is 10 to 1. The same data set, when represented at different scales, looks in some sense different. Let me give you an example. Look at this figure: as the scale becomes coarser and coarser, at the very last stage the shape might look like an ellipse, because, number one, the finer details are missing, and number two, the separate parts are slowly getting joined together; at each stage some more of the details are lost, until ultimately it has become just this.

There is another example. Say we are standing here and we look at the sky at night; we see many stars and we see some patterns there. Now let us say we start travelling towards them. Initially we saw many stars; as we go nearer, some stars start fading away, and for the one we are approaching the details become more and more clear. The closer you go, the finer the representation you get, and it may so happen that whatever patterns you were seeing from here are largely lost when you go nearer. I hope you understand what I am trying to say: the same data, represented at different scales, shows different details and misses different details. This is one point that needs to be taken into consideration.

There are in fact a few other issues related to this; let me tell you one of them. I do not know whether you people used ink pens in your childhood for writing on paper. Probably you took a filler, dipped it in the ink, drew the ink up and put it into the tube. Now, you are supposed to fill up the whole tube, but most of the time you never completely filled it up. Why? Because for the last drop you are about to put in, you can at least guess how much volume it might occupy, and if you think it is going to overflow, you do not put it in.
So everything is confined to the volume that one drop of ink occupies; you cannot reduce it further than that. Similarly, when you are doing digitization, you digitize only to a certain degree, and everything is limited to that; you are not in a position to go beyond it. So, with respect to the scale of digitization there, or the volume of a drop of ink when you are filling the tube, there is a scale at which you are looking at the details; with respect to that scale some things may change, but you would like to keep many things as they are.

So the problem here, in total, is what is known as multiscale data condensation: at different scales you are going to reduce the size of the data. What scale is needed, that is for the user to say; and once that scale is given, you are supposed to reduce the data set to that size. This is the basic problem.

Now there are in fact a few methods already existing in the literature for this. One of the methods is by Astrahan. What he did was this: in order to get the initial seed points for the K-means algorithm, select two radii D1 and D2, and for every point in the data set find the number of other points lying within distance D1 of it. Basically, for each point you consider a ball of radius D1 and count how many points of the data set lie within that radius; you are essentially measuring the density at every point. Then you find the point having the highest density, you retain that point, and you discard from the data set all points lying within distance D2 of the selected point, and you repeat. If you are trying to get only K such points, you do it until K points are obtained; if you do not have any such restriction, you go on doing it until the whole data set is exhausted. Astrahan went on doing it until he got K representative points, which he used as the initial seed points for the K-means algorithm. So this method already exists, but there is a small problem with it: you have to select those two radii D1 and D2, and the results are going to change if you change the values of D1 and D2. A rough sketch of the procedure is given below.
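This is a small sketch of that density-based selection, in the spirit of Astrahan's procedure as described above; the Euclidean distance, the tie-breaking, and the toy radii in the usage lines are assumptions of the sketch rather than part of the original method.

```python
import numpy as np

def astrahan_condense(X, d1, d2, k=None):
    """Density-based selection of representative points (Astrahan-style sketch).

    For every point, count how many other points lie within distance d1
    (a crude density estimate).  Repeatedly retain the densest remaining
    point and discard everything within distance d2 of it, until k
    representatives are found or the data set is exhausted.  Note that the
    result depends on the choice of d1 and d2.
    """
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    density = (dist <= d1).sum(axis=1) - 1       # exclude the point itself
    remaining = np.ones(len(X), dtype=bool)
    selected = []
    while remaining.any() and (k is None or len(selected) < k):
        cand = np.flatnonzero(remaining)
        idx = cand[np.argmax(density[cand])]     # densest point still left
        selected.append(int(idx))
        remaining &= dist[idx] > d2              # drop the d2-neighbourhood
    return X[selected]

# toy usage: pick 3 seed points for K-means from 500 random points
X = np.random.default_rng(0).normal(size=(500, 2))
print(astrahan_condense(X, d1=0.3, d2=0.5, k=3))
```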
There are some more methods. You have random sampling, with replacement and without replacement. Sampling methods are fine if you are really getting representative points from all over the data set; but if you are not getting points from all over the data set, then the sample is not really a representative of the whole thing. So you have these troubles with random sampling, with or without replacement, and you have similar problems with stratified sampling. There are some more methods as well, for example learning vector quantization and many others, and I will not go into all the details about them. One can surely use Astrahan's method, and one can always improve upon these methods to get better and better results; my point is that no single one of them is the only way of doing it.

The method I am going to tell you about now is a slight variant of the K nearest neighbour decision rule, which we have already discussed in one of the earlier classes in this lecture series. Basically, what we do here is this. You select an integer K; in multiscale data condensation it is this K that tells you the scale. Once the scale, that is K, is given, for every point you find the distance to its Kth nearest neighbour; if the point is Xi, call this distance Ri. Now select the point having the lowest value of Ri: the lowest value of Ri means that the density there is quite high, since within a small radius many points are lying. Then remove all points lying within 2 Ri of the selected point, that is, whichever points are within two times that radius you just remove, and you repeat these steps till the data set is exhausted. A rough sketch of this scheme also appears after this discussion. Here is one of the results obtained: these are the representative points discovered by this method.

There are several evaluation criteria that are used. You have the Kullback-Leibler information number, which has the form of the integral of g2(x) log( g1(x) / g2(x) ) dx, and you have the log-likelihood ratio; these in some way give you the distance between two distributions, and with them you can try to evaluate the performance of the algorithms. The last method that I described is in fact found to provide good results on many data sets, also in comparison with several other algorithms. But the point that I want to make is that there is always scope for improvement. My aim here is to tell you what the problem is and what the main issues around it are, rather than to say which solution is better and which is worse. So this is one problem about which I wanted to tell you.
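Here is the promised sketch of that K nearest neighbour based condensation. It assumes Euclidean distances, recomputes the neighbour distances over the remaining points at every step, and simply keeps the few left-over points at the end; these are implementation choices of the sketch, not necessarily the exact published algorithm.

```python
import numpy as np

def knn_condense(X, k):
    """Multiscale condensation sketch based on the k-NN rule described above.

    k plays the role of the scale: a larger k gives a coarser (smaller)
    condensed set.  Repeatedly pick the point whose distance r to its
    k-th nearest neighbour is smallest (the locally densest point), keep
    it as a representative, discard every point within 2*r of it, and
    repeat on what is left until the data set is exhausted.
    """
    X = np.asarray(X, dtype=float)
    remaining = np.arange(len(X))
    reps = []
    while len(remaining) > k:
        pts = X[remaining]
        dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        # distance of each remaining point to its k-th nearest neighbour
        r = np.sort(dist, axis=1)[:, k]
        j = np.argmin(r)                       # locally densest point
        reps.append(remaining[j])
        keep = dist[j] > 2 * r[j]              # discard the 2*r disc around it
        remaining = remaining[keep]
    reps.extend(remaining)                     # the few left-over points
    return X[reps]

# usage: condense 500 points at two different scales
X = np.random.default_rng(2).normal(size=(500, 2))
print(len(knn_condense(X, k=5)), len(knn_condense(X, k=25)))
```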
There is another problem, and that is clustering of features. We have already done clustering of points; if you want to do clustering of features, basically you need to have some sort of a distance measure or dissimilarity measure between features. Now there is already a similarity measure between features, the correlation coefficient. But the main problem with it is that if you rotate the whole data by some angle theta, you would like the correlation value to remain the same, and it is not going to remain the same; at least one cannot guarantee that it will. This is one of the main problems with the correlation coefficient.

So basically what one would like to have is some sort of correlation-like measure whose value does not really change even when the data is rotated by some angle, and there is one way in which one can suggest such a measure. For the two variables under consideration, look at their covariance matrix: for two variables the covariance matrix looks like sigma 11, sigma 12 in the first row and sigma 12, sigma 22 in the second. Find the eigenvalues and eigenvectors of this matrix, and call the eigenvalues λ1 and λ2, with λ1 greater than or equal to λ2. Now if the variables are related linearly, what happens to the value of λ2? λ2 will be 0. If the variables are related linearly, that means there is a direct relationship between their values, then this matrix will be of rank 1; since the matrix is of rank 1 its determinant is 0, so the product of the eigenvalues is 0; and since this is a non-negative definite matrix, both eigenvalues have to be greater than or equal to 0, so the second eigenvalue has to be 0. So if the variables are linearly related, the second eigenvalue is 0, and somehow you can take the second eigenvalue to represent the distance between the two variables, which is what has been done here. Whether you take it in absolute terms, just λ2, or you take λ2 / (λ1 + λ2), there are many ways of doing this; by taking λ2 / (λ1 + λ2) you can always make the distance between two variables lie between 0 and, in fact, what is the maximum value of λ2 / (λ1 + λ2)? It cannot be more than half: the maximum value of λ2 is λ1, and when λ2 equals λ1 the ratio is λ1 / (2 λ1), which is half. So here we are putting everything between 0 and half by taking λ2 / (λ1 + λ2). Again, this is one way of defining dissimilarity between two variables; a small sketch of computing it is given at the end of this discussion.

Now once you define this, you can always do a clustering of the features, similar to the way we did data condensation; in the same way you can do feature condensation. You take a feature and find its Kth nearest neighbour. Finding its Kth nearest neighbour means, say you have chosen K equal to 10: then for this feature you find its distances to all the other features, you arrange those distances in increasing order, and you take the tenth such distance; whichever feature provides that distance is the tenth nearest neighbour of this feature. So you can do all these things, and you can develop clustering algorithms using this sort of dissimilarity measure. This is only one such measure; one can always look at other similarity measures, and in fact it is very much needed that we have many other measures which quantify the similarity between two variables or two features.
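A minimal sketch of this eigenvalue-based dissimilarity, assuming the plain sample covariance of the two features; the small guards against floating-point noise and degenerate (constant) features are my own additions.

```python
import numpy as np

def feature_dissimilarity(x, y, normalised=True):
    """Eigenvalue-based dissimilarity between two features, as sketched above.

    Form the 2x2 covariance matrix of the two features, take its smaller
    eigenvalue lambda2, and (optionally) normalise by lambda1 + lambda2.
    The value is 0 when the features are exactly linearly related and is
    at most 1/2 in the normalised form (attained when lambda1 = lambda2).
    """
    c = np.cov(np.vstack([x, y]))              # [[s11, s12], [s12, s22]]
    lam = np.linalg.eigvalsh(c)                # ascending, both >= 0 in theory
    lam2 = max(float(lam[0]), 0.0)             # clip tiny negative round-off
    lam1 = float(lam[1])
    if not normalised:
        return lam2
    total = lam1 + lam2
    return lam2 / total if total > 0 else 0.0  # constant features: call it 0

def feature_dissimilarity_matrix(X):
    """Pairwise dissimilarities between all columns (features) of X,
    which could then feed a k-NN style clustering of the features."""
    d = X.shape[1]
    D = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            D[i, j] = D[j, i] = feature_dissimilarity(X[:, i], X[:, j])
    return D

# usage: a feature that is a linear function of another gets dissimilarity ~0
rng = np.random.default_rng(3)
a = rng.normal(size=200)
X = np.column_stack([a, 3 * a + 1, rng.normal(size=200)])
print(np.round(feature_dissimilarity_matrix(X), 3))
```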
In one of my earlier talks I was mentioning dependency and independence. We know the definition of independence of two random variables, and we say that two random variables are dependent if they are not independent; but how dependent they are, we do not know. Similarly, the distance between two random variables that I am telling you about here is not exactly a metric; for the measure that was defined, one can show that it is not a metric. I do not know whether I have given the definition of a metric in some corner earlier, so let me just define what a metric is. Let X be a nonempty set and let D be a function defined from X × X to [0, ∞); D is said to be a metric on X, and this word metric is taken from mathematics, it gives you the definition of distance in a mathematical way, if the following hold. The first property is that the distance cannot be less than 0; it has to be greater than or equal to 0. The second is that the distance between x and y is the same as the distance between y and x. The third is that if the distance between two points is 0, then those two points have to be the same, and the distance from a point to itself has to be 0. And the last one is the triangle inequality: the distance between x and y plus the distance between y and z is greater than or equal to the distance between x and z, for all x, y, z belonging to X. Now the distance measure that is proposed here is not a metric, and that can be seen easily: for instance, it is 0 for any two linearly related features even though they are not the same feature. So the question is, can we define a metric between two features, and if we can define one such metric, does it really satisfy our intuition of distance between two features? There are two aspects: number one, how to define a metric between two features, and number two, the definition that is given should be satisfactory to the people working in this field. These are all problems on which people have been working. Here also, by defining such a measure we are trying to quantify the relationship between the two variables; using this definition, that is λ2, or λ2 / (λ1 + λ2), some work has already been done, and in some cases improvements have also been made on it by modifying a few concepts. It is very much necessary to devise more feature clustering algorithms in different domains, such as the web mining domain, and there is another domain, bioinformatics, where the sizes of the feature vectors are huge: the number of points becomes very small, but the number of dimensions becomes really large.

Now there is a third aspect on which I would like to discuss a little, and that is data visualization. When we are given points in the two-dimensional plane, in two-dimensional Euclidean space, and the points are plotted, we understand that this point is closer to that one, that the distance between these two points is less than the distance between those two. When we look at the data we understand these distances very well; we see which points are closer and which points are far apart, and we somehow try to get the shape of the data set; for example, for a data set like this one, we know that it is circular.
We somehow try to get the shape of the data set in two dimensions when we plot like this; maybe not always the shape, but many times we understand the meaning of the distances between points, which two points are closer, which two points are farther apart, which distances are smaller than which, and which distances are larger than which. All these things we understand nicely. But the moment the data is in three dimensions or more, four, five, six, since we are not in a position to plot it, we are not in a position to visualize the data properly. So one of the problems on which people have tried to work is: how do you visualize multi-dimensional data in such a way that we understand those distances, that some distance is smaller than another and some distance is larger than another? We need to represent the data set in such a way that the properties regarding the distances are understood well by us; that is the main issue of data visualization.

This is in a sense connected with Kohonen's self-organizing map. I am not sure how many of you are aware of Kohonen's self-organizing map; there also the issue is the same, the data is mapped into one or two dimensions so that you would like to preserve the topology. Topology preserving: do you know anything about it? The issues here are related to Kohonen's self-organizing map in the sense that in that map, from any dimension, you would like to reduce the data set to two or three dimensions; that is what Kohonen tried to do, and his aim was to preserve the topology as much as possible. What is the meaning of preserving the topology? It is exactly what I am saying regarding these distances. That is, if you have points x1 to xn, you want to define a map f onto another set y1 to yn such that f(xi) = yi. Let us say the xi are in some ten-dimensional space and the yi are in two-dimensional space; so from ten dimensions you would like to have a map into two dimensions, and you would like to preserve the topology. Preserving the topology means: if, say, the distance between x1 and x2 is less than the distance between x3 and x4, then the distance between y1 and y2 should also be less than the distance between y3 and y4; whenever such a relationship is satisfied among the xi, you want the corresponding relationship to be satisfied among the yi. Now if there are n points, there are in total nC2 pairs, which means nC2 distances; and if you want to compare one distance with another distance, two at a time, that is nC2 choose 2, that is, n(n-1)/2 taken two at a time. So you have that many equalities or inequalities, and correspondingly the same equalities or inequalities should be satisfied among the yi as well; that is what you want. A small sketch of checking exactly these order relations is given below.

Now there is a small twist here. The twist is that you can show mathematically that this is not always possible; one can easily provide counterexamples where topology preserving maps are nonexistent. In fact you can check it on your own with a simple data set, which I will describe next.
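Before that example, here is a small sketch of what preserving these order relations means computationally: it simply counts how many of the nC2 choose 2 distance comparisons get reversed by a given mapping, assuming Euclidean distances. The example mapping, which just drops one coordinate of the 27-point grid described next, is only an illustrative choice.

```python
import numpy as np
from itertools import combinations

def order_violations(X, Y, tol=1e-12):
    """Count how many of the C(nC2, 2) distance-order relations are reversed.

    X holds the original points x1..xn (any dimension), Y the mapped points
    y1..yn in a lower dimension.  For every pair of pairs we check whether
    the order of the two distances is reversed after the mapping; each
    reversal is one violation of order (topology) preservation.
    """
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    pairs = list(combinations(range(len(X)), 2))
    violations = 0
    for (i, j), (k, l) in combinations(pairs, 2):
        # a strict reversal of the order counts as a violation
        if (dX[i, j] - dX[k, l]) * (dY[i, j] - dY[k, l]) < -tol:
            violations += 1
    return violations

# usage: the 27 grid points {-1, 0, 1}^3 mentioned next, mapped to 2D by
# simply dropping the third coordinate -- far from order preserving
grid = np.array([[a, b, c] for a in (-1, 0, 1)
                           for b in (-1, 0, 1)
                           for c in (-1, 0, 1)], dtype=float)
print(order_violations(grid, grid[:, :2]))
```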
Take the cube in three dimensions whose corners are at plus one and minus one, and consider the points whose coordinates are each minus one, zero or plus one: from (0, 0, 0), along each coordinate you go plus one, minus one or stay at zero, so for each coordinate you have three ways of doing it, and 3 × 3 × 3 gives 27 points in total. Now correspondingly try to get 27 points in two dimensions having the same equalities and inequalities: between any two of the 27 points you can measure the distance, you can write down all those inequalities, and you must choose 27 points in R2 satisfying the same inequalities. You can try to do it, and I can assure you that you cannot; you simply cannot do it.

What Kohonen tried to do was to ask how close you can go to that, how close you can get to the optimum, and he gave an algorithm for it; with that you have got the self-organizing feature map. Before Kohonen came into the picture, people tried it in many ways: some people tried to give a tree structure for looking at the data, and some people just used faces. There is a famous article on Chernoff faces, by the same Chernoff I was talking about in connection with Chernoff bounds; just search on Google for Chernoff faces. There, too, they try to represent data in multiple dimensions, using faces. Like that you will find several different works. So yes, on one hand I told you very clearly that for something it is impossible; but how close you can go to it is what Kohonen tried to do, and probably people can improve upon what Kohonen did; how to improve upon it, and how to do it, is a matter of further research.

Basically, topology preserving maps are not always possible. There are theorems and proofs corresponding to this statement; if you go through some books on topology, which is basically a subject in pure mathematics, you will find the theorems and proofs for the statement I made, that from a higher dimension to a lower dimension topology preserving maps do not always exist. This was proved by people long back. But data visualization is extremely important. All right, something is not possible, that is fine; but to what extent can you visualize, and are you in a position to develop techniques that let us visualize to that extent? This is another issue on which work is going on, especially now when you have very high dimensional data, both in the data mining literature and in bioinformatics, as well as in web mining where you are dealing with web pages. So data visualization has become really, really important. I am stopping here; if you have any questions, please ask me.

Question: Sir, instead of data visualization, if we just find out the manifold of the data, is that an alternative?

Answer: I do not want to say that it is not an alternative; I surely do not want to say that.
Manifolds are useful in many situations, I surely agree with that, and if you can really get some idea about the data using these manifolds, that is fine; I am always for it. But how do you generate the manifold? Much depends on that, on the generation of the manifolds. Yes, I know that in the face recognition literature people have developed manifolds, and by using manifolds they are doing many things in face recognition; I have seen those papers. But it is not exactly data visualization: they have constructed a manifold for some specific purposes, and that is absolutely fine. If you can visualize, or if you can get something more about the data than what you already know, using manifolds, that is fine with me; I simply do not have any problem with that, and I encourage it. Whether it is an alternative or not, I do not want to use the word alternative; you can use that word when you are able to get good success with it. I do not know whether you can achieve that success; if you achieve the success, well, you can call it an alternative, and I do not have any problem with that. Any other question?

Question: Sir, can we do something like PCA or LDA to visualize, I mean, reduce the dimension and then try to visualize the data?

Answer: No, I mean, you can always reduce the dimension, but by reducing the dimension you are losing information; many times the information that you lose may be a very small amount, but you are surely losing information. I am not saying that you should not do it, because the question then is, what is the alternative? I do not see any other way of visualization, so when you are trying to do it in one way, I cannot say that you should not do it if I do not give you an alternative, and I am not in a position to give you an alternative. So you can always do that, but that is not the end of it, because you must understand that you are losing information whenever you do any sort of PCA: after all, some eigenvalues are greater than 0 and you are not taking them. You can say that the amount of information lost is small, that is a different matter, but you are still losing information. Whereas here what we want is to preserve those distances. Are you in a position to preserve those distances when you are doing this PCA or LDA sort of thing? You cannot make any guarantee to that effect; and when you cannot give any guarantee, you really do not know whether you are able to keep track of the distances or not. So you can always do that, but you need to understand the limitations; that is what my point is. Any other question? Thank you.
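As a small numerical illustration of that last answer, here is a sketch, assuming plain PCA through the covariance eigendecomposition, that reports how much variance the dropped eigenvalues carry, that is, the information that is lost, however small it may be.

```python
import numpy as np

def pca_project(X, dims=2):
    """Project data onto its top principal components and report the
    fraction of total variance carried by the dropped eigenvalues."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(vals)[::-1]             # largest first
    vals, vecs = vals[order], vecs[:, order]
    Y = Xc @ vecs[:, :dims]                    # the low-dimensional view
    lost = vals[dims:].sum() / vals.sum()      # variance of dropped components
    return Y, lost

# usage: 5-dimensional correlated data projected to 2 dimensions
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
Y, lost = pca_project(X, dims=2)
print("fraction of variance lost:", round(float(lost), 3))
```

One could also pass X together with the projected Y to the order-checking sketch given earlier to see that the distance orderings are not guaranteed to survive the projection.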