Welcome back to the lecture series on pattern recognition. In the last class we discussed how the normal distribution, when put into the formulation of the Bayes decision rule, that is, the posterior-probability-based class assignment, leads us to a distance function which we termed the Mahalanobis distance. The Mahalanobis distance is governed by the covariance matrix, and we took the simplest form of the covariance matrix, equal to an identity matrix, in which the variance terms are equal to unity and the off-diagonal covariance terms are zero. In such a case we get a Euclidean distance criterion, and by neglecting the class-invariant term (the x^T x term, which is common to all classes) we saw that we obtain a linear discriminant function which yields a linear decision boundary. So let us continue; look back into the slide and let us revisit the equations at the top. What you have is the expression for the distance of a sample to the class mean; under the assumption that the covariance matrix equals the identity matrix you get this expression. In some parts of today's lecture we will switch between capital G_i and small g_i, although both mean essentially the same thing; the difference is just a normalizing factor that takes the factor 2 out, dividing the entire expression by 2. So this is what we got: a linear discriminant function, obtained by ignoring the class-invariant term, which is what we had done in the last class. We have a weight vector equal to the i-th class mean, multiplied with the test sample x, and a bias term which also depends on the class mean. This linear discriminant function is also called a correlation detector, but we will not dwell on that; we will proceed from where we stopped in the last class. The linear discriminant function for linearly separable classes can be expressed as shown, and we know the corresponding terms: w_i is a d x 1 vector of weights used for class i. In this particular case we are talking about a d-dimensional space; that means you have picked d different features from the samples, either during training or during testing, as we discussed earlier in terms of supervised learning, and we will see a small example at the end of today's class of what can be considered training as well. So that is w_i, and of course you have the expression for w_i0 from earlier. This function, in this particular form, leads to decision boundaries (DBs) which are hyperplanes in higher dimensions: it is a point in 1D, a line in 2D, a planar surface in 3D, and in general a hyperplane in higher dimensions.
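As a quick illustration, here is a minimal sketch in Python of the discriminant g_i(x) = mu_i^T x - (1/2) mu_i^T mu_i obtained by dropping the class-invariant term under an identity covariance and equal priors; the class means and the test sample below are made-up values, not taken from the slides.

```python
import numpy as np

def linear_discriminant(x, mu_i):
    """g_i(x) = mu_i^T x - 0.5 * mu_i^T mu_i  (identity covariance, equal priors)."""
    w_i = mu_i                      # weight vector equals the class mean
    w_i0 = -0.5 * (mu_i @ mu_i)     # bias term depends only on the class mean
    return w_i @ x + w_i0

# hypothetical class means in a d = 2 feature space
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
x = np.array([2.5, 2.0])            # a test sample

scores = [linear_discriminant(x, m) for m in mu]
print("assigned class:", int(np.argmax(scores)))   # largest g_i wins
```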
So henceforth, although there is a small difference between a discriminant function and a decision boundary, and we will see how decision boundaries can be obtained from discriminant functions, for the time being we will consider both to be identical. You must remember one particular fact here: a discriminant function is defined for a particular class i, whereas a decision boundary is between a pair of classes i and j, class 1 and 2, class A and B. Now, when you are talking about a decision boundary between classes, there are also classifiers designed where you try to classify samples of a particular class against all other classes; that sort of discrimination, categorization or classification is often termed one-versus-all or one-versus-rest. In that case also you will get a discriminant function and a corresponding decision boundary. Whether it is one-versus-one or one-versus-rest, the one-versus-one case is what we talked about: between two classes i and j, with i not equal to j, class A and B, say between fruits and flowers, between two categories of flowers, fruits or bags where image samples are taken for categorization, or between two different faces or fingerprints in pattern recognition applications. We say that a discriminant function is typically for one class and a decision boundary is between a pair of classes, but for some time today we will not distinguish between the two; in fact, one of the slides will have an equation where the DB, the decision boundary, is derived from the discriminant function. Let us go ahead. In 3D the decision boundary is a planar surface, that is, a plane. If the plane passes through the origin, the expression w_i^T x = 0 takes its simplest form, with three components each for w and x, and the bias term equal to 0 because the plane passes through the origin; if you substitute the all-zeros vector here, the equation is satisfied. So this is a very special form of a discriminant function or decision boundary. We will now analyze this expression in a little more detail and try to see the difference between a discriminant function and a decision boundary, plus how you find the weights w_i for the purpose of classification. So we will try to understand, given a set of training samples (not test samples, I repeat, training samples), how you define these weights which will help you in categorization; they will help you build the discriminant function as well as the decision boundary.
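To connect the two notions concretely, here is a small sketch, continuing the hypothetical setup above, of how a pairwise decision boundary can be formed from two per-class discriminants: the boundary between classes i and j is where g_i(x) - g_j(x) = 0, which is again a linear function of x.

```python
import numpy as np

def pairwise_boundary(mu_i, mu_j):
    """Return (w, w0) of the linear boundary g_i(x) - g_j(x) = w^T x + w0 = 0."""
    w = mu_i - mu_j
    w0 = -0.5 * (mu_i @ mu_i - mu_j @ mu_j)
    return w, w0

mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 3.0])   # hypothetical class means
w, w0 = pairwise_boundary(mu_i, mu_j)
x = np.array([2.5, 2.0])
side = np.sign(w @ x + w0)    # +1 -> class i side, -1 -> class j side
print(side)
```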
We will observe this equation and give a geometrical, graph-based interpretation of it; we will use a two-dimensional graph, that is, 2D space, and see the interpretation and significance of the weights and samples in the next set of slides. In general the equation is of this form. One warning about the symbol d here: do not confuse it with the dimension of the problem, it is just a notation. The expression we saw in the previous slide, w^T x + w_i0, is being rewritten in this form; you can take the weight vector out as a common factor and write g(x) = w^T (x - x_d), where x_d is a vector (a point) and w^T x_d is the scalar d; we will find the significance of this d very soon. So the equation represents a plane, or a hyperplane, passing through an arbitrary point in d-dimensional space called x_d. This plane, H, represented by this equation, passes through that particular point; of course we will show the representations and diagrams in 2D, which is easy to visualize. This plane partitions the space into two mutually exclusive regions, R_p and R_n; both regions are semi-infinite, as we discussed in the earlier class. The class assignment rule, somewhat similar to what we do in the Bayes classifier although there is no probabilistic function here, can be stated using this discriminant function as follows: if the vector x lies on the positive side, represented by R_p (the p indicates the positive side with respect to the plane H, meaning the function gives a positive value), then we assign the sample x to the class on the positive side; if it is negative, x belongs to the other side, R_n, the negative side; and if the value is 0, the point lies on the hyperplane H itself. It is the simplest possible criterion on which we can do classification, provided you remember that this linear discriminant function is representing a decision boundary between two classes, again one-versus-one or one-versus-rest. We will look at the geometrical interpretation with the help of the next slides; we analyze the linear decision boundary first and then proceed towards nonlinearities in the corresponding discriminant function. If you look at this slide, the same equation which we got for the linear discriminant function (remember, I said we will not distinguish between discriminant functions and decision boundaries for now; both can be considered identical) is represented here. The background of the slide has been changed purposefully, because some of these diagrams need colour markers and colour plots to be visible; we will get back to the earlier format later on. So look at this particular graph: this is a two-dimensional space, and x1 and x2 are the two features of the samples.
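The sign-based assignment rule itself is one line; below is a minimal sketch, with hypothetical values, of the rule g(x) = w^T (x - x_d) and its three cases: positive side, negative side, or on the plane.

```python
import numpy as np

def classify(x, w, x_d):
    """Sign-based rule for the hyperplane g(x) = w^T (x - x_d) = 0."""
    g = w @ (x - x_d)
    if g > 0:
        return "R_p (positive side)"
    if g < 0:
        return "R_n (negative side)"
    return "on the hyperplane H"

w   = np.array([1.0, -1.0])     # hypothetical normal to H
x_d = np.array([0.5, 0.5])      # a point the plane passes through
print(classify(np.array([2.0, 0.0]), w, x_d))
```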
So x1 and x2 span a two-dimensional space, and x is a two-dimensional vector, any point in this space. The hyperplane H, represented by the equation g(x) = 0, is drawn as this plane; if you take any point x lying on H, the equation is satisfied. Remember the point x_d we saw in the previous slide: it is a point on that plane, because the plane H (a plane in higher dimension, a line in 2D) passes through x_d. So this is the point x_d, lying on H, and the normal to the plane H is the vector w, the same w as in the equation; it is orthogonal to the plane. We are looking at what we call the pattern or feature space, and at a two-dimensional representation of the expression for the hyperplane H; it is a line in 2D, and again you can visualize it as a plane in 3D and a hyperplane in higher dimensions. You can see the positive side R_p and the negative side R_n. What does it mean? If you take a sample x lying somewhere in the positive region R_p, then, going back to the discriminant function criterion, you get a positive value, so we say x lies in the positive region and belongs to that particular category or class. If it lies below H, on the negative side R_n to be precise, then g is negative and x belongs to the other class. If the point lies on H itself, the function equals 0. In the special case where the plane passes through the origin, d is absorbed and the expression reduces to the dot product w^T x (which equals x^T w). So this is the interpretation of the expression: although it is a discriminant function and a linear function, it also represents a decision boundary, with a positive side on one side and a negative side on the other, providing the boundary between two classes. The orientation of H is determined by w, and its location is determined by d, which can be regarded as fixing the perpendicular distance of the plane from the origin; it is the projection of x_d onto w. And of course H is a hyperplane if the dimension is more than 3.
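A small numeric sketch of that geometry, using the same hypothetical w and x_d as above: the orientation is the unit normal w / ||w||, and the (signed) perpendicular distance of the plane from the origin is d / ||w||, where d is the projection of x_d onto w.

```python
import numpy as np

w   = np.array([1.0, -1.0])          # normal: fixes the orientation of H
x_d = np.array([0.5, 0.5])           # a point on H: fixes its location
d   = w @ x_d                        # the scalar d in g(x) = w^T x - d

unit_normal = w / np.linalg.norm(w)
dist_from_origin = d / np.linalg.norm(w)   # signed perpendicular distance of H from the origin
print(unit_normal, dist_from_origin)
```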
Now look at the complementary role played in the parameter space. What I have drawn in the plot below is the complementary picture of the pattern (feature) space, which we call the weight space. It is the same equation, but unlike the previous case, where the two axes were the feature components x1 and x2, I have now put w1 and w2, the components of the weight vector w, on the axes; again we are operating in 2D, and we get the same kind of line. The only difference is the way the equation has been transposed: instead of x^T w we write w^T x. So this is the weight space: when you move around this space you are considering different values or combinations of the tuple (w1, w2), whereas when you move around the pattern (feature) space you are considering combinations of (x1, x2). That is the complementary role; it is as if you have moved from the pattern space to the weight space. Here the normal to H is not given by w but by the sample x: in the pattern space w dictated the normal, the orientation of H; here it is dictated by the sample x. So what does the weight space tell us in terms of the decision rule? Remember, in the pattern space the plane is dictated by w; in the weight space it is dictated by a sample x. Move around in one and you get different sample points; move around in the other and you get different weights. One side is the negative side and the other the positive side for a given sample. That means once you have decided the weights, the plane in the pattern space is decided; but the question is, who gives you the weights? For the discriminant function we had an expression, because it depends on the class means, and for a decision boundary we will also have an expression for w. So, in the pattern space, a given w decides the orientation and position of the plane; in the weight space, a given sample decides the position of H. That means, given a sample, if you have a choice of weights, the set of weights on the positive side of this hyperplane will give a positive value of g, and the set of weights on the negative side will give a negative value of g. When you keep choosing different samples you get different orientations of H in the weight space; when you choose different weights you get different planes in the pattern space. Please remember this concept, because we are going to use it and extend this idea to say something about how to design, or how to pick, the weights w in general for a given set of samples; during training you may not have just one sample but a set of samples from two or more classes, and in that case how do you decide w? So, looking back: in the pattern (feature) space, w decides the hyperplane; the weight space is just the complementary picture. The w's have become the variables here, whereas there the variables were the x's; the x's have now become the parameters of the plane H. In the weight space the former independent variables become the parameters, and in the pattern space, where the x's are the independent variables, the w's are the parameters. Keep this complementary role between pattern and weight space in mind; this type of concept is used in many other applications. A typical example, for those working in the imaging domain, is the Hough transform, where you go from the spatial domain to a parametric space.
A similar analogy is shown here, where the roles of the axis variables and the parameters are mutually swapped; keep this in mind, because we will use one of the spaces and sometimes interchange them. Given a point x, if you go back, that point x indicates a plane H in the weight space with a corresponding orientation; a point in the weight space indicates a weight vector, which in turn dictates a plane H in the pattern space. So you can see the reversal of roles between the parameters and the axis variables; remember we are using the same expression throughout. So these were the two spaces: this is the pattern or feature space and this is the weight space. You can see that in the weight space the sample dictates the orientation of H, whereas in the pattern space the weight w (this is the correct notation of w, which is normal to H) dictates the position of the hyperplane H. Both of these pictures are complementary: when you are searching for samples you move around in one, and when you are searching for weights you move around in the other. Keeping this in mind, let us observe the weight space and ask: what is the impact of different choices of weights on a set of given samples? For the time being, the setting is this: for class T1 (this is just a label for class 1) you are given two samples, x1 and x2, and for class 2 you are given two samples, x3 and x4. What we are trying to do is obtain, from the discriminant function, a linear decision boundary which will split the space into two classes, two regions, two categories. Typically, during training you have several samples per class; sometimes the samples are very few, in which case the training can become more erroneous or difficult, but in general, in most machine learning and pattern recognition applications you have a huge number of training samples. Instead of taking a very large set, for the sake of visualization we observe this concept in two-dimensional space with just two samples per class. Let us look back at the slide: for class 1, indicated by the label t1, you have two samples x1 and x2 given to the system; for class 2, labelled t2, you have two other samples, x3 and x4. Now, given these two pairs of samples, one pair per class, how do you find a w? Do you use the equation given for the discriminant function directly? Maybe not; there are, of course, equations for decision boundaries, but what we are going to see here is a process which leads us to another type of classifier, which we will now introduce, and which gives us a linear decision boundary during training with these samples.
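For concreteness, here is a small sketch of such a toy training set; the numbers are invented for illustration and are not taken from the slide. There are two samples with target +1 for class t1 and two with target -1 for class t2, and this shape of data is reused in the perceptron sketches further below.

```python
import numpy as np

# hypothetical 2-D training set: two samples per class
X = np.array([[ 2.0,  1.0],    # x1, class t1
              [ 1.0,  2.0],    # x2, class t1
              [-1.0, -2.0],    # x3, class t2
              [-2.0, -1.0]])   # x4, class t2
t = np.array([+1, +1, -1, -1])  # desired sign of g(x) for each sample
```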
Let us observe what the correct set of weights might be: can we identify some region in the weight space where we get correct classification? That is the purpose of this explanation. This sort of concept is explained in the book on artificial neural networks by Satish Kumar (if I recall the name correctly); Satish Kumar's book has an explanation of this, and it comes under the learning algorithms for a perceptron. So we are leading towards a discussion of how a perceptron can be trained to provide linear decision boundaries; we will explore that first and come back to discriminant functions and decision boundaries under the Bayes paradigm, perhaps in the next class. If you go back: what you could do is pick a sample, x1. If you pick the sample x1 and look into this weight space, what do you get? You get a corresponding hyperplane H' defined by the sample x_k; let us say k = 1, I repeat, k = 1, meaning you have picked the first sample. That first sample gives you some plane in the weight space; let us say it is this one (I am not writing the symbol H on it). It is a line in 2D, but I will keep using the term hyperplane or plane, since in 3D or higher dimensions that is what you should visualize; the graphs are shown in 2D because a geometrical interpretation is what we are after. The normal to this hyperplane is given by the sample x_k; since this plane is drawn from the sample x1, the normal to it is given by x1. Next, pick the sample x2 and you get another plane, say somewhere here; this plane is contributed by the second sample, and that is why the normal to it is given by that vector. What does this tell us? In some sense it tells us that the positive side is where the arrow points: it seems that if we choose weights in this region (I repeat, the positive side of the plane is given by the direction of the arrow), we will get a positive value of g. A positive value of g indicates that you are able to classify the sample correctly using the set of weights chosen from this region of the diagram. What if you get a negative value of g? Then it is an incorrect classification, if the sample belongs to class 1, indicated by the label t1 in this slide. But look at the complementary class, class 2, indicated by the label t2: if you pick samples from that class, you do not want the discriminant function g to give positive values for those weights; you want negative values. Remember, I am trying to design a single discriminant function with one set of weights: I am trying to find a set of values for w which gives positive values for class 1 and negative values for class 2. So if I pick the samples x1 and x2 from class t1 I should get a positive value of g, and if I pick the samples x3 and x4 from class 2, labelled t2, I should get a negative value of g. That is my purpose here, and finding the appropriate decision boundary which provides this classification is the purpose of this representation.
We are not actually proposing a learning algorithm right now, either for a statistical classifier or for a perceptron; we are trying to give a physical justification of how the weights should be selected so that you can do proper classification, and an interpretation of the decision boundary using this geometry, using this graph. So these are the hyperplanes corresponding to the samples from class 1. We will now take the samples from class 2, again two of them, and draw similar hyperplanes; nothing stops us from doing that. Let us draw one of them: say this is the hyperplane corresponding to x3, and this vector shows its positive side. What does that indicate? Remember, the sample x3 belongs to class 2; I must design the weights such that for samples like x3 I get a negative value of g. Samples belonging to class 1, I repeat, x1 and x2, should give a positive value of g; samples belonging to class 2, like x3 and x4, should give a negative value. That means, with respect to this hyperplane, if this is the positive side, I should not select weights in this positive region if I want a negative value of g for samples of class 2. Going back, with similar logic, for the two hyperplanes belonging to class 1 I should select weights such that g is positive for the samples x1 and x2. Now there is one sample left, x4; we draw another hyperplane for it. If you had more of these samples, three, four or more, you would keep getting more and more hyperplanes. What do these hyperplanes tell us? They say: please select weights on one side of me, the side chosen appropriately so that the weights give a positive value of g for class 1 and negative values for class 2. I repeat: corresponding to x1 and x2 I should select weights so that g is positive, and corresponding to x3 and x4 I should get negative values. Remember, in this weight space the samples just give the normals to the planes. Let us now observe this from a slightly different angle. This is the plot we obtained for the hyperplanes; what does it indicate? Take the samples x1 and x2 belonging to class 1. Their hyperplanes are this one for x1, this line or plane, and this other one for x2. They tell us we should select weights for which g becomes positive. What is that region? It is the intersection of the positive sides of both hyperplanes, which is this region; the shaded region in green indicates that if you select weights from here, not here, not there, not anywhere else, but only from this region, and substitute them into g, then the samples of class 1 will give a positive g. We have only two samples, x1 and x2, but the same argument applies if you have a large number of samples, say n of them; the region will just be different, and g will be positive for all the t1 samples. I hope this idea is clear. Remember, for the first hyperplane, for sample x1, the positive side is the semi-infinite region below this line; for sample x2 it is on the right side, say this entire region. So you must take the intersection of these two,
because if you select weights only here, you will get a positive g for sample x1 but not for samples like x2; similarly, if you select weights over there, you will get a positive value for sample x2, or samples similar to x2, but not for x1. Only in the intersection region is it guaranteed that you get a positive value of g for both. Good. Now take the samples from the second class, t2; where are their hyperplanes? One of them is here, corresponding to sample x3, and the other corresponds to sample x4; we drew them in the previous slide, with the vectors indicating the normals to the planes. In this second case, for class 2, you have to be a little careful: it is complementary to class 1, but we are using the same discriminant function g, which should also act as a decision boundary (remember, for now we are not distinguishing the two; it is the same g). So the discriminant function formed by a particular choice of weights, if it gives positive values for samples of class 1, should give negative values for samples of class 2. Therefore I should look at the negative regions, the negative sides, corresponding to the two hyperplanes drawn from (or obtained from) the samples of class 2, namely x3 and x4. Let us look at it: with respect to sample x3 this is the negative region, and with respect to sample x4 this is the other negative region. A similar intersection of these two negative regions gives you another semi-infinite region, in any dimension; we see this in 2D, you can easily visualize it in 3D, and in general we have very high dimensions, depending on the dimensionality of the feature vector or the pattern space we are talking about. So this is the negative part. Now what does this indicate? On the right side we have the region where g of the t1 samples is positive; that means the samples from class 1 tell us: please select weights anywhere here, and then I will get a correct class assignment from g, because it will give a positive value.
The other region, on the negative side, tells us about g for the t2 samples (remember, g is the same): please select weights from this region, because then samples from the second class will give a negative value of g, which again is a correct class assignment. Same g, same set of weights, two different classes: binary classification, as it is often called, or one-versus-one classification, two classes, 1 and 2, A and B, examples being flowers versus fruits, cars versus trucks, two different fingerprints or faces. I want to obtain one g which gives a positive value for samples of class 1 (class A) and a negative value for samples of class 2 (class B). To do that, if you look at the diagram very carefully, the intersection of these two regions is what is shown at the bottom: if you select weights in this shaded area (see the colour code), which is neither of the previous two regions alone but their intersection, you satisfy both requirements. This is what is called the solution space: the solution space for this classification, two classes only, mind you, with just two samples each. Why is this the solution space? Because it is the common region, the intersection of the region where g is negative for t2 and the region where g is positive for t1; that intersection is this particular region, which is again semi-infinite (I have just drawn a line, but it is a semi-infinite region). If you select weights there, both constraints are satisfied for the corresponding classes: samples from class 1 give positive values of g, and samples from the other class give negative values. If you select weights anywhere else, you will get wrong results. If you select weights here, it may be correct for the classification of class-2 samples but not for t1; similarly, if you select weights there, it will correctly classify class-1 samples but not class 2; and of course there is no question of selecting weights elsewhere. So this is the only solution space. I leave it to your imagination: if you have three or more samples per class, you keep drawing such hyperplanes; if there are n1 samples of class 1, you draw n1 hyperplanes and find the common intersection where g is positive for t1; then turn to class 2, say with n2 samples, a very large set perhaps, draw all those hyperplanes and find the intersection where g is negative for t2; and the intersection of these two intersections gives you the solution space. You could ask me a question: does the solution space always exist? It may not, because if there is no common region shared between the g-of-t1-positive region and the g-of-t2-negative region, mind you, you will not have it; we will think of such an example in a later class. For the time being we look at situations where you are able to draw a decision boundary; whether it is strictly a decision boundary is not yet very clear, mind you, but each of these hyperplanes is a linear discriminant function and you are getting a solution space from them. This is not a learning algorithm, this is not a learning algorithm to learn the weights, but it gives you an idea of the significance of the weights and of the possible solution space from which the weights are to be obtained.
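As a sketch of that idea, here is a tiny check, reusing the hypothetical toy data defined earlier, of whether a candidate weight vector lies in the solution space, that is, gives g > 0 for every class-t1 sample and g < 0 for every class-t2 sample.

```python
import numpy as np

def in_solution_space(w, X, t):
    """True if w classifies every training sample with the correct sign."""
    g = X @ w                      # g(x) = w^T x for each sample (bias omitted for simplicity)
    return bool(np.all(np.sign(g) == t))

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([+1, +1, -1, -1])
print(in_solution_space(np.array([1.0, 1.0]), X, t))    # True: inside the solution space
print(in_solution_space(np.array([-1.0, 0.5]), X, t))   # False: outside it
```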
That means, if you have the solution space somehow defined, you can now pick any value in it and it will give you the correct result; and a perceptron actually learns this. A perceptron is an implementation of a linear discriminant function; its motivation comes from the basic neuron of the human brain, which receives signals from several other neurons and provides a single output, and the most simplistic model of such a neuron, the perceptron, is in fact a linear discriminant function. This is not a course on neural networks; you have to learn many things about them from a separate course, but there is quite a significant overlap in terms of applications of neural networks to pattern recognition. So we will move towards the diagram of the perceptron, see how it learns from this solution space given the same set of samples, and then come back to the discriminant function and decision boundary, the linear case, once again. As promised, in the field of artificial neural networks a perceptron is built to form a linear discriminant function. This, let us say, is a single neuron, called a perceptron in the paradigm of artificial neural networks (an artificial neural network is built using a set of perceptrons). What does a single perceptron take? It takes different inputs, a set of components; let us say our input is the d-dimensional feature vector x1, x2, up to xd (do not confuse this d with the distance; this is the dimension d). There are weights sitting between the input and the perceptron: w1 is multiplied with x1, w2 with x2, w3 with x3, and so on, wd with xd; all of these are summed at this point, and there is an external bias. So the output O(x), where x is the d-dimensional feature vector, can be seen to implement this expression, which is nothing new; we obtained it at the beginning of the class when we talked about the linear discriminant function derived from the Mahalanobis distance after taking the covariance matrix equal to the identity matrix. That same statistical linear discriminant function is what the perceptron implements. Now the question: the discussion of the last few minutes about the solution space in the weight space leads you to ask, in the case of a perceptron, who decides the weight vector which sits here? How is it designed? There is a corresponding learning law, but can we interpret this process of learning using the diagram we have just studied? I would also like you to imagine that in 2D space, in a very simplistic sense, O(x) can be considered to be our g, the discriminant function, and it gives the familiar equation of a line; everybody knows this, so visualize it as the equation of a line in 2D space, our linear discriminant function or linear decision boundary. If you ask me to compare the two: the bias is just a single scalar quantity c, and the weight vector w has two components, w1 and w2, which play the roles of m and 1 in the line equation, while x1 and x2 play the roles of x and y; this is exactly what we were doing in the previous slides when we took x1 and x2, equivalent to x and y, in two-dimensional space.
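A minimal sketch of that single perceptron follows, with hypothetical values; the "output" here is just the linear discriminant itself, before any thresholding.

```python
import numpy as np

def perceptron_output(x, w, bias):
    """O(x) = w1*x1 + w2*x2 + ... + wd*xd + bias  -- a linear discriminant."""
    return w @ x + bias

w    = np.array([0.4, -0.7, 1.2])   # one weight per input component
bias = 0.1                          # the external bias term
x    = np.array([1.0, 2.0, 0.5])    # a d = 3 input feature vector
print(perceptron_output(x, w, bias))
```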
So we will now move towards what is called (I will not derive it completely) the learning law for perceptron models, related to the least-mean-square idea, and describe the process by which these weights are assigned. The basic idea is this: a perceptron learning algorithm starts with a set of random weights w. You put random values here and then increment the weights iteratively using this expression. This is not the final expression, but look at what it does: it uses, on the right-hand side, exactly the condition we were using a little earlier for class assignment, whether g for a t1 sample is positive or negative; based on that criterion we were looking at the solution space, and now, instead, we are looking at incremental learning of the weights. It says: keep the weights unaltered if the sample x provided belongs to the correct side, that is, if g is positive; if g is negative (if it is exactly zero you can do either, it does not matter), then you increment the old weight using a differential term, as given, where eta is a learning-rate parameter. I will tell you later what to do with this eta; for the time being assume it to be a constant, although it is written as a function of k, meaning that as the iterations proceed you change it. But what is the significance of the update? You are changing the weight; remember these are all vector quantities, so you are adding a vector to the old weight vector to get a new weight vector. I repeat: you start with random weights and incrementally update them; that is the essence of neural network training, and we are trying to bring in an analogy with the weight space we saw a little earlier. So it says: if the weights at the current iteration provide correct classification, as given here (provided the sample is from class 1, where you want positive values), you do not adjust them; but if the discriminant function g(x), given by this expression, gives a negative value, that is, the sample falls on the wrong side of the hyperplane, then you change the weight along the direction of the sample vector. Let us look at what changing the weight along the direction of the sample means. Go back to the weight space: we have the hyperplane H, given by this expression, in 2D, 3D or higher dimensions, and the normal to that hyperplane is given by the sample x. If you draw the diagram corresponding to this expression, it means that if the weight is here, at w_k, then eta times x_k tells you to move the weight towards the positive side along x_k, bringing it to the point given by w_{k+1}; and you do this only if you are in the negative zone of the weight space. Remember, H partitions the weight space into two semi-infinite regions, a positive region and a negative region, as shown; if you are already on the positive side, as given by the constraint, you do not change the weights, w_{k+1} = w_k. However, if the other condition occurs, that is, the discriminant function gives a negative value, the weights are not correct; you want a positive value of g, so the weights must be changed.
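In symbols, the rule described so far for a class-1 sample is: if g(x) = w_k^T x is not positive (wrong side), set w_{k+1} = w_k + eta_k * x; otherwise leave w unchanged. A minimal sketch with hypothetical values:

```python
import numpy as np

def update_class1(w, x, eta):
    """Perceptron update for a class-1 sample: move w along x only if g(x) is not positive."""
    if w @ x <= 0:               # wrong side (or on the hyperplane): adjust the weights
        return w + eta * x
    return w                     # correct side: keep the weights unaltered

w = np.array([-0.5, 0.2])        # some current weight vector
x = np.array([2.0, 1.0])         # a class-1 training sample
print(update_class1(w, x, eta=1.0))   # moves the weights towards the positive side of x
```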
You change w along the direction of x, which takes you from the negative side towards the positive side. You may not reach the positive side in a single jump: if you are far away, you will probably move towards the positive side but still be in the negative region, so you do it incrementally and keep doing it; once you are on the correct side, the constraint is satisfied and you stop changing the weights. This is done for one particular sample; you will have several training samples from both classes, and you follow this iteratively for each sample. That is what neural network training essentially does. In general, look at this expression now. The first line is almost unchanged, except that we have added another constraint: the sample must belong to class 1. If the sample belongs to class 1 and the value of g is positive, do nothing (there is no point writing about it); only if the value of g is negative, as the first constraint says, do you change the weights along the direction of x. However, if you choose a sample belonging to class 2, you have to look at the complementary condition: if g is positive you must make a change, and if it is negative you do not. So, to summarize: if the sample belongs to class 1 and g is positive, no change; if the sample belongs to class 2 and g is negative, no change; you change only when you are on the wrong side. What does the wrong side mean? For samples of class 1, if you get a negative value of g, make the change along the direction of x; for samples of class 2, if you are on the positive side, think of the complementary figure, w_k is on the wrong side, and you need to change the weight in the negative direction of the sample x to go to the other side. So you have to be a little careful here; that is the logic: you change the weight along the sample vector if the sample belongs to class 1, and in the reverse direction if it belongs to class 2. Now what we will do is apply this type of incremental learning algorithm for the perceptron to the diagram we had a couple of slides back, where we had the solution space for two classes with two training samples each, and see whether, starting from a random value of the weights, we reach the solution space. This diagram is given as a static slide in the book by Satish Kumar. So we will see whether this kind of incremental learning of the weights helps us obtain a correct value of w; once you get that correct value of w, you have designed your classifier, in this case the set of weights of the perceptron, which gives you the correct classification. This is where both things are brought together: at the top you have the diagram showing the solution space corresponding to the training samples given here, which we worked out a few slides back; class T1 has two samples x1 and x2, and you can see the corresponding hyperplanes; for class 2, labelled T2, there are two samples x3 and x4, for which these two are the corresponding hyperplanes, this one for x3 and this one for x4.
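Putting both cases together, here is a hedged sketch of the full incremental rule over the hypothetical toy set defined earlier. Using signed targets t = +1 or -1, the two branches collapse into a single test: if t * g(x) is not positive, add eta * t * x, which moves w along +x for misclassified class-1 samples and along -x for misclassified class-2 samples.

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Incremental perceptron learning starting from random weights."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])          # start with a random weight vector
    for _ in range(max_epochs):
        changed = False
        for x, target in zip(X, t):
            if target * (w @ x) <= 0:        # sample is on the wrong side of the hyperplane
                w = w + eta * target * x     # +eta*x for class 1, -eta*x for class 2
                changed = True
        if not changed:                      # every sample correctly classified:
            break                            # w has reached the solution space
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([+1, +1, -1, -1])
print(perceptron_train(X, t))
```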
The corresponding normal vectors are shown here, and below it is the incremental learning algorithm for learning the weights for this set of training samples. So we are looking at a perceptron learning algorithm, a learning algorithm which, I must say, is not a variant of Bayes: we had the Bayes classifier, the Bayes decision rule and Bayes' theorem, a statistical classifier working on probability distributions, class priors and class-conditional probabilities; we are not going to do that any more. What we are looking at here is just a perceptron giving us a linear discriminant function: what should the weights be, given a set of training samples? We will work in a very simplistic situation, in 2D with two samples per class; it is left to the observer, reader or viewer to think about what the solution space becomes if the number of samples grows much larger, how this learning algorithm works in terms of weight assignments, and whether it still leads us to the solution space. What are we doing here? Look at this: this is the initial weight, let us say assigned at random (I said you can start with random weights), or, if you like, say this is the weight obtained after a few iterations; that is also fine with me. Then, in one shot (this is a simplistic example, mind you, maybe just one iteration), we will see whether we get closer to the solution space, or even into it, if possible. So you want to apply this sequence of steps to these samples. What does that mean? I pick up the sample x1, which belongs to class 1. Tell me: will this criterion be satisfied, that is, is g(x) negative? Remember, this is the hyperplane for x1, the positive side is here, and my weight is here; will the condition be satisfied, yes or no? Yes. Why yes? Because the weight is on the negative side of the hyperplane for sample x1. So it means I should travel in a direction towards the positive side, as given by the vector x1, the normal to the hyperplane. I will show this movement as a sequence of steps for all four samples in one shot, rather than one after another, so that you can visualize it. Corresponding to sample x2, this is the hyperplane and this is the positive side: will the condition be satisfied or not? If it is satisfied, you will again increment the weight along that step, that is, move the weight along x2; of course you move first along x1, then along x2, so the path of this point is, as you see my cursor moving, along x1 and then along x2. Let us move to the samples from class 2; there are two of them, x3 and x4. For x3, this is the hyperplane and this is its positive side, with respect to the current w, wherever it is right now. Now, for a sample belonging to class 2, will the condition be satisfied, that is, is g positive for this sample with respect to the current weight and the hyperplane? The answer is yes, so the condition for a class-2 sample is true and I have to push the weight towards the negative side, that is, move along the negative of x3. So after moving along x1, then along x2, I move along the negative of x3; and finally for the sample x4 the same thing happens: if the weight is here or somewhere in this region, the condition is again satisfied, and I finally move in the negative direction, as pointed to by the negative of x4.
If you look at the trajectory, it moves along x1, then along x2, then along the negative of x3, and finally along the negative of x4. If I draw these four steps, approximately keeping the magnitudes as shown by the corresponding vectors, you get something like this: this is the step along x1, then along x2, then the one corresponding to the negative of x3, and finally the one for x4. This is an approximately hand-drawn curve, not machine-computed, and the animation shows one pass of iterations over all the samples; as I said, maybe you are at an intermediate stage of iteration or very close to the final solution. You have to keep iterating until you reach the solution space, because at the solution space none of these conditions holds any longer. If you look back: if you are finally here (this is the final position, marked by my cursor), after using all four samples, what happens at this point? For samples belonging to T1 the condition is not satisfied, because you are on the positive side of g with respect to T1; similarly, at this point, for samples belonging to class 2 you are on the negative side of g, so that condition is not satisfied either, and the movement of the weight vector stops. But if, at any iteration, you have not yet reached the solution space, one or both of these conditions may be satisfied, and you keep incrementing the weights as given by the expressions. Look at the expression: in one sense, for samples of class 1 it says move along the positive direction of x, and for samples of class 2 it says move along the negative direction of x. What may happen initially is that you start with a random set of weights, so it is possible that you are very far away from the solution space; the solution space could be somewhere here and the weight vector here. You move around, come closer to the solution space, apply the same logic again, and slowly, iteratively, you move closer to it; the diagram, if you go back, shows this happening in just one pass, but that will not usually happen. That is point number one. Point number two: I have not yet discussed the significance of eta, the learning-rate parameter, which decreases with each iteration. Where is this eta? It sits here, as a coefficient of the vector x, and it tells us how much you want to change: would you like to change by the full magnitude of the vector x? Well, initially yes; so eta_k starts with a larger value, which could be equal to 1, say, some normalized value of 1. But as the iterations proceed, as the learning algorithm moves towards convergence, you reduce the value of the learning-rate parameter eta_k. Why do you need to do that? When you are moving towards the solution space you do not want to make large movements, because if the solution space is restricted (not the case shown here, but as I said before, with a large number of samples the solution space may actually become finite), a large step may make you jump over the solution space; instead of converging you move off in another direction, and then you require a larger and larger number of iterations. Typically, in certain cases of neural networks, the feed-forward neural network, which is basically called a multi-layer perceptron (whereas what we are learning here is a single perceptron), you can make a single-layer neural network with a set of
perceptrons, and then you can have multiple layers of perceptrons. We do not have scope for discussing those in this particular course, although later on (not now) we will give certain expressions for learning in a very general case, which can implement linear or even non-linear decision boundaries; for now we take the case of the single perceptron as the simple example, and the others are extensions of it. You need to reduce the learning-rate parameter; it is usually reduced by an exponential function, starting with a large value and brought down to a smaller value. Why do you do so? Because when you are approaching the solution space you want to move only a little closer to it at each step, instead of jumping with a large value; initially, when you start with the random set of values, you do want to move with larger steps, like the somewhat exaggerated movements shown here, which may happen at the beginning, but towards the end you move in small increments, and that is why eta_k, the learning-rate parameter, is brought down. Of course, what may also happen is that you are lucky enough to be already very close to the solution space, and in one or two jumps, one or two iterations as they are called, you reach it; then what happens? The inequality is no longer satisfied, the algorithm does not move the weights around any more, and it stays put in the solution space.
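The lecture does not give an exact decay formula, so purely as an illustrative assumption, here is one common choice, an exponential schedule eta_k = eta_0 * exp(-k / tau), which starts near 1 and shrinks as the iterations proceed.

```python
import numpy as np

def eta_schedule(k, eta0=1.0, tau=20.0):
    """Exponentially decaying learning rate: large steps early, small steps near convergence."""
    return eta0 * np.exp(-k / tau)

# step sizes over the first few iterations (illustrative values only)
print([round(eta_schedule(k), 3) for k in range(0, 60, 10)])
```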