So we want to finalize the SOM. We got to the last step, which is the weight update, and then we ran out of time. We talked about competition and collaboration: competition by finding the minimum distance, basically, and collaboration by using a neighborhood when sharing the weights. The last step is then the weight update. For every neuron j, at iteration n+1, I adjust the weight using what was there before plus a little bit of change: w_j(n+1) = w_j(n) + Δw_j.

We will not go much into detail on this today, because we still don't know neural networks, but this adjustment is what is supposed to make many techniques intelligent. You start with a synaptic value, a weight on the connection between two things, usually two processing units, two neurons, whatever. If I cannot adjust it, there is no movement, there is no progress, there is no learning. Learning is basically adjusting the weights, and the way SOM does it is quite straightforward: it uses a factor η that we call the learning rate, times the output times the input, minus a function g of the output y_j times the weight. So Δw_j = η y_j x − g(y_j) w_j. Okay, that's a little bit too much at once. The first part is based on Hebb's rule, and the second part people call the forgetting rule. The η is just a constant; we need a constant to control the level of adjustment, and we will get to that.

The point is that if you have something going into a synapse and something coming out on the other side, and both are active together, the connection grows. That is what Hebb established, and it is formulated in the general rule: neurons that fire together wire together. That's the basic principle of Hebb's rule. You can implement it in many different ways, but basically the product of input and output, pre-synaptic and post-synaptic, gives you that growing together. Okay, don't panic, we will get back to this; we will do this from scratch when we get to the perceptron.

For SOM, this g(y_j) is simply the learning factor η times the output y_j, and then we replace the output y_j with our neighborhood function. So you take a weighted value of the neighborhood function, which is what is supposed to control your collaboration. This is the neighborhood function we talked about before: the neighborhood value times the weight says to what extent I am sharing, generously or greedily. That is up to us. That value, which is again a number between zero and one, decides how much of the weight is being shared, or whether you forget about it and it is not shared.

So now we can completely write the updating rule: w_j(n+1) is w_j(n), you always take the value, whatever it was, and then we update it. Then we have η, the learning rate, which can also be a function of the iteration. Whatever change you want to make, whatever you want to add or subtract, I want to have a little bit of control over it. η is between zero and one: maybe I don't want to make a lot of changes, then η goes toward zero; maybe I want to make a lot of changes, then η goes toward one. So we want control over that, and ideally that control should be a function of time, because at the beginning we want to make a lot of changes, and toward the end we don't want to make big changes, because you could easily destroy whatever you have
learned by making big changes toward the end. So the full rule is: w_j(n+1) = w_j(n) + η(n) · h_{j,i(x)}(n) · (x − w_j(n)). I write η(n) and h_{j,i(x)}(n) with the n because both are functions of time: the learning rate decays, and so does the neighborhood function. As we said, at the beginning you are very generous, and then you get greedier and greedier and do not share weights with anybody anymore. The last factor is x minus the weights, whatever they are at time n.

This becomes a pattern that we repeat: any value that we use has an initial value, and then we apply some sort of exponential decay. For the learning rate, η(n) = η₀ · exp(−n/τ₂), where τ₂ is again another constant that we are using. For example, η₀ could be 0.1 at the beginning and τ₂ could be a thousand. And τ₁, from the other exponential decay that you had before for the neighborhood width, σ(n) = σ₀ · exp(−n/τ₁), could be, let's say, 1000 over the log of σ₀, the initial standard deviation, and that has to be big enough. (A small sketch of this update step in code follows below.)

Okay, I'm throwing a lot of material at you at the beginning of the lecture, I know. Take it easy, we will get into this. I want to get SOM out of the way, because I was handicapped with it: we have not talked about neural networks yet, and I had to present SOM as a clustering technique. So we are talking about a neural network without knowing neural networks. We will come back to this and revisit it; we will establish it for the perceptron, we will talk about the Hebbian rule, the generalized Hebbian rule, and all that. For the time being, all I want you to understand is this: learning is making adjustments, and determining the amount of adjustment requires a lot of effort. And that goes back to Hebb. All the time you are using some notion of the Hebbian rule, and that Hebbian part may be difficult to understand for now; we will get to it. On top of it you bring in some specific type of learning that you are incorporating, in our case the neighborhood, to bring a balance between competition and collaboration. The rest is just playing: I need something to control, η, α, β, whatever, and I need it to be a function of time. Of course it cannot be constant: I want to be more generous at the beginning and more conservative at the end. And of course everything goes back to this difference between the input and the weights.

We had some discussions with some of you after the last lecture, so again: SOM is constrained k-means. What is constrained is the location of the means: their positions on the grid are fixed, and we let the weights be adjusted to the inputs. That's another type of clustering, basically.
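Here is the promised sketch of one SOM iteration, a minimal version assuming a Gaussian neighborhood, a 1-D grid of neurons, and 2-D inputs. The constants η₀ = 0.1, τ₂ = 1000, and τ₁ = 1000/log(σ₀) are the ones from the lecture; everything else (σ₀ = 5, the random data) is made up for illustration:

```python
import numpy as np

def som_update(W, grid, x, n, eta0=0.1, tau2=1000.0, sigma0=5.0):
    """One iteration of w_j(n+1) = w_j(n) + eta(n) * h_{j,i(x)}(n) * (x - w_j(n))."""
    tau1 = 1000.0 / np.log(sigma0)         # decay constant for sigma, as in the lecture
    eta = eta0 * np.exp(-n / tau2)         # learning rate: generous early, conservative late
    sigma = sigma0 * np.exp(-n / tau1)     # neighborhood width shrinks over time

    # Competition: the winner i(x) is the neuron with minimum distance to x.
    winner = np.argmin(np.linalg.norm(W - x, axis=1))

    # Collaboration: Gaussian neighborhood around the winner on the fixed grid.
    d2 = np.linalg.norm(grid - grid[winner], axis=1) ** 2
    h = np.exp(-d2 / (2 * sigma ** 2))     # value in (0, 1]: generous vs. greedy sharing

    # Weight update: every neuron moves toward x, scaled by eta and h.
    return W + eta * h[:, None] * (x - W)

# Usage: 10 neurons on a 1-D grid, 2-D inputs.
rng = np.random.default_rng(0)
grid = np.arange(10, dtype=float)[:, None]  # fixed grid locations (the "constrained" part)
W = rng.random((10, 2))
for n in range(1000):
    W = som_update(W, grid, rng.random(2), n)
```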
So what are the issues? For example, the convergence of such a technique. (If you don't understand the update rule yet, just forget about the second term and look at the Hebb part: we are weighting the product of input and output, pre-synaptic and post-synaptic. Keep that in mind; being confused for two weeks is not a big deal. We will get to neural networks and then we will revise this.) So how do we converge? Well, you need many iterations. n has to be a large number. How many? It depends on your data. Give me something: I don't know, 5,000? 10,000? It depends on how difficult the problem is. Generally, as a rule of thumb, we are talking about several thousand iterations times the number of units. So if you have a hundred neurons, several thousand times a hundred: several hundred thousand. Wow, that's a lot. That's an approximation; there is no formula for it, just to give you an idea: the bigger the map grows, the more iterations you need to converge.

How do I stop? Well, I can stop if there is no noticeable change. If you see that your Δw's are getting smaller and smaller and smaller, you're not making significant changes anymore, then you stop. "No noticeable change" means basically no big change in the feature map. And yes, we also call self-organizing maps feature maps, because they try to mirror the features of the inputs.

What problems do we have with SOMs? I was complaining that they are being ignored; maybe they have some issues, and that's the reason people don't use them. Well, generally, convergence may take a long time. But the bigger problem, the one I have observed with SOM in practice, is that you get variable results. That's a much bigger problem. If you run SOM three times, you get three different sets of clusters; it will not always give you the same clusters. Why is that? Well, because you are not telling it the number of clusters, it is supposed to figure that out, and that's not easy.

There are different variations of SOM; I may put up some links so you can look at extensions of SOM. One of the things that I did in practice to get rid of this problem, that running SOM three times gives you three different sets of clusters, was this: you run it n times, you get n clusterings, and then you merge. You count how many times two elements were in the same cluster; if it happens often enough, okay, they belong to the same cluster, put them together. You can do tricks like that to get around it (a sketch of this merging trick follows below), but it is a real problem of self-organizing maps. And keep in mind: what we cover in the lecture is always the initial idea. There are virtually hundreds of variations and extensions to any method we talk about, so we usually don't talk about the extensions, version 10 and version 20; some of them we will cover in the tutorial.
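Here is the merging trick in code. The co-occurrence counting is the idea from the lecture; the greedy merge and the 0.5 threshold are my own illustrative assumptions, not a fixed recipe:

```python
import numpy as np

def co_occurrence(labelings):
    """labelings: list of label arrays, one per run, each of length n_points."""
    n = len(labelings[0])
    C = np.zeros((n, n))
    for labels in labelings:
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(labelings)      # fraction of runs in which points i and j co-clustered

def merge(C, threshold=0.5):
    """Greedy merge (a rough stand-in for proper connected components):
    points co-clustered in more than `threshold` of the runs end up together."""
    n = C.shape[0]
    cluster = -np.ones(n, dtype=int)
    next_id = 0
    for i in range(n):
        if cluster[i] == -1:
            cluster[i] = next_id
            next_id += 1
        for j in range(i + 1, n):
            if C[i, j] > threshold and cluster[j] == -1:
                cluster[j] = cluster[i]
    return cluster

# Usage with made-up labelings from three hypothetical SOM runs:
runs = [np.array([0, 0, 1, 1]), np.array([2, 2, 0, 0]), np.array([1, 1, 1, 0])]
print(merge(co_occurrence(runs)))  # -> [0 0 1 1]: points 0,1 together; 2,3 together
```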
Okay, so we want to put SOM aside. We will come back to the learning part; I just wanted to freak you out a little bit. We get to that in two weeks, and then we go deep and try to understand every bit of it.

Good. So we want to leave clustering and go toward another topic, which is classification. Oh, this is a big one. This is a huge one. And the hypothesis of this one is: intelligence is the ability to distinguish things. That's it. That's intelligence. So you go back to your feature domain: this is your feature one, this is your feature two. Again we stick with the two-dimensional case, just to be more convenient. Then we have some data points, and I'm drawing them differently because we assume we know what is what, but the algorithm doesn't. Let's have something simple like that. Then you run something, and we said: if you can separate them, if you can draw a line that separates the circles from the triangles, then this is classification. So you find a line, basically, which is w·x + b = 0 if I write it in a general form. (A tiny sketch of classifying with such a line comes at the end of this part.) You find a line that can separate these objects from those objects. It doesn't matter what they are; the circle and triangle are just for visual presentation, they don't mean anything, and the algorithm has no idea what is a circle and what is a triangle. We just get some numbers, like the car example with the horsepower and the maximum speed, and then the algorithm has to find this line. You have a large number of techniques that do that. Neural networks do that; classification at the moment is the most successful application of artificial neural networks, and we will see that they do it in a very sophisticated way.

So what is the problem? The problem is: okay, you do this. Could I also separate them like this? Yeah, sure. Or I could do this, is that okay? Yeah, still correct. Or I could do this, or this, or this, or this. Every time that you train a neural network, it will give you one of those lines. So what's bad about that? The question is: is there a perfect, mathematically expressed, optimal line? That's the question. I don't want just any line every time you give me a line; is there a perfect one? And optimal means this: when you take this line and you go into practice and another circle appears, you are still correct. Generalization. You learn a line that separates dogs from cats, and then you go into reality and you get a shih tzu. Can you say it's a dog? Yeah, okay. You get a Shiba Inu. Can you say it's a dog? Well, we haven't seen many Shiba Inus, so can you do that? Suddenly I have a triangle here, and then I get a triangle there, and with the line that you learned, somehow this happens: the line is not perfect, not optimal, which means you cannot generalize properly.

So how do we do that? Well, we want to start with the best classifier we have, before we get to the ones that need a lot of training. And of course that best classifier is support vector machines. So we want to talk about support vector machines, SVM. Support vector machines went after exactly this problem. You get your feature space of feature one and feature two, and there are your circles and there are your triangles. Support vector machines want to find two lines; they want to find the perfect line, and the perfect line is in the middle of the boundary that actually separates the two classes. You have some of the circles that are on the line on this side, and some of the triangles that are on the line on that side. The ones that are on the lines are support vectors. So this is a support vector, and of course this is a support vector. Why do we call it a vector? Well, it has two values: it's a vector. And usually it has more than two values, several hundred.
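As promised, here is the bare idea of classifying with a line in code form: the label is just the side of w·x + b = 0 you land on. The w and b below are made-up numbers, not learned from anything:

```python
import numpy as np

w = np.array([1.0, -0.5])              # normal vector of the line (made up)
b = -0.2                               # bias: shifts the line (made up)

def classify(x):
    """Label a point by which side of the line w.x + b = 0 it falls on."""
    return "triangle" if np.dot(w, x) + b > 0 else "circle"

print(classify(np.array([1.0, 0.3])))  # one side of the line -> "triangle"
print(classify(np.array([0.1, 0.9])))  # the other side       -> "circle"
```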
So the support vectors are the ones that define the margin: this is the margin, the margin between the two classes. And for the first time we approach this problem and ask: can I find the best one? Because this line can change, as I indicated: it can be like this, like this, like this. I can virtually draw thousands of those lines, and for each one of them I also get two lines that go through the support vectors, and then I draw one in the middle and say: okay, this is the middle, nobody should be in here, nobody should be in here. And of course, if that gap, that margin, is maximized, you can keep things separate and you are relatively safe, as long as the god of complexity lets us be safe: safe in the sense of not violating the classification rule on new, unseen data. So we want to maximize this margin; support vector machines belong to the class of margin-maximization methods. How can I maximize this margin? How can I find the line, not just any line, but the line that gives me the maximum margin? Because if I rotate the middle line a little bit, ten degrees to the left, this cannot be the maximum anymore; the margin shrinks. So can I find that? People said: we have neural networks, they do that. Well, but neural networks, like SOM: every time I run them I get a different solution. Yeah, but usually it's a good solution. I know, but can you guarantee that it's an optimal solution? No, sorry, I can't. Okay, but I want a guarantee.

Okay, wow. We can work on it; it will not be that easy. This line, for us, is again w·x + b = 0. This is the line in 2D, and it is of course a hyperplane in n dimensions. If I go to dimension one hundred, I still have a plane; it will not be a surface with curvature, it will be a plane. And things get really complicated there. Okay, so we want to do that. How do we do it? We have to make some assumptions. For example (my drawing may not be really precise), suppose the weight vector has a certain characteristic: this is the weight vector w, and w is perpendicular to the middle line. So from the beginning I am working with vectors, and I will look at dot products. Two vectors orthogonal: maximum difference, dot product zero. I want the middle line to be the base where everything is zero, and then I want to say: I have two cases, either w·x + b is greater than zero or w·x + b is smaller than zero. I could say that, right? Either things are positive or they are negative, measured from the center.

The example of SVM is interesting because it stands in contrast to neural networks, where we say: okay, I have some processing units, they are capable, just throw the data at them, they will figure it out. And after they figure it out and we get 99% accuracy and somebody asks us how we did that, we sit down and say: hmm, I don't know. SVM is different. With SVM, from the beginning, we are designing the intelligence step by step. That's why it took 30 years. And now we want to cover it in 30 minutes, which will be tough, because we have to skip some details. But that's basically it.
I want to find a line, but a line that is not crossing any of the objects: it is in the middle of the two lines that do cross the objects, and those two lines contain the support vectors, they go through some of the instances of the data. But not the middle line; the middle line is the baseline, it is zero, and for it to be zero I need an orthogonal vector here.

Okay, this guy loves his mathematics. So what? Well, we can come up with some assumptions; we need some sort of assumption, because otherwise we cannot do this. Our classes belong to the set {positive, negative}. I will entirely focus on binary classification: yes or no. Is it the circle or is it the triangle? If you add squares there, I cannot do anything; if you have three classes, I cannot help you. Well, but that would be quite limited. I know, but let me get started; let's establish something for binary, for two classes. If we're lucky, we can extend it to three, four, five classes. But two classes is easy. Binary classification is easy. Nothing is easy, but compared to multiclass it is easy.

So how can I do that? Now the question is this. If I formulate the problem like this: again you have your features, let me draw it this way, and I have my support vectors. Here I have negatives, here I have pluses; I have pluses of course all over the place here, and negatives all over the place there. Everything from the middle line in this direction is negative, everything in that direction is positive. So I could actually say, you know what: w·x₊ + b should be greater than or equal to +1, and w·x₋ + b should be less than or equal to −1. Why am I doing this? The middle line is zero, and from there I have binary classification, plus and minus. I could do plus and minus one hundred, but that's not very convenient; I will keep it simple. Unity is always good: just one.

And now, of course, let me see how I can do this. This is your w, which is perpendicular, orthogonal, to the middle line. We don't know how big, how long w is; I don't know the values, this is just a direction. I have to figure out the values; that's the learning part. So I don't know whether it reaches to here or here or here. All I know is that it has to be orthogonal for the middle line to be zero. Why do I need a baseline? Because inside the margin there is nothing going on, and the classification starts when I reach the support vectors and the lines that go through them, parallel to the middle line.

Now what is the task? The task is: you get an unknown vector u. I don't know where it is, right? It could land anywhere, it could go up, I don't know where it is. So how do I say that the vector u is positive or negative? That's the classification, right? And don't look at this in two dimensions; look at it in a hundred dimensions. We are not seeing it, the algorithm is operating on it. It has to calculate some stuff that says: oh, this one is not here; if it were here, it would be negative. This guy is actually here, so it's positive. How do you know? From the features. So how do you find the plane that separates things?

Okay. So we have to do a little bit of work to get there, because this apparently is not that easy. So again: w · x₊ + b, we say, has to be greater than or equal to +1, and w · x₋ + b has to be less than or equal to −1. And of course these are dot products.
I am not using the bold notation to mark the vectors; I will just write, and from the context we should understand whether something is a scalar or a vector. So I am using the dot product, and the dot product is crucial: without the dot product there is no SVM, as we will see. It is a very crucial part of SVM.

Yes? [Question: why plus and minus one?] Convenience, easiness. Anything else you use makes the computation complicated and the proofs more complicated. It's just plus one and minus one; it doesn't change anything. I am setting it up; it is arbitrary. You could say plus one hundred and minus one hundred, but then you go ahead, you get to the part where you have to build the derivatives and do the algebra, and you get in trouble. People did that, and they came back and said: you cannot do this, choose something simple. A lot of this is mathematical intuition: define it this way. You can define it any way you want; can you get away with it? If you can get away with it, nobody will complain. Great. So do it with plus one and minus one.

Now we also introduce a dummy variable, and we say: y_i is +1 for all positive cases and −1 for all negative cases. You see, the fact that I assume I am dealing with binary classification, things are yes or no, black or white, makes life very easy; now I am introducing a dummy, and my dummies are also simple. Why am I introducing a dummy? I have two sets of inequalities and I am lazy. I cannot drag two sets of inequalities with me; I want to make them one. But one of them is ≥ +1 and the other is ≤ −1, two different things. So can I do some trick to make them one? If you bring in a dummy variable like this, you can. It is just something we can multiply with. For the positive cases, y_i (w·x₊ + b) ≥ 1, right? Because y_i is +1 for the pluses, so it changes nothing, it does not change the direction of the inequality. And for the negative cases, y_i (w·x₋ + b): minus times minus, and the inequality flips around, so this is also ≥ 1. So suddenly both inequalities are the same, just by introducing a dummy variable. It is not part of the data, but I could do that because I assumed I am doing binary classification, yes or no. And to keep things simple, the dummy values I introduce are also just yes or no: I will not go with +9.8 and −9.8, I will go with +1 and −1. Keep it simple.

So now I have one condition: y_i (w · x_i + b) − 1 ≥ 0, where I can now write a general x_i, not positive, not negative, just general. Okay, now I have my first condition. In it, y_i is my dummy variable for class membership; w is the weight vector that I have to find, which is orthogonal to the middle line that perfectly separates the data; x_i is my data, doesn't matter plus or minus, class yes or class no; b is my bias, the constant that shifts the plane; and the whole thing is greater than or equal to zero. (A tiny numeric check of this condition is sketched below.)

Okay, so what? Be patient. It didn't happen in two months; it took 30 years. And the guy had to emigrate from Moscow to New York to make it work at the end of the day. I don't know, it must be the weather in New York, in Manhattan especially.
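A tiny numeric check of the combined condition, with made-up w, b, and data; the point is that one expression covers both classes once the y_i are in place:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.5],       # positive class (made-up points)
              [-2.0, -1.0], [-1.5, -2.5]])  # negative class (made-up points)
y = np.array([+1, +1, -1, -1])              # dummy variable: class membership

w = np.array([0.5, 0.5])                    # a candidate weight vector (made up)
b = 0.0                                     # a candidate bias (made up)

margins = y * (X @ w + b) - 1               # y_i (w.x_i + b) - 1, one expression for both classes
print(margins)                              # all >= 0 -> (w, b) satisfies the condition
print(bool(np.all(margins >= 0)))
```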
Okay, so now I have this condition; then what? What is the best classifier, if I want to use this for classification? The best classifier is the one with the largest margin. This margin has to be maximum. This is a line, right? This is the line, this is the plane. If you find the w and you find the b, you are done; you have found the line. For God's sake, we can find a line, right? Well, we are finding it in multiple dimensions and the algorithm is in charge; we don't see anything, we cannot visualize it, and that's the difficult part. So I want to find the line, but with one condition, because everybody else can draw some line. I want to be the special guy who gives you a guarantee and says: I give you the line, optimal, perfect. Everybody can do the rest; from this point onward you have to do something special. Something has to give to bring in the intelligence.

And what happened? It was the late 90s. Mathematics was dead in artificial intelligence; this is not an exaggeration. Mathematics as a vehicle for artificial intelligence was marginalized, and the perception was that artificial intelligence does not go together with the conventional mathematics we have. And then, in 1995, Vapnik came with this idea, and wow, that was the victory of mathematics: you can do intelligence with linear algebra. Oh my god.

Okay, what did he do? Well, something really simple. We need to maximize the margin; that's the goal, that's the objective. How can I maximize the margin? If I look again: these are my positives, anywhere; these are my support vectors that are positive; these are the support vectors that are negative; and I have some negatives here, of course. This is my center line, and I want to find this margin. I want to find M. What is M? If I find M, I will solve the classification problem in a way that maximizes the margin. You can visualize it: you find this line and the margin is that much, you find that line and the margin is that much, and you find the optimal one where the margin is largest, and you stick with it. That gives me the best generalization.

So let's look at two support vectors: this one is an x₋, and this one is an x₊, as a general example. If I build the difference between these two vectors (you have not forgotten how to build the difference between two vectors) and project it onto the normal vector, the vector perpendicular, orthogonal, to the middle line, I get the margin. So the margin is the projection of (x₊ − x₋) onto the unit normal. Any vector orthogonal to the middle line will do, but it has to be normalized. (x₊ − x₋) · w points in the right direction, but how do I make it a proper projection? Somebody help me out: divide by the magnitude. So (x₊ − x₋) · w / ‖w‖: the difference times the unit normal gives you the width, because that is the projection of the difference onto the normal. If you don't get this, please review linear algebra this weekend. That's it. Very simple, no magic; it has nothing to do with AI, it is just regular, conventional, boring mathematics.

Okay, good. So what? Why did he do that? Well, because he knew where he wanted to go. I am talking about Vladimir Vapnik.
Vapnik knew where he wanted to go: he wanted a formulation of the line equation that contains w in such a way that it maximizes the margin. You know where you want to go, but you don't yet know whether to go this way or that way. Otherwise this step appears random: why do this? Well, I want to bring in w as something that affects the margin. So the margin is (w · x₊ − w · x₋) / ‖w‖. Then what? We bring in our condition, our actual bread, the thing we want to classify with. We know that w · x₊ + b = 1 on the positive support line, so w · x₊ = 1 − b, and w · x₋ + b = −1 on the negative one, so w · x₋ = −1 − b. Is it clear why? The constraint is tight, exactly one, for the points sitting on the support lines, and then b goes to the other side. So the numerator becomes (1 − b) − (−1 − b) = 2. We have no problem with that; I just simplified, nothing happened. So the margin becomes 2 / ‖w‖.

Seems made up, doesn't it? The first time you see this, it seems as if somebody made it up. Why? Because it says: maximize two over the magnitude of your vector, and you're done. Well, I can even drop the two, that's a constant: I want to maximize 1/‖w‖, which means what? Which means you want to minimize ‖w‖. That was in front of us all those years and nobody saw it. Well, don't forget the orthogonality, but that aside: you minimize the vector and you're done. But what does "minimize" mean here? Can I just set it to zero and be done? No. Minimize it subject to the constraint; you still want to draw a line. You cannot minimize ‖w‖ for its own sake, that makes no sense; minimize it while satisfying y_i (w · x_i + b) − 1 ≥ 0.

Okay, now I have a constrained optimization problem, and I'm sure all my colleagues who work on optimization get happy and smile again, because yes, AI cannot work without optimization; AI is optimization. There are not many techniques in the past 30 years that exhibit the beauty of SVM; it is just brilliantly simple. So the problem is: minimize the magnitude of my vector subject to y_i (w · x_i + b) − 1 ≥ 0. You have constraints, of course: I want to draw a line, I don't want to minimize w for the sake of minimizing it, because then, again, I could just set it to zero, but then I cannot draw a line.

For a constrained optimization problem we have had a solution for 150 years. A big part of scientific advance is this: people who sit down, go back into the history of science, and rediscover stuff. The solution is Lagrange multipliers. So you write down your Lagrange function: the function I want to minimize, minus a linear sum of the constraints, each constraint y_i (w · x_i + b) − 1 weighted by its own Lagrange multiplier α_i. Then I have to find these Lagrange multipliers and I am done. But one thing bothers us: we have to build derivatives, and the plain magnitude ‖w‖ is not nice to differentiate. So let me write ½‖w‖² instead. What, just like that? Of course; who can prevent me? I want to minimize this thing, and if I cut it in half I still find the same minimum, and if I square it I still find the same minimum; it doesn't matter. Why do I do it? Because the derivative of x² is a lot friendlier than the derivative of x inside a square root: differentiating ½‖w‖² brings down a two that cancels the half, and w stays; I can work with that. Intuition, mathematical intuition: make it simple.
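To keep track, here is the derivation so far collected in standard notation (the same quantities as on the board):

```latex
\begin{align*}
  \text{margin} &= (x_+ - x_-) \cdot \frac{w}{\|w\|}
                 = \frac{(1 - b) - (-1 - b)}{\|w\|}
                 = \frac{2}{\|w\|}, \\
  \max_{w,\,b}\ \frac{2}{\|w\|}
    \;&\Longleftrightarrow\;
    \min_{w,\,b}\ \tfrac{1}{2}\|w\|^{2}
    \quad\text{subject to } y_i\,(w \cdot x_i + b) - 1 \ge 0, \\
  L(w, b, \alpha) &= \tfrac{1}{2}\|w\|^{2}
    - \sum_i \alpha_i \bigl[\, y_i\,(w \cdot x_i + b) - 1 \,\bigr],
    \qquad \alpha_i \ge 0 .
\end{align*}
```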
So now we have to build the partial derivatives with respect to two things. This is still a line: w is one factor and b is the other. w is the parameter that spans my line or plane, and b is the parameter that shifts it. So I build the derivative with respect to w (for what w do I get the optimum?) and with respect to b (for what shift do I get the optimum?).

When I build the derivative of L with respect to w, the first term gives me w. That's nice; I wanted w to survive. If I had kept plain w instead of the square, its derivative would be a constant and the parameter I want would disappear, and I don't want that. Then minus: this part has no w, this part has no w, this part has w, and its derivative with respect to w is just its coefficient. So ∂L/∂w = w − Σᵢ αᵢ yᵢ xᵢ, and this has to be zero; we are building the derivatives and setting them to zero to find the optimal values. So w = Σᵢ αᵢ yᵢ xᵢ. Wow. That's a discovery: the perfect orthogonal vector I am looking for is a linear combination of the inputs, of some of the inputs. Why is it a linear combination? Because I have a binary, linear case; I am looking for a line, so of course it is linear. This is too good to be true? No. Sometimes you've got to believe, especially if you don't have powerful computers to do your calculations and you have to accept the invitation of a colleague and go to another country just to run the computations.

Then we take the derivative with respect to b. The first term has no b, so it contributes zero. In the sum, only the b term matters, and the derivative of b is one, so we get minus the sum of αᵢ yᵢ, and this is equal to zero. So Σᵢ αᵢ yᵢ = 0. Whoa. That again seems made up; it cannot be that simple. Why is it that simple? Because we made a big assumption: binary classification that is linearly separable. That's a huge assumption to make. But still, it took half a century to figure this out, so it was not that easy. (What was the question? Yes: we are building the derivative, we treat w as the variable, and the magnitude notation goes away in the calculus.)

Okay, so now I have these two big pieces of knowledge, which means we get a simplified Lagrangian function. You would never see this much math in AI courses before 2005, because back then you talked about symbols and automata and rules and this and that, and nobody would get a word of it. This is simple linear algebra. So how does it look? After I have the Lagrangian and these two pieces of knowledge, I can go back and substitute them in. This becomes messy, and I don't want to do it on the board; it would be seven or eight gigantic equations that I would have to write to simplify it, if I put this and this into the original formulation.
So I give you the result after six or seven steps. Substitute w in the Lagrangian with Σᵢ αᵢ yᵢ xᵢ; after simplification (and the simplification is not the end of the world, it's just a lot of sums that you need to turn around), the Lagrangian becomes

L(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ).

I am writing it in terms of the individual components now; the αᵢ αⱼ yᵢ yⱼ are of course scalars, and xᵢ · xⱼ is of course the dot product. The dot product plays the pivotal role in SVM: everything is about the dot product. We started with the dot product and we end with the dot product.

So now optimize this (strictly speaking, this dual is maximized with respect to the α's), for example via quadratic optimization, which is a well-established method. Which means what? It means I don't do this by hand anymore; we have the solution for it. I have simplified the problem into a quadratic optimization, for which we have a huge library of methods. This is a solved problem: if you have a quadratic optimization problem, it is done. So I do not continue. What was the contribution? We formulated optimal classification as a quadratic optimization. We took a highly complex issue and made it so simple that it can now be solved with quadratic optimization. This was something people had said was not possible in the AI domain, and this is what Vladimir Vapnik did. So now I hand this to a MATLAB function, a Python function, an R function, and say: solve this. It gives me back the parameters I need, and then you can start classifying.

So how do you classify after you solve it? If Σᵢ αᵢ yᵢ (xᵢ · u) + b ≥ 0, then you say it is positive; otherwise you say it is negative. Here u is our new data point, and the xᵢ are our support vectors, which we save as the result of learning, trust me. u is the unknown vector that comes along and says: I have no idea, am I positive or am I negative? It is again the line, still the line: we check whether this value is greater or less than zero, which means we are measuring from that middle line, which was the base for everything we defined. So we can classify. (A small sketch of this whole pipeline follows below.)

But you can only classify yes or no; it is very simple, and you assumed it is linear. Yes. [Question about the indices:] We just take two vectors and work with them; I wanted to make the point that we are working with dot products between two vectors, any two vectors. You grab one support vector and compare it with another one, or you grab a support vector and a new data point: this one is your support vector, and that is the new vector you don't know.

So, okay: this of course only works if you have a linear problem. As amazing as SVM might be, it only works for binary, linearly separable problems. So you have two issues; don't get too excited. You can only do binary, yes and no, and you can only do linearly separable problems. Which means what? Again, I have my F1 and F2, feature one and feature two, I have my circles and I have my triangles, and I can separate them; now I can separate them with my margin maximization in SVM, and I give you the guarantee: there is no other line that can do a better job than mine. SVM gives you that guarantee in writing.
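A minimal sketch of the whole pipeline using scikit-learn's SVC (assuming it is installed); a very large C approximates the hard margin discussed here, and the toy data are made up:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],          # positives (made up)
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])   # negatives (made up)
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # solves the dual QP internally

# w = sum_i alpha_i y_i x_i: dual_coef_ holds alpha_i * y_i for the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_
print(w, b)                                   # matches clf.coef_ for the linear kernel

# Classify an unknown vector u: sum_i alpha_i y_i (x_i . u) + b >= 0 -> positive.
u = np.array([1.0, 0.5])
score = (clf.dual_coef_ @ (clf.support_vectors_ @ u) + b).item()
print("positive" if score >= 0 else "negative")
```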
But what happens if I have something like this? Now I have again some circles, I have some triangles, and I get something like this. Well, there is no way, there is no line that can separate this stuff; it is not going to work. This is not linearly separable. What you need is a curve like this; a line doesn't work. So, a problem that is not linearly separable: that's not good. Okay, we can live with binary. The supermighty neural networks are doing binary classification; there is no shame in binary classification. There are tough binary problems: cancer, yes or no. A very serious problem. So binary does not necessarily mean simple; from the classification perspective it is simple, but the problem is not necessarily simple. But non-linearly separable, that's a problem, and most problems in reality are something like this: we cannot separate them with a line. So we got so excited about SVM; should we throw it away? No, the guy worked on it for 30 years, how can we throw it away? So let's work on it, let's make it work. Let's come up with some other tricks; we have many, many tricks.

To give you a better idea: XOR is a nonlinear problem. If I draw the XOR problem: 0 XOR 0 = 0, 0 XOR 1 = 1, 1 XOR 0 = 1, 1 XOR 1 = 0. There is no way you can draw a line to separate true from false. If I draw this line, I'm wrong altogether; if I draw that line, again no line works. So this is a problem that is not linearly separable, and we can easily understand it; it's just logic. AND and OR are linearly separable; XOR is not linearly separable.

So how do we do this? You know what, sometimes it's good to make things complicated, and then you may see stuff. What is that supposed to mean? Well, if I show you my two hands from your perspective, you see one hand. I can have as much distance as I want between my hands and you don't see it from your perspective. But if I rotate toward you, you see: oh, there is plenty of space between my hands. That rotation is a transformation you apply to the data to make the differences, the boundaries, visible. How do you do that? If I somehow bring some of the points up, like this, I bring a two-dimensional problem into three dimensions, and suddenly you can actually slide something between the white points and the black points: they become separable. In two dimensions they are not separable; I bring them into three dimensions, the black stays down, the white somehow rises up, and now I can slide something in between and say: boom, I can separate them with a plane. Is that magic? Well, it's not magic, it's a trick. We call it the kernel trick. (A small sketch below shows one such lifting for XOR.)

So what is the trick? Again, binary is fine; we want to see how we can apply this fantastic, beautifully designed classifier to real-world problems that are not linearly separable, that are difficult to separate. Assume a transformation T(x), some transformation that brings your data from a low dimension to a high dimension, so that suddenly things become visible. So assume T(x) is a transform that moves x to higher dimensions, making linear separation possible. Then, in the classification rule, we have to calculate T(xᵢ) · T(u) for the unknown u: if there is such a transformation that enables support vector machines to separate things that are non-linear in a linear way, then I cannot work with the dot product of the raw data; I have to work with the dot product of the transformed data.
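One illustrative lifting for XOR, a sketch assuming the ±1 encoding of the inputs (the encoding and the particular map are my choice, not the only option): adding the product x₁x₂ as a third coordinate makes the classes separable by the plane z = 0:

```python
import numpy as np

# XOR with the +/-1 encoding: in 2-D, no line separates the classes.
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1, +1, +1, -1])        # XOR is "true" (+1) exactly when the signs differ

def T(x):
    """Illustrative lift: 2-D input -> 3-D, third coordinate is x1 * x2."""
    return np.array([x[0], x[1], x[0] * x[1]])

# After the lift, the plane z = 0 separates the classes: every +1 point has
# z = x1*x2 = -1, every -1 point has z = +1, so the class is simply -sign(z).
for x, label in zip(X, y):
    print(x, "->", T(x), "  class:", label, "  -sign(z):", int(-T(x)[2]))
```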
Okay, whatever you say. But this would be difficult. And this is a typical mathematician talking: he says we should do it this way, but it is hard. Okay, then let's get there another way. Here is the desire: if we had a function K which takes xᵢ and xⱼ such that K(xᵢ, xⱼ) = T(xᵢ) · T(xⱼ), then we would not need T. If you understand this immediately, you are a genius, because the first time I heard it I said: what? Again? Come again? Can you do the transform without doing the transform?

Why would such a thing help? Assume you have this: take a sheet of paper, draw some circles in the middle, and draw some pluses all around them. Is there a line that separates the circles from the pluses? No. Okay, but if this is a sheet of cloth, take the corners and pull them up: this becomes separate from that, and now you could cut them apart (what, don't actually cut it!). So what does the transform do? It makes the separation possible. Now: find a function that gives you the same benefit without actually going to the higher dimension. Don't go to the higher dimension; that's expensive. You already have a thousand features; you want to go to two thousand dimensions? Who is supposed to make those calculations? And maybe in dimension one thousand and one, one thousand and two, you are still not able to separate. So don't do the actual transform.

This is called the kernel trick, and it is one of those magical moments in the history of AI. All we need is a kernel, a special function, a kernel function K. We do not actually need to do the transform; we do not need T(x). We just got the idea from it: somebody told us, if you go from here to there you get the benefit, but going there is expensive, so how can I get the benefit without going there? What we need is T(xᵢ) · T(xⱼ), and not T(xᵢ) and T(xⱼ) individually. I think that makes it a bit clearer: what I need is not the individual transforms, which I then multiply. Can you give me the dot product of the transforms of both without doing the transforms individually? That is what the kernel does. And if you go to any computer science, AI, or machine learning conference, there are sessions after sessions dedicated to kernels, because these are very special functions with this amazing property: they exhibit the benefits of a possible transformation without doing the transformation.

Kernels are fundamentally functions that measure similarity. That's it. People don't want to tell you that at the beginning; I don't know why some people are so economical with that (and some are also economical with the truth). So just say it: this is similarity measurement. What are some popular ones? A popular kernel is K(u, v) = (u · v + 1)^d, the polynomial kernel. Another kernel, which most likely you know, is built on the difference between u and v: the Gaussian, commonly written K(u, v) = exp(−‖u − v‖² / (2σ²)). Oh, the Gaussian is also a kernel? Of course it is. It's a good one, an important one, a powerful one. (Both are sketched below, including a check of the kernel-trick identity for the polynomial one.)
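A sketch of the two kernels from the board, plus a numeric check of the kernel-trick identity for the degree-2 polynomial kernel: (u · v + 1)² equals T(u) · T(v) for an explicit six-dimensional map T. The map below is the standard textbook one for 2-D inputs; σ = 1 is an arbitrary choice:

```python
import numpy as np

def poly_kernel(u, v, d=2):
    """Polynomial kernel: (u.v + 1)^d, no explicit transform performed."""
    return (np.dot(u, v) + 1.0) ** d

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel: similarity in (0, 1], equal to 1 when u == v."""
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))

def T(x):
    """Explicit lift matching the degree-2 polynomial kernel (2-D input -> 6-D)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(u, v))       # kernel value, computed without any transform
print(np.dot(T(u), T(v)))      # the same number, via the explicit 6-D transform
print(gaussian_kernel(u, v))   # Gaussian similarity between u and v
```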
So now, which means what? We have a binary classifier that, if you use the dot product of the raw vectors, can separate linear things; and if you want to separate non-linear things, you have to use the kernel values instead of the dot products of the data itself. And what does the kernel give you? A measure of, well, don't take this literally, but fundamentally, in that complicated space that none of us ever sees, it is doing something like the KL divergence we used when we looked at t-SNE: it looks at how close things are to each other. We need to know that, because I want to draw that line; I want to know whether this is an easy job, whether the classes are far apart from each other or close together. So I measure similarity. There is a lot of mathematics behind kernels; they have to satisfy some conditions, and we are not talking about any of that here. But when we put the values through a kernel, we are basically working with similarity measurements instead of the values themselves. That is the implicit transform we talked about, the transform that we don't do: it gives us the product of the transforms, not the transforms. I don't know what was multiplied by what; I just know that the product was this.

Okay, so this is for us the beginning of classification. The next level, of course, is the mighty neural networks, and we will start talking about those next lecture. Yes? [Question: which product?] This one here: to classify, we need the product T(xᵢ) · T(u), not T(xᵢ) and T(u) separately; we just need that product, not the two pieces individually. Is that clear? Okay, we can talk about it afterwards. We start with perceptrons next lecture. And today we have the SOM tutorial, where we will go through the details and some visualization.