Alright guys, good to see you again. Good afternoon. Today I'm going to tell you about an algorithm that's used in system identification. It's called expectation maximization (EM). We are still thinking about structural learning, where we have a set of equations that describes the state update, a set of equations that describes the measurement, and what we want to know are the parameters in those equations: the A, the B, the C, and the noises.

The approach today differs quite a bit from the work I showed you on subspace analysis. Subspace analysis was developed by control theorists; expectation maximization is an iterative learning rule used by people in machine learning. They're just different fields that developed different tools. The fundamental power of the subspace approach is that it gives you an estimate of the number of hidden states, the size of the vector x. In the procedure I'm going to show you today, you have to make a guess about that number, and then, given that guess, you come up with the best estimate of the parameters of the system: A, B, and C. Expectation maximization applied to the state-space equations, to find those parameters, was described at about the same time as subspace analysis; the paper was written in the late 90s.

So that's what we're going to be talking about today. This is now the last part of chapter 9, and I want to start with how to apply the algorithms we've been learning, subspace analysis and today's method, to data you've collected from some individual, some system that learns, when you want to identify that learner.

We did this a few years ago in a task where people were asked to make an eye movement. They were given a target, they made a saccade to that target, and then we jumped the target, so that at the end of their saccade the target that used to be here now appeared over here. At the end of the movement it appeared that there was an error, and this error caused adaptation. Here's the way the experiment went: trial number is on the x-axis and the change in the position of the target, the perturbation ΔT, is on the y-axis. The perturbation began negative, then went positive, then to zero. And what people did is that they adapted: they began to reduce this error, so they began to learn. There was a set break, they forgot a little bit, learned, set break, forgot a little bit, learned something like this, and then the perturbation was reversed. What we did was take data like this and ask: what are the states that describe this learner, and what are the parameters associated with that learning system?

So let me write down the equations for such a thing. Suppose we can imagine the learner as having a state x, and what we're interested in is describing how that state changes. There's some state x, and maybe we don't know its size, that represents a memory state in the learner, and it changes from trial to trial as follows.
The state is going to be carried forward through a matrix A, and the system is going to learn from errors. I'm going to call the error on trial n ỹ(n), and it enters through a vector b, plus a noise ε_x. So the state update equation looks like this:

    x(n+1) = A x(n) + b ỹ(n) + ε_x.

It has some memory state x, with as many dimensions as you like; it learns from the error ỹ; and this b tells us how much it learns from that error.

Then there's the action that the learner produces, the thing that I can measure. That's y, his action, maybe the saccade amplitude. The saccade that he makes depends on where the target is, T, on trial n. And maybe he has some bias, so that whenever I show him the target at 10 degrees he always makes a saccade to 9 degrees; I'm going to call that y_b, some constant bias in his actions. And his performance also depends on his memory state x. So the measurement equation is:

    y(n) = T(n) + y_b + c^T x(n) + ε_y.

This is the system of equations that I want to fit to the behavior of the subject. There's some state that this individual has, that state changes as a function of error and tends to decay toward zero without error, and he performs an action y that I can measure; that action depends on where the target was, on some inherent bias that he has, and on his states, with c telling us how those states are related to the particular action he's making. I don't know c, I don't know A, I don't know b, and I don't know the noises. What I do know is y, the action that the subject produced on trial n, and where I put the target on that trial, T(n).

So I can estimate the error ỹ(n) as the difference between what actually happened and what the subject expected to happen. The target started at T(n), and then I added a perturbation to it; call that u(n). That's the perturbation I plotted over there, the ΔT, the change in the target position that I made. So the target ended up at T(n) + u(n). What the subject expected is his action corrected for his bias, y(n) - y_b. So:

    ỹ(n) = T(n) + u(n) - (y(n) - y_b)
         = T(n) + u(n) - (T(n) + c^T x(n) + ε_y)
         = u(n) - c^T x(n) - ε_y.

The T cancels, and I'm left with the perturbation minus what the state contributed minus the measurement noise.

Now I'm going to put this back into the state update equation:

    x(n+1) = A x(n) + b (u(n) - c^T x(n) - ε_y) + ε_x
           = (A - b c^T) x(n) + b u(n) - b ε_y + ε_x.

That's my state update equation written in the canonical form: there's this matrix (A - b c^T) that I'm going to be estimating, this vector b that I'm going to be estimating, and these noises here.
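To make this concrete, here is a minimal simulation sketch of the learner we just wrote down. None of the numbers come from the lecture or from real data; the perturbation schedule, retention matrix, learning rates, bias, and noise levels are all made-up values just to show the structure of the model.

```python
import numpy as np

# Minimal simulation of the saccade-adaptation learner in the canonical form above.
# All parameter values are illustrative guesses, not estimates from the lecture.
rng = np.random.default_rng(0)

N = 300
u = np.concatenate([np.full(150, 8.0),   # perturbation u(n): target jump, degrees
                    np.full(100, -8.0),  # reversed perturbation
                    np.zeros(50)])       # washout
T = np.full(N, 10.0)                     # target position on each trial

A = np.diag([0.60, 0.99])                # retention of the memory states (hypothetical)
b = np.array([0.20, 0.02])               # learning rates from error (hypothetical)
c = np.array([1.0, 1.0])                 # how the states map into the action
y_b = -0.5                               # constant bias in the action
sd_x, sd_y = 0.05, 0.3                   # state and measurement noise, s.d.

x = np.zeros(2)
y = np.zeros(N)
for n in range(N):
    eps_y = sd_y * rng.standard_normal()
    y[n] = T[n] + y_b + c @ x + eps_y                        # measurement equation
    y_tilde = u[n] - c @ x - eps_y                           # experienced error
    x = A @ x + b * y_tilde + sd_x * rng.standard_normal(2)  # state update
    # equivalently: x = (A - np.outer(b, c)) @ x + b*u[n] - b*eps_y + noise
```

Simulated data like this is also a useful check on the identification procedure below: you can ask whether the fitted parameters recover the ones you simulated with.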
Now, when people perform actions they may have a bias: it may be that when you give them a target they don't move exactly to the target, they consistently undershoot it, say. Yes, exactly; the idea is that I give you something to do, you do it, and for whatever reason you don't go where I asked you to go, you do something other than that, but you do it consistently. You have a bias in your behavior. We want to take that into account when we compute the difference between what you predicted and what you observed: if you had known that you had a bias, then by the time you get to the end of your movement this error here is not really an error, you predicted it. That's why I'm taking it into account this way. In fitting these things to behavior, one of the important questions is: what is the error? The error is the difference between what I observed and what I predicted. And what did you predict? I don't really know what you predicted; all I can do is imagine that your prediction should be based on all the previous data, that on average you have this bias in your performance, so that bias must be part of what you would predict on that trial. I'm guessing that would be the case, which is why I put the bias in here. In principle we have to guess what the error is that you're learning from, and in this case, if there's a bias in your performance, I assume it doesn't go into your errors.

Okay. So what this means is that when one fits data like this, here's the data, here's the performance, let me just fill this in, what one gets is that one ends up with two states, just like what David showed you guys with the two-state model. There's a fast state that rapidly learns and rapidly forgets, and there's a slow state that slowly builds. One can fit this kind of state-space model to actual data and you get this two-state process out of it.

Now, the lecture today is about the problem of fitting such a system of equations to data using a different concept, the concept of expectation maximization. Our problem remains the same. The particular problem that I want to solve is:

    x(n+1) = A x(n) + B u(n) + ε_x,   where ε_x has variance Q,
    y(n)   = C x(n) + ε_y,            where ε_y has variance R.

What we want to do is estimate the parameters of the system; let me call them θ. The parameters of the system are A, B, C, the noise variances Q and R, the initial state x̂_0, and the initial covariance P_0, our uncertainty about that state. These are the things we want to estimate. On the other hand, what we also don't know are the states x̂ from trial 1 to trial N. So we don't know the states and we don't know the parameters.

In EM, what one does is this. In the expectation step (E-step), we fix the parameters θ and we find the states x̂_1 through x̂_N; of course, the way we do that is via the Kalman filter, because if we knew the parameters we could find the states. In the maximization step (M-step), we fix the states and we find θ. You guys already know how to do the E-step. The M-step is what's interesting: if we knew the states, how would we find the best estimate of the parameters? And it's an iterative process: you go back and forth between the two steps until it converges.
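Before deriving the M-step, here is roughly what that alternation looks like as code. This is only a sketch of the outer loop: the two callables passed in, kalman_smoother and m_step, are hypothetical placeholders for the E-step and M-step pieces worked out in the rest of the lecture, not an existing library API.

```python
import numpy as np

def em_state_space(y, u, theta0, kalman_smoother, m_step, n_iter=50):
    """Sketch of EM for the state-space model above.

    theta is a dict with keys A, B, C, Q, R, x0_hat, P0.
    kalman_smoother and m_step are user-supplied callables standing in for
    the E-step and M-step derived in this lecture.
    """
    theta = dict(theta0)
    for _ in range(n_iter):
        # E-step: parameters fixed, estimate the hidden states.
        # Expected to return smoothed means x_hat[n], covariances P[n], and
        # cross-covariances P_cross[n] between x[n+1] and x[n].
        x_hat, P, P_cross = kalman_smoother(y, u, theta)

        # M-step: states fixed, re-estimate the parameters in closed form
        # by maximizing the expected complete log likelihood.
        theta = m_step(y, u, x_hat, P, P_cross)
    return theta
```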
The critical thing in our formulation is going to be what's called the complete log likelihood. What that means is that you want the probability of all the states and all the observations, given the inputs and the parameters θ. It's a likelihood function, but an unusual one: the complete log likelihood includes not only the observations but also the hidden states. So what I'm going to do is write out for you what the complete likelihood is, and then, just like every other time we work with a likelihood, we find the parameters that maximize it. We're going to write the joint probability of all the hidden states x and all the observations y, assuming that we know the parameters of the system and the inputs u; then we take the log of that; and then, in the maximization, we find the parameter θ that maximizes that quantity. That's what we're going to do today. The key step is this: the E-step we know how to do; the M-step is to find the θ that maximizes this quantity.

So let's begin. Suppose I only have the following likelihood: p(x_0, x_1, y_1 | u_0). That is, we only have one observation, y_1, and one transition. (To write this I need u_0 here, not u_1, because u_0 is what takes us from x_0 to x_1; I apologize for that.) What is this probability? Just to remind us of the rule for joint probabilities: p(a, b) = p(a | b) p(b), and conditioned on something else, p(a, b | c) = p(a | b, c) p(b | c). So I'm going to write this joint probability as a conditional:

    p(x_0, x_1, y_1 | u_0) = p(y_1 | x_0, x_1, u_0) · p(x_0, x_1 | u_0).

Now, the first term, p(y_1 | x_0, x_1, u_0), is just p(y_1 | x_1), because y_1 only depends on x_1. And the second term I can write as p(x_1 | x_0, u_0) · p(x_0 | u_0); there is no dependence between x_0 and u_0, so that last factor is just p(x_0). So I get:

    p(x_0, x_1, y_1 | u_0) = p(y_1 | x_1) · p(x_1 | x_0, u_0) · p(x_0).

Okay, that's the first step. Next, I want to write the probability for two measurements: p(x_0, x_1, x_2, y_1, y_2 | u_0, u_1). I'm just adding to this: now I have two measurements, y_1 and y_2, rather than just one. (Yes, x is a vector; good question. And yes, x_0 is a part of θ: if I fix θ, I still have to find what x_0 is; it's one of the things I don't know.) Where am I going with this? I want to eventually write this probability for all the x's from 0 to N and all the y's from 1 to N. So, this one, let me write it over here, it's going to be kind of long:

    p(x_0, x_1, x_2, y_1, y_2 | u_0, u_1) = p(y_1, y_2 | x_0, x_1, x_2, u_0, u_1) · p(x_0, x_1, x_2 | u_0, u_1).

I'm going to break each of these up. The first factor, starting with y_2, is p(y_2 | y_1, x_0, x_1, x_2, u_0, u_1) times p(y_1 | x_0, x_1, x_2, u_0, u_1). And now the second factor, p(x_0, x_1, x_2 | u_0, u_1).
Let's start that one with x_2: it breaks into p(x_2 | x_0, x_1, u_0, u_1) times p(x_0, x_1 | u_0, u_1). Now simplify each piece. The term p(y_2 | y_1, x_0, x_1, x_2, u_0, u_1) is just p(y_2 | x_2), because y_2 only depends on x_2; likewise p(y_1 | ...) is just p(y_1 | x_1). For the state terms, in p(x_2 | x_0, x_1, u_0, u_1) the only things that matter are x_1 and u_1, so it becomes p(x_2 | x_1, u_1); and p(x_0, x_1 | u_0, u_1) breaks into p(x_1 | x_0, u_0) times p(x_0), since x_0 doesn't depend on the inputs. Okay, now we're in good shape:

    p(x_0, x_1, x_2, y_1, y_2 | u_0, u_1) = p(y_2 | x_2) p(y_1 | x_1) · p(x_2 | x_1, u_1) p(x_1 | x_0, u_0) · p(x_0).

So I can write the general case just like that derivation: the complete likelihood is a product,

    p(x_0, ..., x_N, y_1, ..., y_N | u, θ) = [ Π_{n=1..N} p(y_n | x_n) ] · [ Π_{n=0..N-1} p(x_{n+1} | x_n, u_n) ] · p(x_0).

(Yes, x_0 is something we're going to have to find: in this step we have to find θ, and θ includes x_0, in the sense that x_0 is normal with expected value x̂_0 and variance P_0. Sorry, I forgot to write it; thank you.)

And we can write these densities, because we know the model. p(y_n | x_n) is normal with mean C x_n and variance R:

    p(y_n | x_n) = (2π)^(-m/2) |R|^(-1/2) exp( -1/2 (y_n - C x_n)^T R^{-1} (y_n - C x_n) ),

where m is the length of the vector y and |R| is the determinant of the covariance, sitting in the denominator. The next term we need, p(x_{n+1} | u_n, x_n), is normal with mean A x_n + B u_n and variance Q; does that make sense? That's exactly what the state equation says. And p(x_0) is normal with mean x̂_0 and variance P_0, our uncertainty at step zero. So what we need to do next is take the log of this whole product, and then maximize it.
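Since each factor is a Gaussian we can evaluate, the complete likelihood is also easy to compute numerically for any candidate set of states and parameters. Here is a minimal sketch; the indexing convention (x has length N+1 starting at x_0, y has length N starting at y_1) is my assumption to match the factorization above.

```python
import numpy as np

def gauss_logpdf(v, mean, cov):
    """log N(v; mean, cov) for a single vector v."""
    d = v - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(v) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def complete_loglik(x, y, u, A, B, C, Q, R, x0_hat, P0):
    """log p(x_0..x_N, y_1..y_N | u_0..u_{N-1}, theta) for the linear-Gaussian model.
    x: (N+1, k) states; y: (N, m) observations; u: (N, j) inputs."""
    ll = gauss_logpdf(x[0], x0_hat, P0)                       # p(x_0)
    for n in range(len(y)):
        ll += gauss_logpdf(x[n + 1], A @ x[n] + B @ u[n], Q)  # p(x_{n+1} | x_n, u_n)
        ll += gauss_logpdf(y[n], C @ x[n + 1], R)             # y_{n+1} observed at x_{n+1}
    return ll
```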
So let's take the log of that likelihood; the product becomes a sum. From the observation terms I get, for n = 1 to N, minus one half (y_n - C x_n)^T R^{-1} (y_n - C x_n). I also get a term with the determinant of R in it, and I care about that term because R is one of the parameters I need to find: each factor contributes |R| raised to the power minus one half, and there are N of these multiplied by each other, so in the log I get minus one half times N times log|R|. Then there are some constants associated with (2π)^m; let's forget about those because we don't need them. Next, from the multiplication of the state-transition normals, I get for each n another exponent: minus one half (x_{n+1} - A x_n - B u_n)^T Q^{-1} (x_{n+1} - A x_n - B u_n), and by the same determinant argument a minus one half N log|Q| term. (Yes, they are; sorry, correct, thank you very much.) And I forgot one piece: there's also the normal associated with x_0, which contributes minus one half (x_0 - x̂_0)^T P_0^{-1} (x_0 - x̂_0), with the mean I've been calling x̂_0, plus its own determinant term. Putting it all together, the complete log likelihood is

    log L = -1/2 Σ_{n=1..N} (y_n - C x_n)^T R^{-1} (y_n - C x_n)  -  (N/2) log|R|
            -1/2 Σ_{n=0..N-1} (x_{n+1} - A x_n - B u_n)^T Q^{-1} (x_{n+1} - A x_n - B u_n)  -  (N/2) log|Q|
            -1/2 (x_0 - x̂_0)^T P_0^{-1} (x_0 - x̂_0)  -  1/2 log|P_0|  +  const.

So now what I need to do is maximize this log likelihood. Maximize it with respect to what? We want the derivative of this with respect to all the things we don't know: A, B, C, R, Q, x̂_0, and P_0. The idea is that you started with some estimate of A, B, C, R, Q, x̂_0, and P_0; with that estimate of θ we found the best estimate of the states, x̂; and now, given that estimate of x̂ and our prior belief about the parameters, we find a better estimate for them. And we can do that: we maximize this likelihood.

Notice that every term in this sum is a scalar, it's just a probability, just a number, so we're going to be finding derivatives of scalar quantities with respect to matrices. For example, we'll need the derivative of x_n^T C^T R^{-1} y_n with respect to R, or with respect to C. These are matrices, and that's okay: the derivative of a scalar with respect to a matrix is just another matrix. On the website there's a file called "useful math", and among the things in there are how to find derivatives of scalar quantities with respect to matrices. The first one we'll need is the derivative of a^T X b, where a and b are vectors:

    d(a^T X b)/dX = a b^T.

The other one we'll need is the derivative of the quadratic form a^T X^T C X b, because some of the products in our log likelihood are quadratic in the matrix we care about; this multiplication here, for example, is quadratic in A. So we have to know how to find the derivative of a scalar quantity in which the matrix is being multiplied by itself, and that derivative has two terms:

    d(a^T X^T C X b)/dX = C X b a^T + C^T X a b^T.
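Those two identities are easy to misremember, so here is a quick finite-difference sanity check of both; this is just a verification sketch with small random matrices, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(3), rng.standard_normal(3)
X = rng.standard_normal((3, 3))
M = rng.standard_normal((3, 3))   # plays the role of the middle matrix C in the identity

def numgrad(f, X, h=1e-6):
    """Finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Xp, Xm = X.copy(), X.copy()
            Xp[i, j] += h
            Xm[i, j] -= h
            G[i, j] = (f(Xp) - f(Xm)) / (2 * h)
    return G

# d(a^T X b)/dX = a b^T
print(np.allclose(numgrad(lambda X: a @ X @ b, X), np.outer(a, b), atol=1e-5))

# d(a^T X^T M X b)/dX = M X b a^T + M^T X a b^T
grad = np.outer(M @ X @ b, a) + np.outer(M.T @ X @ a, b)
print(np.allclose(numgrad(lambda X: a @ X.T @ M @ X @ b, X), grad, atol=1e-5))
```

Note that when a = b and the middle matrix is symmetric, as in our case with R^{-1}, the two terms are equal, which is why a factor of two will appear below.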
So let's do some of these derivatives, and let's start with C. To do that I'm going to multiply out the terms in the log likelihood that contain C. Expanding -1/2 (y_n - C x_n)^T R^{-1} (y_n - C x_n), the parts with C in them are

    -1/2 [ - y_n^T R^{-1} C x_n - x_n^T C^T R^{-1} y_n + x_n^T C^T R^{-1} C x_n ].

These are scalar quantities, and the two cross terms are the same: one is the transpose of the other, and the transpose of a scalar is itself. So the C-dependent part is

    -1/2 [ -2 y_n^T R^{-1} C x_n + x_n^T C^T R^{-1} C x_n ],

summed over n, and there are no more C's anywhere else; that's the only place where C appears. Now I find the derivative of this with respect to the matrix C. For the first term I use the linear rule, d(a^T X b)/dX = a b^T, with a^T = y_n^T R^{-1} and b = x_n, so the two cancels the one half and I get R^{-1} y_n x_n^T; that looks like a matrix of the right shape. The second term is quadratic in C, so I use the other rule; careful, the x and C in that rule are placeholders, don't confuse them with our x and C here. With a = b = x_n and the middle matrix R^{-1}, which is symmetric, I get two copies of R^{-1} C x_n x_n^T, and again the two cancels the one half. (Let me check my notes to see if I got that right... I actually did.) Setting the derivative to zero, and putting back the sums that I forgot in front, gives

    Σ_n R^{-1} C x_n x_n^T = Σ_n R^{-1} y_n x_n^T,

so, putting a hat on C to mark it as my estimate,

    Ĉ = ( Σ_{n=1..N} y_n x_n^T ) ( Σ_{n=1..N} x_n x_n^T )^{-1}.

Now, what we actually do is replace these x's with our expected values for them. The first sum becomes Σ y_n x̂_n^T, which I can compute. The second sum involves the expected value of x_n x_n^T; how can I compute that quantity? Remember the equation for variance: var(x) = E[x x^T] - E[x] E[x]^T. So E[x_n x_n^T] = P_n + x̂_n x̂_n^T, the uncertainty of the state on trial n plus the outer product of the state estimate. (Gotta catch those stupid mistakes, otherwise we'd be going in circles.) So:

    Ĉ = ( Σ_n y_n x̂_n^T ) ( Σ_n (P_n + x̂_n x̂_n^T) )^{-1}.

That's my estimate of C if I know the x̂'s, which is what we're assuming.
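As code, the Ĉ update is just those two sums. A small sketch, assuming x_hat and P are the state means and covariances that came out of the E-step, aligned with the trials on which y was measured:

```python
import numpy as np

def update_C(y, x_hat, P):
    """M-step update: C_hat = (sum_n y_n x_hat_n^T) (sum_n (P_n + x_hat_n x_hat_n^T))^{-1}.
    y: (N, m) observations; x_hat: (N, k) state estimates; P: (N, k, k) state covariances."""
    S_yx = sum(np.outer(y[n], x_hat[n]) for n in range(len(y)))
    S_xx = sum(P[n] + np.outer(x_hat[n], x_hat[n]) for n in range(len(y)))
    # Solve C @ S_xx = S_yx rather than forming the inverse explicitly.
    return np.linalg.solve(S_xx.T, S_yx.T).T
```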
So we begin with some estimate of A, B, C and all those things; we use the Kalman filter to estimate the x̂'s and their uncertainties; and then we go back and re-estimate all the parameters using those state estimates, which I can now compute. Okay, yeah, go ahead. What's the expected value of C? So the C that maximizes the log likelihood is this expression; but I don't have x, x is a random variable. All I can do is say: okay, what's the expected value of that expression, and that depends on the expected value of x and on its variance. Really it's a trick that just says we're going to replace these x's, which are random variables, with their expected values, and this quantity here is like a variance.

And why does this equation make sense? Look at the measurement equation: multiply both sides by x_n^T, sum, and solve for C; that's effectively what this is. So when you look at the estimate you can see where it comes from: it comes from multiplying both sides of the equation by x_n^T. There is a relationship between that and this.

Let me say a word about A. In your book you have the updates for all the parameters, and there's no magic here, obviously: it's just these two derivative rules, applied to derivatives of scalar quantities. For A we're going to do basically the same thing: we multiply out the second quadratic, the one with Q^{-1} in it; we get some scalar terms that are linear in A and some that are quadratic in A, just like I've done here; we take the derivative and set it to zero; and the result again involves sums of terms like x_{n+1} x_n^T and x_n x_n^T, which we replace by their expected values. And you have similar equations for B, for R, for Q, and for the initial P_0 and x̂_0: you compute the derivative of the log likelihood with respect to each of them.

So notice the iterative process: with some estimate of A, B and all these things you compute the x̂'s; once you have the x̂'s you get these expected values; and then you recompute all the parameter values, and around you go. In machine learning EM is a very powerful tool, and in the case of this structural learning problem it's this iterative process: you begin with some estimate of θ; in the E-step you fix θ̂ and use the Kalman filter to find x̂; in the M-step you fix x̂ and you find a new estimate of the parameters.

Yes? What do we get for x̂_0? Let me think. When you maximize this, when you find the derivative with respect to x̂_0, you're going to get an x_0 in there... but what I'm less sure about is how x_0 itself comes out, since we don't have x_0 as a given. I'm kind of confused myself; let me think about it. I don't know; good question.
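For reference, here is what the analogous update for the state equation looks like, written the way it appears in standard treatments of EM for this model; the lecture defers the derivation to the book, so take this as a summary rather than the lecture's own derivation. Stacking the state and the known input into one vector z_n lets A and B come out of a single regression-like expression:

```latex
z_n = \begin{bmatrix} x_n \\ u_n \end{bmatrix}, \qquad
[\,\hat{A}\;\;\hat{B}\,] \;=\;
\Big(\sum_{n=0}^{N-1} \mathrm{E}\big[x_{n+1} z_n^{\mathsf T}\big]\Big)
\Big(\sum_{n=0}^{N-1} \mathrm{E}\big[z_n z_n^{\mathsf T}\big]\Big)^{-1}
```

with E[x_n x_n^T] = P_n + x̂_n x̂_n^T exactly as in the C update, E[x_{n+1} x_n^T] = P_{n+1,n} + x̂_{n+1} x̂_n^T, where P_{n+1,n} is the cross-covariance between successive states that the Kalman smoother also provides, and with the blocks involving u_n using u_n itself, since the input is known.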
I need to make a homework assignment for this so that everything gets clarified by actually doing it. Alright guys, good luck with this.