Finally, the last speaker of this workshop is Jungyoon Jo, who is also one of the organizers. He will talk about mirror descent of the Hopfield model. Welcome. So you have already learned everything, and I will just give a short talk. What is also nice is that the important ingredients of this talk are things you learned last week, so I feel very comfortable, okay? So I will talk about gradient descent. You learned this proximal view of gradient descent last week. We have a loss function, an objective function, and gradient descent updates the parameters along the gradient of this loss function. Why this form? In fact, this is a first-order method: here we take the first-order Taylor expansion of the loss function around theta t, and this proximity term keeps theta t plus one minus theta t from becoming too large. If you optimize this with respect to theta t plus one, you recover the usual gradient descent update. But you can go further and keep the next term of the Taylor expansion. You see it here: the Hessian matrix. This is the Hessian matrix of the loss function. If you include this term, the technique is a second-order method, because now the loss function has the Taylor expansion up to second order. So if you optimize this with respect to theta t plus one, then you get this equation, okay? Again, calculating this is computationally very expensive. So smart people developed quasi-Newton methods to overcome this; BFGS is one famous one. And recently people actually use the natural gradient in deep neural networks; in that setting they use the Kronecker-factored approximate curvature, the K-FAC method, so you can actually apply this to neural networks using these numerical techniques, okay? Today I will introduce another very interesting algorithm called mirror descent. As you see here, all of these things come from convex optimization. That field is developed very, very well, and in that community they use mirror descent. Let me explain what mirror descent is. So you see again this proximal gradient, right? Previously we used this form, but here you can see theta and theta t: we consider the Euclidean distance between theta and theta t, right?
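For reference, the proximal-gradient equations described above can be written out; this is a standard reconstruction with learning rate \(\eta\), not copied from the slides:

```latex
\theta_{t+1}
  = \arg\min_{\theta}\Big[\,\mathcal{L}(\theta_t)
    + \nabla\mathcal{L}(\theta_t)^{\top}(\theta-\theta_t)
    + \tfrac{1}{2\eta}\,\lVert\theta-\theta_t\rVert^{2}\Big]
  \quad\Longrightarrow\quad
  \theta_{t+1} = \theta_t - \eta\,\nabla\mathcal{L}(\theta_t).
```

Replacing the quadratic proximity term with the second-order Taylor term \(\tfrac{1}{2}(\theta-\theta_t)^{\top}H(\theta_t)(\theta-\theta_t)\), where \(H\) is the Hessian of the loss, instead gives the Newton update \(\theta_{t+1} = \theta_t - H(\theta_t)^{-1}\nabla\mathcal{L}(\theta_t)\), the expensive second-order method mentioned above.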
But it is not necessary to use the Euclidean distance, so let's consider a more general distance. This. As a candidate, we can think about the Bregman divergence; yesterday Professor Coppola introduced the Bregman divergence to us. So suppose you have some convex function f, which is a function of theta. Then we can define the distance between the parameter theta and theta t this way. Let me explain geometrically what this means. We have a convex function f of theta, right? We have a point theta and also the current point theta t here. This Bregman divergence actually represents the gap between the function value at theta and the Taylor expansion expanded from the point theta t, the first-order Taylor expansion. So the distance between f of theta and its first-order Taylor expansion, that is the Bregman divergence. Because f is a convex function, this Bregman divergence becomes zero only if theta and theta t coincide, right? Otherwise it increases. So instead of the Euclidean distance, we will plug this Bregman divergence in here, okay? Then this is the loss function, or rather its first-order Taylor expansion, and this is the proximity term with the Bregman divergence. Now we want to optimize this with respect to theta, right? You just differentiate this equation with respect to theta, and here theta t is just a constant. Then you get this equation, okay? Then I will use another variable, mu. You learned this, right? This is the conjugate variable with respect to theta. f is a convex function, so at every point theta it has a different slope, right? So instead of the position theta, I will use the slope as a variable; that is this relation. We have a convex function f which is a function of theta, but theta is not a good variable, it is abstract. We will see that mu can be a more convenient variable. So I will change my variable from theta to mu. How to do this?
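Before moving to the change of variables, the Bregman divergence just defined can be checked with a minimal numerical sketch; the quadratic f here is an illustrative choice, not one used in the talk:

```python
# Bregman divergence: D_f(theta, theta_t) = f(theta) - f(theta_t)
#                                           - f'(theta_t) * (theta - theta_t),
# i.e. the gap between f and its first-order Taylor expansion at theta_t.
def bregman(f, df, theta, theta_t):
    return f(theta) - f(theta_t) - df(theta_t) * (theta - theta_t)

f  = lambda x: x ** 2        # an illustrative convex function
df = lambda x: 2.0 * x       # its derivative

print(bregman(f, df, 3.0, 1.0))   # 4.0: for f(x)=x^2 this is (theta-theta_t)^2
print(bregman(f, df, 1.0, 1.0))   # 0.0: vanishes only when the points coincide
```

For this f the Bregman divergence reduces to the squared Euclidean distance, so the Euclidean proximity term is recovered as a special case.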
Everyone knows how to do this, right? Using this Legendre transformation. And because of the duality between f and g, you can easily see that theta is the derivative of g, okay? So using this new variable mu, I can change this equation into this form, right? This part, the derivative of f, is mu t plus one, and this one is mu t. If I rearrange this, then you get this equation. This is a very interesting equation, actually. Let me check: this looks like gradient descent, but it is not gradient descent. This is an update of mu, but the gradient is not with respect to mu; it is with respect to theta, right? So you should be careful about this point. But let me reinterpret this gradient using the chain rule. You can apply the chain rule here. Our theta is this one, right? If you plug this theta in here, then you get this one. So in terms of mu, this gradient is actually the natural gradient, which takes the curvature into account, right? That is the essence of mirror descent: you update using gradient descent, but implicitly you are applying the natural gradient. That is the key essence of mirror descent. So let me summarize how you update your parameters using mirror descent. It is like this. Usually when we use gradient descent, we just stay in this primal space and update from theta t to theta t plus one. But in mirror descent, first you transform this theta into mu using this transformation rule. Once you are in the dual space, you update using gradient descent, but keep in mind that this gradient is with respect to theta, not mu. So you are doing gradient descent, but implicitly you are actually doing natural gradient, which accounts for the curvature. Computationally this update is cheap, right? Because you are using gradient descent. Once you have updated your mu, then we have to go back to our primal space, the theta space, right? For this inverse transformation you can use this mapping, right? So this is the mirror descent algorithm.
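The three-step loop just summarized (primal to dual, gradient step in the dual space, map back) can be sketched as follows; the choice f(theta) = exp(theta) and the toy loss are assumptions for illustration only, not from the talk:

```python
import numpy as np

def loss_grad(theta):              # gradient of a toy loss 0.5*(theta - 2)^2
    return theta - 2.0

theta, eta = 0.0, 0.5
for _ in range(500):
    mu = np.exp(theta)             # primal -> dual:  mu = f'(theta)
    mu -= eta * loss_grad(theta)   # gradient step in the dual space
                                   # (note: gradient w.r.t. theta, not mu)
    theta = np.log(mu)             # dual -> primal:  theta = g'(mu)

print(round(theta, 4))             # 2.0, the minimizer of the toy loss
```

Each iteration costs only a gradient evaluation and the two mirror maps, yet the nonlinearity of the maps is what injects the implicit curvature information.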
So this looks very fancy, and I wanted to apply it to updating my neural network. We spent quite a lot of time asking which model is good for applying this algorithm, because mirror descent has been studied for a long time in convex optimization, but not much in machine learning. So we wanted to find the right model. Finally, the Hopfield model came up; I think that is the best model to test this mirror descent. So let me remind you about the Hopfield model. Suppose you have data in multiple dimensions, and we have the frequencies of these samples; that is the empirical data distribution, p hat of x. So we have this empirical data distribution. Our goal is to model this distribution using our Hopfield model. The Hopfield model is this model, okay? I think you know the Hopfield model. You can think of it like this: if you have a sample x, you have this network, each element has a bias, and between elements there is an interaction. So this is an energy-based model. Given a sample, we can define the energy in this parenthesis, like this. And this normalization factor is the partition function, as you know. We want to mimic this empirical distribution using this model distribution, right? So to represent this empirical distribution, what are the best parameters b and W? That is the question of learning, right? But Hopfield gave us a very good solution: b i is nothing but the expectation value of x i under the empirical distribution, and W jk is the correlation between x j and x k under the empirical distribution. That is called the Hopfield solution. You can intuitively understand that this can be a good solution. Look at this: for the observed samples, their probability should be high, right? If a configuration is consistent with this average behavior of the bias and correlation, then you get a high value in this parenthesis (not a high energy, sorry), and then automatically you have a higher probability. In this sense, this should be a good solution, right?
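The Hopfield solution just stated, biases from means and weights from pairwise correlations, can be computed directly from samples; the random binary data here is only a stand-in for a real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(1000, 5))  # 1000 binary samples, 5 units

b = X.mean(axis=0)             # b_i  = <x_i>     under the empirical p_hat
W = (X.T @ X) / len(X)         # W_jk = <x_j x_k> under the empirical p_hat
np.fill_diagonal(W, 0.0)       # no self-interaction of a unit with itself

print(b.shape, W.shape)        # (5,) (5, 5); W is symmetric by construction
```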
So this energy model looks a bit complex. To make it simpler, I introduce another parameter variable, theta, which includes not only the bias but also the weight interaction. And I also introduce the observable, which includes x and also the correlations between the x's, right? Then you can simplify the model to just this one. I use the Einstein summation convention here, and this is the partition function. The Hopfield solution is summarized this way; again, this is an expectation with respect to the empirical distribution, okay? So this Hopfield solution is actually very good at representing the empirical distribution. But you can do better by using maximum likelihood estimation. Again, you can define the discrepancy between the empirical distribution and your model distribution. If you optimize this with respect to theta, then you force your model to be close to the empirical distribution, right? Then you can do better than the Hopfield solution. That is called a Boltzmann machine. And you already know the best measure of the distance between the empirical and the model distribution: it is the Kullback-Leibler divergence. You already know this, right? The distance between two distributions. To apply the gradient descent method, you need the gradient of this loss function with respect to theta, right? Because this Hopfield model is an exponential family model, it is very easy to compute this gradient. The result is this. I'm sorry, it's too dark. But as you saw in the previous lectures, the gradient is nothing but the difference of expectations between the model distribution and the empirical distribution. So you can easily calculate it. So let me remind you of the Hopfield model. This is the partition function of the Hopfield model, and this is the free energy in physics; in mathematics it is the cumulant generating function, as you learned. Again, in this exponential family model, if you differentiate this free energy with respect to theta, you get the cumulants, right?
If you differentiate once, you get the first cumulant, the expectation value. And again, via the Legendre transformation, your theta can be represented in terms of g. So you are ready to apply this mirror descent to this Boltzmann machine model. Again, there is the primal space and the mirror space, or dual space. The first step is going from the primal space to the dual space, and this is done by this mapping. In this exponential family model, this is nothing but calculating an expectation value, so this transformation is easy; you can do it. Once you are in the dual space at mu t, you can update using mirror descent, because you already know this gradient of the loss, right? Then you have mu t plus one. So far, so good. But there is no free lunch. The difficulty is how you come back from mu t plus one to theta t plus one. That is the difficult part. But here we can again use the Taylor expansion. Actually, we do not need to know the overall landscape of g; we only need to know the local landscape of g near mu t, right? That is why we can use the Taylor expansion of g. This is the Taylor expansion up to second order. And we need this differentiation with respect to mu, so you differentiate this, and you get this. And what is this? You know that this is theta t plus one, right? And this one is theta t. And this one is a curvature; I will show you later what it means. Actually, it is the covariance matrix of your data. And this one is this one. So now you have this mu t plus one minus mu t in this calculation; you know it from the previous step. So you can get this one. So you can actually use this mirror descent in a Boltzmann machine. Here are two important points. One is this: usually when you use gradient descent, you start from a random theta, right? A random theta. But in mirror descent, look at this: mu t is nothing but an expectation value, right? Therefore, if you use mirror descent, you have a very good starting point, mu zero.
Based on the empirical distribution p hat. I think that is the important advantage of using mirror descent in this exponential family model. It means you do not need to start from random initial parameters; you can start from very good initial parameters based on your data. That is our point. The other, disappointing part is this: the gradient descent step implicitly implies the natural gradient, right? So you do not need to calculate the curvature explicitly in that step. However, for this inverse map, now I have to calculate this one, so I lose that advantage. But I emphasize that mirror descent provides this good initial parameterization. So let's calculate this curvature. dg over d mu is theta, right? You plug it in here; that is just the inverse of this one. And if you recall that mu is df over d theta, then the product is just one, right? So let's explicitly calculate this partial derivative. mu j is the expectation of the jth observable, right? If you differentiate it with respect to theta i, you get these two terms, and that is the covariance matrix. So let's think about this: this is the mirror descent step, and this is our inverse map. If you combine these two, you actually have the natural gradient. So mirror descent is actually the same as natural gradient, but our advantage is that we know where we should start. In this sense, mirror descent is nothing but natural gradient plus good parameter initialization, okay? So we tested this. We synthesized data: we chose the bias and weights as samples from a Gaussian distribution. We hid these values b and W, which are the true theta, right? We hid them. Based on this theta, we generated samples of x. Now I give you this x, and the task is to infer the true theta. That is our task. We did this using the mirror descent algorithm. Now you will see how the loss decreases with iterations, comparing gradient descent, natural gradient descent, and mirror descent.
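A toy reconstruction of this experiment for a tiny, fully visible model (3 binary units, so all expectations can be enumerated exactly; all names, sizes, and the learning rate are illustrative assumptions, not the authors' code):

```python
import numpy as np
from itertools import product

states = np.array(list(product([-1.0, 1.0], repeat=3)))  # all 8 configurations

def phi(x):                     # observables: each x_i and each pair x_j * x_k
    pairs = [x[j] * x[k] for j in range(3) for k in range(j + 1, 3)]
    return np.concatenate([x, pairs])

Phi = np.array([phi(x) for x in states])                 # shape (8, 6)

def model_stats(theta):
    """Model expectations mu = E[phi] and covariance C under p_theta."""
    logits = Phi @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()                                         # Boltzmann distribution
    mu = p @ Phi
    C = (Phi * p[:, None]).T @ Phi - np.outer(mu, mu)    # curvature matrix
    return mu, C

rng = np.random.default_rng(1)
theta_true = rng.normal(scale=0.5, size=6)   # hidden "true" biases and weights
mu_data, _ = model_stats(theta_true)         # exact data expectations <phi>

theta, eta = mu_data.copy(), 0.5             # Hopfield-solution initialization:
                                             # parameters set to data statistics
for _ in range(200):
    mu_t, C = model_stats(theta)
    grad = mu_t - mu_data                    # gradient of the KL loss w.r.t. theta
    mu_next = mu_t - eta * grad              # mirror descent step in the dual space
    # local inverse map: theta_{t+1} = theta_t + C^{-1} (mu_{t+1} - mu_t)
    theta = theta + np.linalg.solve(C + 1e-9 * np.eye(6), mu_next - mu_t)

print(np.allclose(theta, theta_true, atol=1e-4))         # True: theta recovered
```

Combining the dual-space step with the covariance-based inverse map makes each update exactly a natural-gradient step, which is the equivalence the talk points out.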
You will see how the loss changes with iterations, and also the final inference: the true theta and the inferred theta, and how well they coincide, compared for these four cases, okay? First, the loss change with iterations. As you see here, the red one is mirror descent, in other words natural gradient plus Hopfield initialization; that is this red line. Then, why not apply this initialization idea to plain gradient descent? Gradient descent with Hopfield-solution initialization is this blue line, okay? This gray line is gradient descent with random theta; we tried several random initializations. And the green line is natural gradient, but with random initialization. As you see here, mirror descent is the most effective. If you look at the inference results, the Hopfield solution alone is not so good. The Boltzmann machine, that is, maximum likelihood estimation, is this. This is natural gradient with random initialization, and this one is mirror descent: natural gradient with Hopfield-solution initialization. So the idea works; we checked this. Okay, last slide, I will summarize. Mirror descent is this: in gradient descent, you usually update in the primal space from theta t to theta t plus one. But the mirror descent algorithm uses the dual space of this theta. So you first transform from theta t to mu t, but if we use an exponential family model, this transformation is not difficult, because it is just the calculation of an expectation value. Here, you just do gradient descent, but again, it implicitly implies the natural gradient with respect to mu, okay? Then you need to go back to your primal space, and we use this linear inverse mapping. We applied this to a Boltzmann machine and confirmed that good parameter initialization is very effective. Okay, this is what we have found so far. But we want to generalize this to broader models, including hidden variables.
For example, a restricted Boltzmann machine has hidden variables. In that case, for the Hopfield-solution initialization you calculate the initial mu zero based on the data, right? But if you have hidden variables, we have no way to fix the components of mu corresponding to the hidden variables. That is one problem. Another point is that good parameter initialization based on data is a very important subject in machine learning, so we want to go in that direction. Okay, that's it. Thank you very much.

So, questions or comments?

It really seems to me that the thing you use here is convexity, and the exponential family only comes in as a way of identifying starting parameters in terms of the derivative, right?

Yes.

So that's really what you want. Do you think, for non-exponential-family models, you would just get a better starting point as long as you have a convex optimization problem?

Yes, I think so far our idea only applies to convex models.

No, that's fine. I'm just wondering how you would extend this to non-exponential distributions. Somehow you would want to connect the non-exponential free energy to a convex function as well. You need convexity; otherwise that matrix is not necessarily invertible, right?

Yes. Any other questions?

Thank you very much. In the last slide you were comparing different algorithms as a function of the iteration step. How about the computational time? Mirror descent looks really nice, so what do we have to pay for it in the real world? I want to know that kind of thing.

You are very sharp. Because this is a convex problem, if you iterate more and more, in the end every algorithm reaches the global minimum, but our algorithm gets there fast. However, computing the curvature requires a lot of computational cost, so the real computation time is, I think, more or less similar. In that respect we do not have so much of an advantage.

Right. Any more questions? Then let's thank the speaker again.
Let me conclude our workshop. I hope you enjoyed the lectures last week and the workshop this week; I hope it inspired you, that you learned many things, and that all of this helps your future study. This school was possible thanks to the generous financial support from ICTP and also KIAS, and I would like to thank our staff, Miss Liu and Miss Lee. I hope you return home safely. My last announcement: please have your lunch at the cafeteria; this will be the last time. Thank you, everyone.