The topic of learning is a very rich and fascinating one in game theory, and we will only be able to get a taste of it here. We'll focus on two examples of learning in games: one method called fictitious play, and the other no-regret learning, in particular an algorithm called regret matching. But the topic really is vast, so let me say a few words before we look at the specific methods.

First of all, we should recognize that learning in game theory is fundamentally different from learning in other disciplines, for example as it's done in machine learning within AI and computer science, or in statistics, or in optimization. In those disciplines one usually has in mind a single actor acting in an environment. The environment is unknown to the agent; it may be stochastic, it may be partially observable, and so it can be very difficult to figure out what an optimal strategy is. But there is a well-defined notion of an optimal strategy, and the goal of learning is to learn something about the environment and how to act best in it. In game theory, the environment includes, or perhaps consists entirely of, other agents. So even as you're trying to learn and adapt, so are they. What ends up happening is that you can't really divorce the notion of learning from the notion of teaching, because as you adapt you're influencing the behavior of the other agents.

Informally, imagine two agents who repeatedly drive toward each other, playing the game of chicken, the kind of drag race played by reckless adolescents. What each of them wants, of course, is to zoom straight ahead and have the other swerve to the side of the road and give them the right of way. If they both do that they collide, and that's a bad outcome, so they test each other and over time dare more or less. Now imagine that driver one is an extremely good modeler of the other driver: whatever driver two's strategy is, driver one will learn it and best respond to it. It seems like you can't do any better than that. But imagine that driver two is a bully who doesn't really model the first driver at all and barrels straight ahead regardless of the circumstances, perhaps willing to take a few hits here and there to scare off the first driver. What happens is that the second driver, a terrible learner and a very bad modeler, keeps going straight ahead, and the first driver, the wonderful learner and best responder, learns to accommodate. The second driver was perhaps a bad learner, but a very good teacher. That's one thing to keep in mind when you think about learning in games.

The other is that learning is an overloaded term; there are many things you might learn in the context of games, and we'll be looking at a very particular slice. The setting will specifically be repeated games, and when we speak about learning in repeated games we'll really be speaking about strategies that, as play unfolds, draw interesting inferences or use the accumulated experience in an interesting way. That is the notion of learning that we, and in fact most of the literature in game theory, consider. So with that in mind, here are two examples.
The first is perhaps the granddaddy of learning regimes in game theory, called fictitious play. In fact it was not conceived initially, nor is it viewed today, as a realistic or effective learning method, but it does contain many of the elements that you see in more involved versions of learning. It was first presented as a heuristic for computing a Nash equilibrium in games. It turns out not to be a very effective procedure for that, but it is an interesting, basic learning procedure.

The way it works is simple: each agent starts with some belief about the strategy of the other agent; each agent best responds to that belief; each agent then updates its beliefs based on what it observed in this iteration of the game; and the game repeats. As I said, this is a very general regime, and in fact it is the general regime of model-based learning, where you maintain a model of the other agent, best respond to it, and update it over time. Fictitious play is the special case where the model is simply a count of the other agent's actions so far, and you take those counts, normalized, as their current mixed strategy. A little more formally, let w(a) be the number of times the opponent played action a in the past, with some non-zero initial values; you then assume the opponent plays a with probability proportional to w(a), that is, w(a) divided by the sum of the counts over all actions, and you best respond to that mixed strategy. A very straightforward, simple procedure. There is something a little paradoxical going on, because each agent, and we'll talk about two agents here, is always playing a pure strategy, and yet each models the other as playing a mixed strategy. Be that as it may. We should also note that you need to worry about edge cases such as tie-breaking: what happens when two of your actions are both best responses? You need some rule for that.

Here's an example of how it might work, in the context of matching pennies. In matching pennies, again, two players each choose heads or tails; if they both chose the same, player 1 wins, and if they chose differently, player 2 wins. Let's assume these are the initial counts the players have in mind: player 1's belief about player 2 is that player 2 has played heads with a count of 1.5 and tails with a count of 2, and player 2 has analogous beliefs about player 1, with heads outweighing tails. Now it's round one; what should they do? Player 1 wants to best respond to his beliefs. He believes player 2 is more likely to play tails, and he wants to match, so playing tails is the best response to the mixed strategy he ascribes to player 2. So he's going to play tails. What about player 2? Player 2 wants to mismatch, and since he believes that player 1 will play heads with greater probability than tails, he too is going to play tails, and the stage is over.

Let's move to the next stage. What happened? The players update their beliefs: player 1 observed player 2 playing tails, so he increases that count from 2 to 3, and player 2 likewise increases his count of player 1's tails. So what do they do now? Player 1 still wants to match player 2, and he still believes, in fact even more strongly, that player 2 will play tails with greater probability, so he's going to play tails again.
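To make the procedure concrete, here is a minimal sketch of fictitious play on matching pennies in Python. It is only illustrative: the payoff-matrix encoding, player 2's initial counts (the lecture only says heads outweighs tails, so the mirror-image values are an assumption), and the break-ties-toward-heads rule are assumptions, not part of the formal definition.

```python
import numpy as np

# Illustrative sketch of fictitious play on matching pennies.
# Actions: 0 = heads, 1 = tails. PAYOFF_P1[a1, a2] is player 1's payoff;
# player 1 (the matcher) wins on a match, player 2 wins on a mismatch.
PAYOFF_P1 = np.array([[1.0, -1.0],
                      [-1.0, 1.0]])

def best_response(payoff, opponent_counts):
    """Best respond to the mixed strategy implied by the opponent's action counts."""
    belief = opponent_counts / opponent_counts.sum()   # normalized counts = assumed mixed strategy
    expected = payoff @ belief                         # expected payoff of each of my actions
    return int(np.argmax(expected))                    # argmax breaks ties toward heads (index 0)

counts_about_p2 = np.array([1.5, 2.0])  # player 1's counts of player 2's (heads, tails)
counts_about_p1 = np.array([2.0, 1.5])  # player 2's counts of player 1's (heads, tails), assumed

plays_p1, plays_p2 = [], []
for t in range(10_000):
    a1 = best_response(PAYOFF_P1, counts_about_p2)     # player 1 tries to match
    a2 = best_response(-PAYOFF_P1.T, counts_about_p1)  # player 2 tries to mismatch
    counts_about_p2[a2] += 1                           # each player updates its model of the other
    counts_about_p1[a1] += 1
    plays_p1.append(a1)
    plays_p2.append(a2)

print("player 1 empirical (heads, tails):", np.bincount(plays_p1, minlength=2) / len(plays_p1))
print("player 2 empirical (heads, tails):", np.bincount(plays_p2, minlength=2) / len(plays_p2))
```

Running this, the individual plays keep cycling between heads and tails, but both printed frequencies settle near 0.5, which is exactly the phenomenon described next.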
Player 2, on the other hand, now believes that player 1 will play tails with greater probability; player 2 wants to mismatch, and so player 2 will now play heads. You can continue this calculation and persuade yourself that the play proceeds in this fashion; this is how fictitious play unfolds.

Notice something interesting: the strategies themselves don't converge. If you were to continue playing this out, you would see the heads and tails on both sides ebb and flow, but a certain balance emerges over time. In fact, in this game the long-run averages converge: each agent plays heads and tails with equal frequency, 0.5 each. We call these the empirical frequencies of play. Now notice that (0.5, 0.5) is also the unique Nash equilibrium of matching pennies, and the question is, is this an accident? The answer is no. Here is a theorem: if the empirical frequencies of the players' play converge in fictitious play, then they converge to a Nash equilibrium of the game. They may not converge in general, which is why it's not an effective learning procedure in general, but there are a host of conditions under which, even if the play doesn't converge, the empirical frequencies do. Here are some conditions that are sufficient: the game is zero-sum; or it is solvable by iterated elimination of strictly dominated strategies; or it is what's called a potential game, which we won't define here; or it is a 2×n game, in other words one player has only two strategies and the other may have more, with what are called generic payoffs, which we also won't define here. The main thing to take away is that there are sufficient conditions guaranteeing that the empirical frequencies of play in fictitious play converge, even if the play itself does not.

Let us now switch to a very different form of learning, a whole class of methods called no-regret learning. It is different in a fundamental way. First of all, the methods themselves are not model-based: they do not explicitly model what the other agent is doing, but rather adapt the agent's own strategy directly. That's one difference. Perhaps more fundamentally, in this case we don't start with a learning method; we start with a criterion that we want the method to satisfy, namely the no-regret criterion. What does it say? The regret of an agent at time t for not having played some pure strategy s is the difference R^t(s) = α^t(s) − α^t, where α^t is the payoff the agent actually received at time t and α^t(s) is the payoff it would have received had it played s. That's a natural enough notion. We now say that a learning rule exhibits no regret if, in the limit, the agent has no regrets: for any pure strategy s, the regret goes to zero (or below) with probability 1. Rules with this property are called no-regret learning rules.

Here is one such rule, which is surprisingly simple, called regret matching. It works as follows: look at the regret you have experienced so far for each of your pure strategies, and pick each pure strategy with probability proportional to its regret. That is, with the regret at time t for strategy s written R^t(s), the probability of playing s at the next time step is

σ^{t+1}(s) = R^t(s) / Σ_{s'} R^t(s'),

that is, s's regret relative to the sum of the regrets across all pure strategies.

A very simple rule, and it has surprisingly strong properties. First of all, it provably exhibits no regret. Furthermore, when the players use regret matching, the empirical distribution of play converges to a correlated equilibrium, at least for finite games that are repeated.

So those are two examples of learning rules: one model-based, namely fictitious play, and one model-free, regret matching, which belongs to the family of methods that exhibit no regret. As we said at the beginning, the topic of learning in games is a very rich one, but at least we have a taste for it now.
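Here is a minimal sketch, in the same spirit as before, of regret matching in self-play on matching pennies. The proportionality rule above leaves a couple of details open; the sketch assumes, following the standard construction, that negative cumulative regrets are clipped at zero and that the player mixes uniformly when no regret is positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Actions: 0 = heads, 1 = tails. Player 1 wins on a match, player 2 on a mismatch.
PAYOFF_P1 = np.array([[1.0, -1.0],
                      [-1.0, 1.0]])
PAYOFF_P2 = -PAYOFF_P1                      # zero-sum game

def mixed_strategy(cumulative_regret):
    """Play each pure strategy in proportion to its positive cumulative regret."""
    positive = np.maximum(cumulative_regret, 0.0)   # assumption: clip negative regrets at zero
    if positive.sum() > 0.0:
        return positive / positive.sum()
    return np.full(len(cumulative_regret), 1.0 / len(cumulative_regret))  # assumption: uniform fallback

regret1 = np.zeros(2)                       # cumulative regrets R^t(s), one entry per pure strategy
regret2 = np.zeros(2)
joint_counts = np.zeros((2, 2))             # empirical joint distribution of play

for t in range(100_000):
    a1 = rng.choice(2, p=mixed_strategy(regret1))
    a2 = rng.choice(2, p=mixed_strategy(regret2))
    joint_counts[a1, a2] += 1
    # Regret update: what each pure strategy would have earned minus what was actually earned.
    regret1 += PAYOFF_P1[:, a2] - PAYOFF_P1[a1, a2]
    regret2 += PAYOFF_P2[a1, :] - PAYOFF_P2[a1, a2]

print(joint_counts / joint_counts.sum())    # tends toward the uniform joint distribution
```

The empirical joint distribution printed at the end approaches the uniform distribution over the four outcomes, which is matching pennies' unique correlated equilibrium, foreshadowing the property stated next.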