Okay. So yeah, I'm Hardeep. I basically run the AI and Advanced Analytics Center of Excellence at Accenture. Today I'm going to share with you what I've learned about AlphaGo Zero. I happened to read a nice blog about it that went into some of the internals of how AlphaGo Zero works. It's one of the poster children of AI now, and there was some code along with the blog, so I'm just going to share my understanding of it. That's the disclaimer: I'm not a reinforcement learning expert, so I'm only sharing what I've learned. If you have questions I'll try to take them, but if I can't answer, we'll take it offline.

So before I go into the hardcore stuff, I'll just play a short clip: "This was the moment a computer called AlphaGo beat a master of the ancient Chinese game Go. It's not the first time a grandmaster has been beaten by a machine. But what makes AlphaGo different is that it's the first demonstration that machines can truly learn and think in a human-like way. AlphaGo's victory was a shock to experts in the artificial intelligence community; many thought such an event was at least a decade away. So, a few questions. Why is this important, and what's all the fuss? AlphaGo shows that machines can really learn. How so? Well, instead of using brute force to calculate all the moves it can make, like previous AIs, AlphaGo used reinforcement learning and neural networks to mimic the learning process of a human brain. Keep in mind that Go has vastly more possible moves than chess, more board positions than there are atoms in the universe, so there's no way of simply calculating every possible move on the board. That's practically impossible. For this reason, Go is the holy grail of AI, and learning such a task from scratch is a huge feat." I'll stop the video there.

So that was just an introduction to AlphaGo, to set the stage. AlphaGo was back in 2015 and 2016. Where are we now? In 2017, they released a much more advanced version called AlphaGo Zero, which is able to learn from scratch. AlphaGo, the version that defeated Lee Sedol, was a mixture of supervised learning and reinforcement learning: they started with a large database of games by strong human players, trained the initial network on that, and then used reinforcement learning to refine it. With AlphaGo Zero, they started from a blank slate. The machine learns entirely from random weights, and within a couple of days it surpasses the earlier versions. I'll show you a visual for that now.

When Google trained AlphaGo Zero, within three days it was already able to beat the version that had beaten the grandmaster Lee Sedol. Then, over a period of around 21 days, it beat a version called AlphaGo Master, which had played online against many other top players and champions. In 40 days, it became essentially unbeatable and was the best Go program at that point. All of this was done with far less compute and a much more intelligent algorithm. If you look at AlphaGo Fan, the earliest version, which played Fan Hui, it used a large number of GPUs and had a lot of power consumption. AlphaGo Lee moved to TPUs and needed considerably less. But AlphaGo Master and AlphaGo Zero, the latest versions, run on just a handful of TPUs in comparison. All of this is possible because of the algorithm; they did a lot of optimizations, and we are going to see some of them.
On the right-hand side, you see an Elo rating chart, which essentially shows how strong each version is compared to the others. AlphaGo Zero is on the far right, and it's the strongest and most efficient version. Another interesting fact: when it started learning from scratch, within three hours it could play the basic moves. By 19 hours it had learned the fundamentals and started pulling ahead. By 70 hours it had reached superhuman level and could really play like a grandmaster.

Now, the algorithm itself is conceptually pretty simple. It's essentially similar to how a human being thinks about a game. You mentally play the game, you build a tree out of the possible moves, and when you reach a state you evaluate it and propagate that evaluation back through the tree. After you finish thinking about future possibilities, you take the action that looks most promising at that stage, and at the end you see where you misjudged and update the network.

So now I'm going to go into the details of the algorithm. It has a few parts. We start with the game state. Go is played on a 19 by 19 board, so they constructed the input as planes of zeros and ones telling you which points are occupied by black or white stones. For black, they have the current position and the seven previous positions; similarly for white, the current position and the seven previous positions; and the last plane says whose turn it is, black or white. That stack of planes is the input they give to the network.

Then there is the neural network. It has a lot of layers: a basic convolutional layer, around 40 residual layers, and it's a double-headed network, so it has a value head and a policy head. We'll look at each of these now. The convolutional layer is a normal convolutional block: you take the input, apply 3 by 3 convolutions, 256 of them, then batch normalization, which essentially keeps the activations within a reasonable range, then a ReLU activation, and that's about it. In the residual layer, they combine two convolutional blocks, but around the second one there is a skip connection. That comes from the ResNet architecture that Microsoft Research published in 2015. They used 40 such layers, and then finally there are the two heads.

The value head is the part of the network that predicts, from this state, how likely you are to win the game. The policy head covers all the moves and tells you which move you should most likely take: it gives you a probability map over the board saying which moves are available and how likely each of them should be. So the policy head talks about the moves, and the value head talks about the probability of winning the game.

Apart from this network, the third piece is the Monte Carlo tree search. This is one of the key parts of the algorithm. Before I go into it, look at the four variables they track on each edge of the tree: N is the number of times this move has been taken in the search, W is the cumulative value accumulated through that move, Q is the mean value (W divided by N), and P is the prior probability of the move coming from the policy head.
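As a rough Python sketch of these four statistics and of the selection and backup steps I'm about to describe, something like the following. The class, the function names, and the exploration constant here are illustrative assumptions of mine, not the actual code from the paper or the repo:

```python
import math

C_PUCT = 1.0  # exploration constant; a tunable hyperparameter, value here is illustrative


class Edge:
    """Statistics kept on each move (edge) in the search tree."""
    def __init__(self, prior):
        self.N = 0        # number of times this move has been taken in simulations
        self.W = 0.0      # cumulative value accumulated through this move
        self.Q = 0.0      # mean value, W / N
        self.P = prior    # prior probability from the policy head


def select(edges):
    # at a node, pick the move maximising Q + U, where U rewards moves
    # with a high prior that have not been explored much yet
    total_visits = sum(e.N for e in edges.values())

    def score(edge):
        u = C_PUCT * edge.P * math.sqrt(total_visits) / (1 + edge.N)
        return edge.Q + u

    return max(edges, key=lambda move: score(edges[move]))


def backup(path, value):
    # after the value head evaluates the leaf, push the result back up
    # every edge on the path, flipping the sign between the two players
    for edge in reversed(path):
        edge.N += 1
        edge.W += value
        edge.Q = edge.W / edge.N
        value = -value
```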
So here is how the tree search uses those statistics. At each stage of the game, it runs simulations using the same network we just saw. The network has the two heads, so for any position it gives you the move probabilities and a value, and the tree search uses those to decide which branches to explore. For the AlphaGo Zero version, it runs around 1,600 simulations per move. At each node of a simulation it selects the move that maximizes Q plus U, where Q is the mean value of that move so far and U is an exploration term built from the prior probability and the number of times that move has already been explored, so moves the policy head likes but that haven't been visited much get a boost. It keeps descending like this until it reaches a leaf node, evaluates the leaf with the network to get its move probabilities and its value (whether it is winning or losing from there), and then backpropagates that value up through every edge on the path, updating all of those statistics. After the simulations it has a value for each of the paths it explored, and it has to select a move. If it is playing deterministically, which means it's playing in challenger mode, it will pick the best move; if it's still playing in exploratory mode, it might pick a move that is not the best so that it explores other options as well. So that is the Monte Carlo tree search part.

Now, how do these pieces come together? The first phase is self-play. Because this is reinforcement learning, it's mostly about generating your own data, and it needs a lot of data, and the network is going to learn from those self-play games. So the network plays 25,000 games against itself. At each step it uses the game state, and when it takes the next move it does not take what the raw network predicts but what the tree search predicts. When the game finishes, the win or loss is attached to every position from that game, and all of these played games are stored in a memory. In reinforcement learning you build up this memory of games, you sample from it, and you use the samples to retrain your network.

The second phase is retraining the network: you have a memory of 500,000 games, and you sample a set of positions out of it. Why sample rather than just train on the last N games? Because if your play is tending toward a not-so-optimal path and you keep training only on that path, you might end up converging to a suboptimal solution in the end. It's always better to keep a memory of games and select randomly from it.

The loss function we minimize has two parts. The network gives you a prediction of the move probabilities, but the tree also gives you probabilities based on actually exploring the paths, and you want those two to be as close as possible, so there is a cross-entropy term that pulls them together. Plus you have the value output telling you how likely you are to win, and the actual result of the game, and you minimize a mean squared error between those. Roughly, the loss is (z minus v) squared, minus pi transpose log p, plus a regularization term, where z is the game outcome, v the predicted value, pi the tree-search move probabilities, and p the network's move probabilities.
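To make that two-part loss concrete, here is a minimal Keras-style sketch of the retraining step. It assumes a model with two outputs named "policy" and "value" and a memory holding (state, tree-search probabilities, outcome) tuples; those names and the optimizer choice are my own illustrative assumptions, and the repo's actual replay code differs in its details:

```python
import random
import numpy as np


def retrain(model, memory, batch_size=256, training_loops=10):
    """Sketch of the retraining ('replay') phase.

    Assumes `memory` is a list of (state, tree_probs, outcome) tuples collected
    during self-play, and `model` is a Keras model whose outputs are named
    'policy' and 'value' (hypothetical names for this sketch).
    """
    # one loss per head: cross-entropy pulls the policy head towards the
    # tree-search probabilities, MSE pulls the value head towards the result
    model.compile(optimizer="sgd",
                  loss={"policy": "categorical_crossentropy",
                        "value": "mean_squared_error"})

    for _ in range(training_loops):
        # sample from the whole memory instead of only the latest games,
        # so training does not chase one suboptimal line of play
        batch = random.sample(memory, min(batch_size, len(memory)))
        states = np.array([s for s, _, _ in batch])
        tree_probs = np.array([p for _, p, _ in batch])
        outcomes = np.array([z for _, _, z in batch])
        model.fit(states,
                  {"policy": tree_probs, "value": outcomes},
                  epochs=1, verbose=0)
```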
After retraining the network like this for some number of training loops, say a thousand of them, you evaluate it. Whatever network you have now, you play it against the best network you have so far, and you keep repeating this process. If the new network wins more than 55 percent of the time, it becomes the best network, and the cycle continues. So this is how they trained AlphaGo Zero.

Now, the code I'm going to show you is for Connect4. The same algorithm can be used to play Connect4, a game where you have to get four of your pieces in a row, horizontally, vertically, or diagonally. It's by no means a trivial game; it also has a lot of combinations, but you can use the same algorithm to train it to play this game. So I'm going to switch to the code.

The code is available on GitHub, and there are some key files I'm going to go through. There is a configuration file, where we'll look at some settings. There is a game-rules file, where you define your game and its rules, so you can plug in your own game with its own set of rules. The IPython notebook is where the self-play, retrain, and evaluate loop lives. The functions file just holds the helper functions used by the main loop. The agent file is the main class defining the agents that play and take actions, so all the key code for taking an action, choosing an action, doing the Monte Carlo tree search, and running simulations is in the agent class. The helper functions for the Monte Carlo simulations are in the MCTS class. Then there is a class for the model, written in Keras, which implements the architecture I described. There is a class for the memory, which is just the data structures to store and access the played games. And there is a main file, which is a command-line interface you can use if you don't want the IPython notebook, plus some ancillary classes around it.

So let's look at some code now. The files I talked about are all available in this GitHub repo, and I'll go through them in that order. This is the configuration part. Episodes is the number of self-play games you run per iteration; for a game as simple as Connect4 you don't need that many, so 30 per iteration is enough. You saw earlier that AlphaGo Zero runs 1,600 simulations per move; here we only do 50, because the bigger you make the problem, or the more simulations you run, the longer it takes to train. Google has access to racks of TPUs and can afford that much computation, but on a single GPU like a 1080 Ti you need a much smaller scale so that the algorithm trains in a reasonable time. Likewise, instead of 500,000, the memory size I've taken here is 30,000. Then there is the retraining part: the batch size, the usual things like learning rate and momentum, the number of training loops, and the kind of hidden layers we use.
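Roughly, the settings being described look like this. The episode, simulation, and memory numbers are the ones from the talk; the remaining names and values are illustrative placeholders, so check the configuration file in the repo for the actual settings:

```python
# Self-play
EPISODES = 30        # self-play games per iteration (AlphaGo Zero used ~25,000)
MCTS_SIMS = 50       # tree-search simulations per move (AlphaGo Zero used 1,600)
MEMORY_SIZE = 30000  # positions kept in memory, instead of a 500,000-game window

# Retraining
BATCH_SIZE = 256
TRAINING_LOOPS = 10
LEARNING_RATE = 0.1
MOMENTUM = 0.9
HIDDEN_CNN_LAYERS = [{'filters': 75, 'kernel_size': (4, 4)}] * 6  # a few small residual layers

# Evaluation
EVAL_EPISODES = 20    # games between the current network and the best one so far
WIN_THRESHOLD = 0.55  # the current network must win more than 55% to become the new best
```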
Next is the way you define the game: the game class, which essentially encodes the rules of the game and its winning positions. You define the game state here, then the allowed actions, which is essentially where the rules are implemented, then checks for when the game ends, the value of each step, how the score is maintained, and how an action is taken in the game. Those functions are all here; you can read through this code at your leisure.

Then this is the main IPython notebook. It starts with the initial imports, and then the code initializes the memory. The code has the ability to stop midway and reload from the last checkpoint where your memory and your model were saved, so it does that housekeeping and loads the initial state. Then it creates the agent, the player that is going to learn, and enters an infinite loop; it will just keep learning until you stop or kill it. This part starts the self-play phase and calls the play-matches function, which is defined in the functions file and which I'll go through in a moment. Then comes the retraining part; the replay function sits in the agent file. And then there is the final evaluation part, the tournament, where you pit the best player so far against the current player you have just trained.

This is the functions file. The play-matches function we were just discussing plays the game for the given number of episodes, the number of times you self-play. I think it was 25,000 for AlphaGo Zero; for us it's just 30 or 50. At each step it takes the players, sets their game state, and plays the game inside this while loop. Here it calls the act method, which in turn runs the Monte Carlo simulations; that act method is in the agent class. And then there is the memory-management part, where you add to the memory and keep it up to date.

Then let me move to the agent file. Here there is a User class, for when you are playing as one of the players yourself, and an Agent class, for the computer player. This is the simulate method, which basically uses the APIs of the Monte Carlo tree search: just as we discussed, it moves down to a leaf node, evaluates the leaf node, and then backpropagates, computing all of those values inside these calls. The act method is where the simulation is run: at each step it builds the search tree, runs the simulations, and whatever the tree search outputs is taken as the next action. Then there is some code for prediction, evaluating leaf nodes, choosing actions, and replay, which is nothing but retraining the network. There are calls into the model class, which you'll see in a moment and which holds the Keras code, plus some code for prediction and for building the tree.

Then there is the Monte Carlo tree search class itself. You have a tree with a root node, you have edges, and the formulas we saw for U and Q are computed at the various stages; all of that code lives in these Monte Carlo helper methods. Finally, you have the model in Keras: if you look down here, it's the residual CNN we described, with the conv layer, the residual layers, the value head, and the policy head.
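Here is a rough Keras sketch of that dual-headed residual network, using the Go-sized dimensions from earlier (a 19 by 19 board, 17 input planes, 40 residual blocks). The filter counts, head sizes, and output names are illustrative assumptions rather than the exact layers in the paper or in the repo's model class; for Connect4 the board, the number of input planes, and the number of residual layers would all be much smaller:

```python
from tensorflow.keras import layers, Model


def conv_block(x, filters=256):
    # convolution -> batch normalisation -> ReLU, as in the conv-layer slide
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)


def residual_block(x, filters=256):
    # two conv blocks with a skip connection around the second one (ResNet style)
    shortcut = x
    y = conv_block(x, filters)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])
    return layers.ReLU()(y)


def build_network(board_shape=(19, 19), input_planes=17, n_residual=40, n_moves=19 * 19 + 1):
    # input: binary planes for the black/white history plus a "whose turn" plane
    inp = layers.Input(shape=(*board_shape, input_planes))
    x = conv_block(inp)
    for _ in range(n_residual):
        x = residual_block(x)

    # policy head: a probability for every possible move
    p = layers.Conv2D(2, 1, use_bias=False)(x)
    p = layers.ReLU()(layers.BatchNormalization()(p))
    p = layers.Flatten()(p)
    policy = layers.Dense(n_moves, activation="softmax", name="policy")(p)

    # value head: a single number in [-1, 1] estimating the game outcome
    v = layers.Conv2D(1, 1, use_bias=False)(x)
    v = layers.ReLU()(layers.BatchNormalization()(v))
    v = layers.Flatten()(v)
    v = layers.Dense(256, activation="relu")(v)
    value = layers.Dense(1, activation="tanh", name="value")(v)

    return Model(inp, [policy, value])
```

The output names "policy" and "value" match the ones assumed in the retraining sketch earlier, so the two snippets line up.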
So everything we saw in the slides is implemented there. Then finally, the last part is the memory class, which is just the data structures to maintain the memory. You can play with this code at your leisure, and the great thing is that it's open source, so you can plug in your own game, learn from it, and hopefully put it to use for some real problem. One of the main bottlenecks to productionizing reinforcement learning is that it needs a lot of data, so you need some kind of simulation environment in which you can run these kinds of self-play games. At present that limits the number of practical use cases you can apply it to, but wherever you do have the ability to run some kind of simulation, this is a very good general-purpose algorithm to try. So I think that's about it from me. Any questions?

[Audience question, partly inaudible, about the code and the search tree.]

Yes. The entire code is in this GitHub repo. All you need to do is change the game.py file with the game structure, the validation rules for the game, and so on. If you go through the code and change the game file, the rest of the code will just plug and play for you, and you can retrain. What I did here was take this code and train it on my 1080 Ti GPU; after about 36 hours it became unbeatable. The initial versions were making moves that were not very sophisticated and had gaps, but after training for a bit more than a day it became a version that is really unbeatable.

I can't hear you.

[Audience: The license of the repository is GPL, so if we use this, we need to make our own source code open source as well. Is that correct?]

I think so, yes. Okay. Thank you very much. Thank you.