Artificial intelligence is one of the biggest buzzwords of the decade, and AI systems are usually built from human-generated data. But such data can be unreliable, or simply unavailable. For example, if we want to build a machine learning classifier that recognizes objects, we can't just tell a machine to do it. It needs a labeled dataset: images containing an object, and the corresponding object category. You'll agree with that, right? Now, these categories are manually labeled by humans and can be collected from one or many sources. Regardless, for this problem and many other machine learning problems, there may not be enough labeled data out there for a machine to effectively learn anything. And even if data is available, there's no guarantee that all the labels are correct. Since we're going to talk about gameplay, here's a more fitting example: gameplay data taken from humans is only as good as the humans playing those games, and humans are prone to errors. So one thing is certain: incorrect data leads to misleading results. The less human intervention in the machine learning process, the higher the chance of fitting a model that adequately generalizes the task at hand.

This is the philosophy behind DeepMind's AlphaGo Zero, and it's the primary reason it is the best Go player in the world, among both humans and human-dependent AI. Go is an ancient two-player board game. Invented in China over 2,500 years ago, it is one of the oldest board games still played today. AlphaGo Zero is the successor of the famous AlphaGo. AlphaGo was the first computer program able to defeat one of the best Go players in history, Lee Sedol, with an overwhelming 4-1 victory. AlphaGo didn't just defeat prodigious Go masters; it also used techniques considered revolutionary, which are still studied by enthusiastic Go players today. To think that a machine could contribute this much knowledge to the most studied game in history is a feat in itself. AlphaGo learned the game by studying thousands of games played by grandmasters and other professionals, then sharpened itself with reinforcement learning, climbing the ranks to become the best Go player in the world.

However, all of this changed with AlphaGo Zero, a system that doesn't learn from games played by humans, but rather from games played against itself. By getting rid of human input data, this new system was able to beat its predecessor AlphaGo 100 games to 0. Now, we've seen success stories of other systems that learn this way. For example, DQN learns Atari games without human input data. So what makes AlphaGo Zero special? It's the game itself: Go requires a lot of forethought and has a much larger search space, because there are thousands of possible moves at every turn, and even after making a move there are thousands of further possibilities that branch from it. AlphaGo was the first AI to achieve superhuman performance in the game.

I couldn't find the exact implementation details of the AlphaGo that played Lee Sedol, but we do have the published version, AlphaGo Fan. It used two neural networks: a policy network and a value network. A policy is an action, or a set of actions, to take. The policy network is a supervised learning model that determines the next best move, that is, the optimal policy. The value network predicts the outcome of the game given the move made by the policy model. Once trained, these two neural networks are combined with Monte Carlo tree search.
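To make that division of labor concrete, here's a minimal, hypothetical sketch of the two-network idea: a policy network proposes a handful of promising moves, and a value network scores the position each one leads to. Every name here is mine, the networks are random stubs rather than trained models, and the real system runs this selection-and-evaluation pattern inside a full tree search rather than a single level of lookahead.

```python
import numpy as np

# Hypothetical stand-ins, not DeepMind's code: the real networks are deep
# convolutional nets; these stubs just make the control flow runnable.

def policy_net(state):
    """Supervised policy network: a probability for each legal move."""
    logits = np.random.randn(len(state["legal_moves"]))
    return np.exp(logits) / np.exp(logits).sum()

def value_net(state):
    """Value network: a scalar in [-1, 1], the predicted game outcome
    for the player to move in this state."""
    return float(np.tanh(np.random.randn()))

def lookahead_move(state, apply_move, top_k=5):
    """One level of the lookahead idea: the policy network narrows the
    candidates, the value network scores where each candidate leads.
    `apply_move` is an assumed helper that returns the resulting state."""
    probs = policy_net(state)
    candidates = np.argsort(probs)[-top_k:]  # keep the k most promising moves
    # After our move it is the opponent's turn, so we negate their value.
    scores = [-value_net(apply_move(state, m)) for m in candidates]
    return candidates[int(np.argmax(scores))]
```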
I'll explain MCTS in more detail shortly, but basically, Monte Carlo tree search was used with the policy network to narrow down the best possible moves, and the value network was used to evaluate the positions those moves lead to, providing a lookahead mechanism. Together, they allowed the AI to make an optimal move.

Now, AlphaGo Zero takes a different approach. There are four key differences between AlphaGo Zero and its predecessor. First, it learns purely through self-play reinforcement learning; there is no supervised learning on human data. Second, the only input features are the black and white stones on the board. Third, unlike AlphaGo, which uses two neural networks, it uses only a single neural network. And fourth, it uses a simpler tree search to determine the next best move, one that relies on that single network instead of random rollouts. This will make a lot more sense as I go through the details of AlphaGo Zero, so let's do that now.

Consider a neural network f, initialized with random weights, theta zero. This network consists of numerous layers of convolution, batch normalization, and activation, and it outputs two quantities. The first is a vector of move probabilities, P: the probability of taking each action A from a given state S. The second is a scalar, V, between negative one and one: the network's estimate of the outcome from the current state S, where a value near one means the current player is expected to win, and a value near negative one means a loss. The goal here is to train our neural network to estimate accurate move probabilities so it can make optimal moves in actual gameplay. In other words, the goal is to make the neural network good at predicting P, and P, over time, will help us determine the optimal policy.

When you play Go, or chess, or any other two-player game, what do you do before you make a move? You think. You think about the best move to make based on the board. AlphaGo Zero thinks and makes its decision using Monte Carlo tree search, MCTS. From a high-level perspective, MCTS takes the current state, that is, the board, and the output of the latest network, f with weights theta T minus one, in order to find the next best move. So how does it use the neural network? Initially, the neural network predicts some move probabilities, P. But AlphaGo Zero is not very confident in this policy, and it's right to think so; after all, we just initialized our neural network with random weights. We should take the current predictions of P with a grain of salt.

So here's a scenario. It's AlphaGo Zero versus AlphaGo Zero's ghost, and it's our turn. Time to think. AlphaGo Zero takes the current board state and executes MCTS. It simulates hypothetical continuations of the game, using the network's move probabilities to decide which branches are worth exploring and the network's value output to evaluate the positions it reaches; this is the simpler search mentioned earlier, with no random rollouts. The deeper down the tree it goes, the further into the future it simulates the current game, visualizing as many scenarios as possible. The Q values, the expected outcomes for every state along the way, are updated by backing them up the tree. This search yields better move probabilities, pi, that AlphaGo Zero is more confident in than the raw network output. So, since AlphaGo Zero has determined a better policy, it makes a move based on the new move probabilities pi.

Now, the weights of the neural network are updated such that its predicted move probabilities P more closely align with MCTS's move probabilities pi. Also, the network's predicted winner for the current state, V, should closely match the eventual outcome of the self-played game, Z. And so we have an objective function, a loss, that the neural network minimizes: it works to reduce the distance between the scalars Z and V, and the distance between the two vectors pi and P.
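In the paper, that loss is l = (z - v)^2 - pi^T log p, plus an L2 penalty on the network weights. Here's a minimal PyTorch sketch of it; the function name and arguments are my own, and in training, pi comes from the tree search's visit counts while z is the final result of the self-played game.

```python
import torch
import torch.nn.functional as F

def alphago_zero_loss(p_logits, v, pi, z, weights=None, c=1e-4):
    """Combined loss described above: (z - v)^2 - pi^T log p + c * ||theta||^2."""
    # Pull the predicted winner v toward the actual game outcome z.
    value_loss = F.mse_loss(v, z)
    # Cross-entropy between the MCTS-improved probabilities pi and the
    # network's move probabilities p (computed from raw logits for stability).
    policy_loss = -(pi * F.log_softmax(p_logits, dim=-1)).sum(dim=-1).mean()
    # Optional L2 regularization over the network weights, as in the paper.
    l2 = c * sum((w ** 2).sum() for w in weights) if weights is not None else 0.0
    return value_loss + policy_loss + l2
```

With a batch of self-play positions, you would call something like alphago_zero_loss(p_logits, v, pi, z, weights=model.parameters()) and backpropagate as usual.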
Once the ghost makes its move, we see the board at state ST plus one, and we do the same thing we did in the previous state. The neural network generates move probabilities P and predicts the winner from the current board, V. AlphaGo Zero is now a little more confident in these probabilities, so the search leans on them more heavily. From the current state, it again explores moves guided by the policy output P and collects value estimates along the way, until the thinking time for that move runs out. The Q values are backed up and updated for every state, and this new set of Q values determines a more confident move probability vector, pi at T plus one, which is used to determine the next best move. The neural network parameters are then updated to theta T plus one. And this process goes on.

I'm not sure if you've picked this up already, but here's the analogy. The idea of self-learning involves AlphaGo Zero playing itself to constantly improve. Its former self is the neural network; it helps MCTS generate a better policy, and that better policy in turn lets the neural network improve into its current self. I would make a Mario Kart analogy, where you race against your own ghost, but that isn't really the case here, because this ghost is helping you get better. I'd say the concept is more similar to GANs, generative adversarial networks, where the generator gets better at fabricating data, the discriminator gets better at recognizing fabricated data, and the two models help each other improve over time.

The result? AlphaGo Zero started out with nothing but the rules of the game and no prior knowledge. After three days, it surpassed human-level Go play. After 21 days, it reached the level of AlphaGo Master, the version that defeated top Go masters, including the reigning world champion Ke Jie, in May 2017. And after 40 days, it surpassed all previous versions of AlphaGo to become the best Go player of all time. A very impressive feat indeed; it makes you wonder about the rapid improvement of game-playing AI over the course of just a few years. We've come a long way, but there's still much to improve. A very intriguing future for a very intriguing field.

Thanks for stopping by today. If you liked the video, please give it a big thumbs up and subscribe to the channel for more awesome content. Subscribing gives you immediate access to all this wonderful content, and it makes me feel giddy to look at the numbers. It's a win-win situation, so do it. See you guys in the next one. Bye.