Reinforcement learning: a method of learning where an agent receives positive rewards for good moves and penalties for bad ones. This simple concept underlies many of the AI breakthroughs we've seen over the past few years. AlphaGo Zero, for example, learned the ancient game of Go from scratch by continuously playing against itself, and within 40 days it surpassed every human and computer to become the best Go player in the world. Deep Q-networks combine Q-learning with neural networks to let agents learn and complete tasks in a variety of environments.

However, this carrot-and-stick approach, where every action is either rewarded or penalized, has some disadvantages. What if you are nowhere near your objective? Then an action has no properly defined reward or punishment, and a sequence of such actions can cause an agent to just go around in circles.

For example, consider an agent in a supermarket trying to find some cheese. Let's also assume the agent has no knowledge of the environment and starts in an aisle containing meat. One possible action is to move to the next aisle, where the agent now sees fish. On the next action it might try to move to the next aisle again, but there's also nothing stopping it from going back to the meat aisle. If not on the first pass, the agent might eventually get stuck checking the same aisles over and over again. So what happened here? When the agent moved to the fish aisle, it didn't actually get any closer to the cheese, so there was no defined positive reward. But intuitively we know it makes sense for the agent to keep exploring instead of going back to what it has already seen.

On October 24th, Google dropped a blog post and paper on how to make an agent explore using curiosity. The idea is to add a second type of reward: besides the reward for getting closer to the objective, the agent is rewarded for discovering new parts of the world. So how does this help? The agent starts in the meat section, sees no cheese, and moves to the fish section, where it still sees no cheese. At this point it has two options: move back to the meat section or move on to the next aisle. This time the agent will almost certainly move to the new aisle. Why? Because it gets a reward for seeing a new part of the supermarket, the pots and pans section, while moving back to the meat section earns no reward because it has already been seen. This reward for seeing new parts of the world is the equivalent of rewarding curiosity in human beings.

Question: how does the agent determine whether it has seen a place before? The agent keeps snapshots of the states it has seen in an episodic memory. However, comparing new observations directly against that memory doesn't help much; you may be looking at the same room from different angles. Instead, we can train a neural network that takes as input the current observation and the previous states stored in the agent's memory. The model can then estimate how many steps separate the current observation from one already in memory, or, more simply, act as a binary classifier that says whether the current observation is reachable within some k steps of something in memory or not. The results are pretty astounding: we now have an agent that explores the world and no longer loops in a corner.
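To make that concrete, here is a minimal Python sketch of the reachability-based bonus, under my own assumptions: the names embed, reachability_score, and the threshold and bonus values are placeholders I made up, not the paper's API. In the real system both functions are trained neural networks; the toy versions below only exist so the control flow actually runs.

```python
import numpy as np

NOVELTY_THRESHOLD = 0.5  # comparator score below this => observation looks new
CURIOSITY_BONUS = 1.0    # reward added on top of the task reward

def embed(observation):
    """Stand-in for a learned embedding network: compress an observation
    (e.g. a camera image) into a small feature vector."""
    rng = np.random.default_rng(abs(hash(observation)) % (2**32))
    return rng.standard_normal(8)

def reachability_score(emb_a, emb_b):
    """Stand-in for the trained comparator. In the paper this is a classifier
    trained on pairs of observations, predicting whether one is reachable from
    the other within some k environment steps. Here we fake it with a
    similarity measure just to make the sketch runnable."""
    return float(np.exp(-np.linalg.norm(emb_a - emb_b)))

def curiosity_bonus(observation, memory):
    """Give a bonus only if the observation does not look reachable from
    anything already stored in episodic memory."""
    emb = embed(observation)
    if memory:
        # Compare against everything seen this episode and keep the best match.
        best_match = max(reachability_score(emb, m) for m in memory)
    else:
        best_match = 0.0
    if best_match < NOVELTY_THRESHOLD:  # nothing in memory explains it => new place
        memory.append(emb)              # remember it so it isn't rewarded twice
        return CURIOSITY_BONUS
    return 0.0                          # already reachable from memory => no bonus

# Toy episode: each genuinely new aisle earns a bonus, revisiting one does not.
memory = []
for obs in ["meat aisle", "fish aisle", "meat aisle", "pots and pans aisle"]:
    total = 0.0 + curiosity_bonus(obs, memory)  # task reward (0 here) + curiosity
    print(obs, "->", total)
```

The key property is that a place is rewarded only the first time it lands in episodic memory: going back to the meat aisle earns nothing, while each new aisle still does.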
So how is this better than past attempts? In past models, the reward came from surprise rather than curiosity. Every time the model took an action, it would predict the resulting state, and if the prediction turned out to be wrong, it was rewarded. Intuitively, this entices the model to explore the parts of the world it predicted incorrectly. However, this surprise-based reward has its own drawback: the agent gets stuck when it encounters a TV.

Why would that be the case? Say the model moves from state A in some aisle to a state B facing a TV. Seeing a TV for the first time, the model would never have predicted such a state, so the surprise reward brings the agent to state B. Now the agent has several options: move forward, move back, or stay where it is. If it stays put, the TV flips channels randomly, which again is not something the agent would have predicted, so it is surprised and rewarded for staying in the same place. This cycle repeats, because the agent can never correctly predict the state of a TV that changes at random. It gets stuck in front of the TV and stops progressing toward the original objective of finding the cheese. This is similar to procrastination in human beings.

The episodic-memory-based curiosity we discussed before, however, avoids this. Say the agent is in front of the TV. Every time the channel flips, the agent observes a new state and stores it in memory. But since the number of channels is finite, once the agent has seen them all, it is no longer rewarded for seeing the same channels again, and it moves on to the parts of the world that are still unexplored.

This research introduces a fundamentally new way of shaping how agents interact with their environment, one that should be useful across reinforcement learning research. Looking forward to reading more about it in the future. Check out the paper and the blog post in the links down below. When I got the Twitter notification for this paper, I just had to make a video on it. Hopefully the paper is now a bit more accessible. Now hit that like button, hit that subscribe button, ring that bell for notifications, share the video, do what needs to be done, and I'll see you in the next one. Bye-bye.
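For contrast with the episodic-curiosity sketch above, here is an equally rough sketch of the prediction-error ("surprise") bonus described in the transcript and why the TV traps it. The ForwardModel below is a toy stand-in of my own, not the implementation from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class ForwardModel:
    """Toy stand-in for a learned dynamics model: it simply remembers the last
    next-state it saw for each (state, action) pair and predicts that."""
    def __init__(self):
        self.table = {}

    def predict(self, state, action):
        return self.table.get((state, action), np.zeros(4))

    def update(self, state, action, next_state):
        self.table[(state, action)] = next_state

def surprise_bonus(model, state, action, next_state):
    """Reward equals the forward model's prediction error on the transition."""
    error = float(np.linalg.norm(model.predict(state, action) - next_state))
    model.update(state, action, next_state)
    return error  # bigger prediction error => bigger reward

model = ForwardModel()

# A normal aisle: the next state is always the same, so the surprise dies out.
aisle_next = np.ones(4)
print("aisle:", [round(surprise_bonus(model, "aisle", "stay", aisle_next), 2) for _ in range(3)])

# The TV: channels flip randomly, so the model is always wrong and the agent
# keeps getting rewarded for standing still in front of it.
print("tv:   ", [round(surprise_bonus(model, "tv", "stay", rng.standard_normal(4)), 2) for _ in range(3)])
```

Because the TV's next state is random, the forward model's error never goes to zero and the surprise bonus never dries up. With the episodic-memory bonus sketched earlier, each channel is rewarded at most once, so the agent eventually moves on.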