Hey, morning, let me get this set up first, okay. Well, I'm Guillem Duran and this is Hacking Reinforcement Learning. Before I start, I would like to ask you a few questions. First, who here is already familiar with reinforcement learning? Well, a lot of people, nice. And who's expecting to learn something new about it today? Oh, okay, then I'm super sorry to disappoint you, because I got super sad when I found out that there was no breakfast and no social event, so I decided to go on strike. Instead of talking about reinforcement learning, I will just be telling a story about hacking AI corp and playing some video games.

So, first things first: what's AI corp? Well, every hacking story needs a big corporation, and AI corp is huge and super cool. People use its services to do AI stuff, but in the end there's some sort of token exchange where you can make pretty good money if you're smart enough. In order to do so, you first have to access the environment, or env, marketplace, where you can exchange money for data. Once you have enough data, you will be allowed to access the algorithm marketplace, where you can trade money for research. Everybody loves research, because once you have it, if you know how to do a little magic with it, you can transform it into a lot of money. So you only have to repeat this cycle over and over again and become a millionaire. The thing is that if we found a way to bypass the env marketplace and get data for free, then we would have more money to trade for research, so we'd get rich quicker. Today, I'm going to talk about the master plan to accomplish this.

As in any hacking process, first we need to gather some information about our targets, so let's start by talking a bit about reinforcement learning and planning. Most people use the environment to do reinforcement learning, which is like doing machine learning when you don't have labeled examples available. The environment represents your data source and also the task to be solved, for example playing an Atari game. So first you need to generate and label data by sampling from the environment, for example playing a game taking random actions. Each time you take an action, the environment returns several things: a numpy array, which represents an observation; a float, which represents the reward (the bigger the better); a Boolean indicating if the game has already finished; and a dictionary containing additional information. At training time you use all this information to create a label for each example, so you end up with a dataset of played games. All reinforcement learning algorithms are about creating labels which are good enough to be learned by a neural network. You then train a neural network with those examples, and you get a model which is capable of achieving high scores using only the observations at test time to play the games.
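To make that sampling loop concrete, here is a minimal sketch using the standard Gym interface; the environment name is just an example, and any Gym environment behaves the same way:

```python
import gym

# Collect one game played with random actions (a minimal sketch).
env = gym.make("MsPacman-v0")  # example environment
obs = env.reset()
dataset, done = [], False
while not done:
    action = env.action_space.sample()               # random action
    next_obs, reward, done, info = env.step(action)  # array, float, bool, dict
    dataset.append((obs, action, reward, next_obs, done))
    obs = next_obs
```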
And there are also a handful of weird people who use AI corp services to do planning rather than reinforcement learning. Both approaches aim to find the highest-scoring games, but when doing planning the approach taken is a bit different: instead of training a neural network, you can do whatever you want to sample the environment and get a good score. The aim of our hack is to use planning to generate super cool datasets so we can train our reinforcement learning models better. So let's talk about planning a bit.

A few years ago I met on Twitter Sergio Hernández, a Spanish mathematician who's even crazier than I am. Together we challenged ourselves to derive from first principles what we call Fractal AI, which is our own theory of artificial intelligence that, it turns out, can also be used to solve planning problems. We love weird challenges, so we basically did it for the pleasure of finding things out and having some fun along the way. At first we got really curious about a paper that defined intelligence as a thermodynamic process, within the framework of non-equilibrium statistical mechanics. Back then deep learning wasn't a thing in the field, so this guy, Alexander Wissner-Gross, worked out an equation for intelligence. This equation may look a bit intimidating at first, but it's actually easy to understand: it says that if you want to look intelligent in any situation, then your next action should be the one that leads to the highest number of good possible outcomes in the future. And what are those possible outcomes? Well, you only have to count all the paths that exist up to an arbitrary point in the future and map them to a score. This means that if we know the future, this equation tells us how to choose the right action. Unfortunately, knowing the future is not that easy, so the only thing we can do is sample a bunch of random walks and use them as if they were the full cone, which is like the space of all possible future actions.

Let me show you an example of how a cone can be approximated. You can see that the paths of the cone have two different colors: blue for paths that start by moving left, and red for paths that start by moving right. The action chosen at each step will follow the color distribution of the cone: if the cone is blue, the carts will turn left a lot; if it's red, they will turn right a lot; and if the colors are balanced, they will just keep going straight. This is because the environment here is continuous; if it were discrete, we would choose the action with the most popular color. How we sample the cone is important, because the bigger the cone, the better. The cone represents what the agent sees at a given moment, and in order to take an action it will only take into account the information that it sampled at that time step. This means that, according to the equations, we only need to calculate a super huge cone to have a god-like agent capable of solving any problem. Sounds cool, right?

So why does nobody use causal entropic forces? Well, you can see in the Google Trends plots how virtually nobody talks about it, and the few people who do, do it just to criticize it. It is so unpopular because sampling cones is super hard, even if you make simplifying assumptions: the complexity of the problem scales exponentially with the time horizon, the equations can only be solved analytically in very few cases, and Monte Carlo algorithms fail to approximate the cone when it gets bigger. If we knew how to solve the equations, we would have, like, the ultimate planning algorithm; but even though this approach makes sense intuitively, it isn't practical for solving the problems that appear in reinforcement learning papers. It is impossible to calculate a big cone exactly, but is it possible to quickly sample a super small subset of paths that approximates the full cone well? This is the question that we tried to solve.
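For reference, the equation in question is the causal entropic force from Wissner-Gross and Freer's paper; as best I can transcribe it, it reads:

```latex
% Causal entropic force: the force on the present state X_0 points
% toward the macrostates with the most diverse reachable futures
% over a time horizon \tau (T_c plays the role of a temperature).
F(X_0, \tau) = T_c \, \nabla_X S_c(X, \tau) \big|_{X = X_0}
```

And a naive way to approximate the cone with random walks, in the spirit of the slides, could look like the sketch below. Everything here is hypothetical glue code: `clone_env` stands for whatever mechanism gives you a fresh copy of the current state, and scoring rollouts by accumulated reward is a simplification of counting good outcomes.

```python
import numpy as np

def choose_action(clone_env, n_actions, n_paths=100, horizon=50,
                  rng=np.random.default_rng()):
    """Approximate the causal cone with random walks: remember the
    first action of each sampled path and how well it ended, then
    pick the action whose paths look best (the 'color' of the cone)."""
    votes = np.zeros(n_actions)
    for _ in range(n_paths):
        env = clone_env()                     # copy of the current state
        first = action = rng.integers(n_actions)
        total = 0.0
        for _ in range(horizon):
            _, reward, done, _ = env.step(action)
            total += reward
            if done:
                break
            action = rng.integers(n_actions)  # random walk after step one
        votes[first] += total
    return int(np.argmax(votes))
```

The exponential blow-up mentioned above is visible here: covering a horizon `H` with `A` actions would need on the order of `A**H` paths, so any fixed budget of `n_paths` covers a vanishing fraction of the cone.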
During the last five years we spent most of our free time trying to figure out a way to sample interesting sequences of actions. We only had our own laptops, so we needed to do it under heavily constrained computational resources; we were forced to find a new method in order to have complete control over how we explore the space of all possible trajectories. After trying and failing literally like 500 times, we managed to create a sort of stochastic wave function that allowed us to do exactly that. In the end, Fractal AI is nothing but a set of rules about how to define swarms and move them around in any state space. Our theory also offers tools to measure those swarms and compare them, so they can be used to solve integrals and different kinds of equations.

So now that we know the background we needed, it's time to find vulnerabilities and build exploits. If we want to properly hack reinforcement learning, we need to know how this workflow is implemented. It turns out that all the modeling amounts to a bunch of TensorFlow code, and the environment follows the OpenAI Gym interface, which has become a de facto standard in the field. We want to preserve this public API so that the TensorFlow code doesn't notice how we're hacking it. So we're going to perform a good old man-in-the-middle: we position ourselves as a proxy between the TensorFlow code and the environment. First, we discard the action sent by the TensorFlow code and calculate our own. How? Well, we read the internal state of the environment and then calculate our malicious action. This action is injected first into the environment, so we get better games, and then we also inject it inside the info dictionary, so it can be recovered later without the TensorFlow code noticing what we are doing. In order to calculate those actions, we will use two different zero-days, which we call Fractal Monte Carlo and the Swarm Wave. They will allow us to inject actions that lead to superhuman-performance games while keeping the original reinforcement learning flow intact.
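As a sketch, the man-in-the-middle can be written as a standard Gym wrapper; `plan_action` and the `"injected_action"` info key are hypothetical names, standing in for whatever planner computes the malicious action:

```python
import gym

class ManInTheMiddle(gym.Wrapper):
    """Proxy between the TensorFlow code and the real environment:
    the public Gym API is preserved, but the agent's action is
    discarded and our own is injected instead."""

    def __init__(self, env, plan_action):
        super().__init__(env)
        self.plan_action = plan_action  # reads env state, returns an action

    def step(self, action):
        injected = self.plan_action(self.env)       # ignore `action`
        obs, reward, done, info = self.env.step(injected)
        info["injected_action"] = injected          # recoverable later
        return obs, reward, done, info
```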
We will be using swarms because they are the fastest way we found to calculate expectations about how good the paths that follow a given action will be. Swarms allow us to take advantage of both the information contained in the state space itself, that is, the observations, and also the reward function. Let me show you an example of how a swarm works. You can see that a swarm is nothing but a cloud of points that moves around randomly while also avoiding crashing against the walls. This cloud of points is in itself nothing but a mathematical curiosity, but if we care to record the path that each of these points, also called walkers, follows, we end up with a cone which is truly bigger than anything you can calculate with other algorithms. The only problem with swarms is that they are the closest thing to alien mathematics that we have seen so far, because they are built on a different set of assumptions than traditional statistics and probability theory. We think it should be possible to prove that, given enough time and computational resources, they end up working; but when you try to measure how well they perform with respect to other algorithms, you run into a lot of trouble. That's why we hypothesize that this family of algorithms lies beyond what can be proven with currently known mathematical tools.

So the only option we have is to try them out on real problems and see if they perform well enough to solve them, and forget about rigorous proofs, because for now we don't have any alternative. So let's present our zero-days. The first one is called the Swarm Wave. This algorithm basically consists of moving a swarm and recording the paths it follows, so we end up with a cone. Unlike other planning algorithms, which build a tree of paths after taking each step, when using a Swarm Wave you only need to sample one cone in order to know the whole trajectory that the agent will follow. This is what a Swarm Wave looks like: it builds the paths by keeping track of the trajectories of the walkers, and keeps expanding this tree until you reach the desired time horizon. This is not exactly the same algorithm that you can find in the repository, because this is an older version of the project, but the idea is pretty much the same. For example, in this case we managed to calculate a path in a continuous environment which was 1,500 steps long.

Our other zero-day is what we call the Fractal Monte Carlo (FMC) algorithm. If we wanted to calculate paths the way other algorithms do, we would use this one: instead of using only one big tree to calculate the cone, we calculate a mini cone after taking each step. It is true that this algorithm requires more computational power, but it greatly increases the performance of the agent. Another cool thing about Fractal Monte Carlo is that it allows us to approximate a utility distribution over the actions.

But let's see how it works in an environment which is more difficult than the one used to show how the Swarm Wave works. In order to test the FMC algorithm, Sergio built what we call the hardcore lunar lander, which is an environment where you can deploy several spaceships. Each spaceship has two continuous degrees of freedom, which represent its two propellers; it has a fuel bar and some health points; and, to make it more interesting, we attach it to a rubber band with a hook at its end. The spaceship is rewarded if it uses the elastic hook to pick up a rock, and it is rewarded again if it manages to drop the rock inside the inner green circle. Once it drops the rock, the rock cannot be picked up again until the ship leaves the outer dashed circle. The game finishes if the rocket crashes or runs out of fuel. We have drawn the cone here the same way we did with the carts, but in this case things get a little messier. The gray lines represent the future trajectories that the spaceship will follow, while the colored lines represent the trajectories of the hook; if the hook's path changes color, it means that at some point in the future the hook got either attached to or detached from the rock. It's actually easy to check that, even though the agent is only exploring a tiny fraction of the entire space of future paths, it is capable of behaving reasonably well. For example, here you can see how the spaceship takes advantage of the elastic properties of the hook to catch the rock again and again and score several times in a row before losing it. It looks like our hypothesis of finding a small subset of useful paths still holds in these weird environments. So let's see how we can take advantage of it to build our exploits.
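Stripped of all details, the difference between the two zero-days is where the planning happens: the Swarm Wave samples one cone for the whole game, while FMC grows a mini cone before every action. A control-loop sketch of the FMC side, with hypothetical names, looks like this:

```python
def fmc_episode(env, plan_step, max_steps=1000):
    """Fractal-Monte-Carlo-style loop (sketch): before every action,
    `plan_step` samples a small swarm/cone rooted at the current state
    and returns the first action of the most promising paths."""
    obs = env.reset()
    history, done, t = [], False, 0
    while not done and t < max_steps:
        action = plan_step(env, obs)        # mini cone per step
        obs, reward, done, info = env.step(action)
        history.append((action, reward))
        t += 1
    return history
```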
So now we have everything we need, so let's exploit some stuff, because it's demo time. OK, now it's time to play Pac-Man. While I speak, I will just leave a Swarm Wave solving a Pac-Man game. I hope. Yeah, it's working, perfect. OK, in order to run the Swarm Wave you have to take into account several parameters. First, the number of walkers, which is the size of the swarm that you will be using: the bigger the better, and this value is small enough to allow running a live demo. I'm not sure what we'll get, because I didn't fix the random seeds, but we should score between 20,000 and 30,000 points. You also have to set some hard limits on the maximum score that you want to sample and on the maximum number of steps that you are allowed to take. These numbers here control how many times we repeat the same action once it has been chosen: most algorithms fix a number, which they call the frame skip, but in this case we just sample the number of repetitions from this distribution here. Then we only need to create an environment, choose the kind of distribution that we will use as a prior, and run our Swarm Wave.

It should be about to finish... and it turns out that it performed fairly well. Okay: it took one minute and 18 seconds and it scored 36,000 points. Let's see how it played. If you want to run the same demo, all the material is in the online repository; you only need to check the link in the abstract of the talk. So yeah, pretty awesome, right? Oh wait, there's more video games to come.

So what does getting 36,000 points in one minute and 18 seconds mean? I built a few files benchmarking the algorithm against both human performance and the two papers I have linked here, but as this is a bit boring, let's just plot some graphs. First, the score. Here we are comparing the score that the Swarm Wave achieved against different algorithms from other papers. For example, here you have human performance, which was about 40% of what we achieved; UCT, which is the kind of search that AlphaZero uses and which we also benchmark against; this one, which is the state of the art; and these two algorithms, which run on a cluster on Amazon. The score alone is not that impressive, so why are we using this instead of other algorithms? Well, because when you care to plot the efficiency of the algorithms, you get this, which shows how many orders of magnitude faster the Swarm Wave is with respect to the other things you could use. For example, you can see that it plays about five times faster than a human, but when you compare it with other algorithms, it plays about 5,000 times faster than a cluster on Amazon, or about 30,000 times faster than the algorithm that AlphaGo uses. This is basically the only way you can run a live Pac-Man demo in a talk.
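Putting those parameters together, the setup looks roughly like the sketch below. The class and argument names are hypothetical stand-ins for the ones in the talk's repository, and the numbers are illustrative:

```python
import numpy as np

params = dict(
    n_walkers=150,                 # swarm size: the bigger the better
    max_score=40_000,              # hard limit on the score to sample
    max_samples=300_000,           # hard limit on environment steps
    # instead of a fixed frame skip, sample how many times each chosen
    # action is repeated:
    dt_sampler=lambda: np.random.randint(2, 6),
)
# env = make_pacman_env()                        # hypothetical helper
# wave = SwarmWave(env=env, prior="uniform", **params)
# wave.run()
```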
This was the Swarm Wave, but what about Fractal Monte Carlo? Let's see how it works. In this case I'm making the environment a bit different, because I'm choosing not to clone the random seeds. So instead of a deterministic environment, what the agent sees is a stochastic one, and it's not capable of properly predicting where the ghosts will be in the future. We are taking at most about 2,500 samples per step, and it just tries to do the best it can. To put that in context, the other algorithms we compare against usually use about 150,000 samples per step instead of 2,500.

OK, so this time it didn't manage to finish the first level. With these parameters you can expect to sometimes get up to about 10,000 points, but given that the task is so difficult, we would need much more computational power to run it in real time.

This was just a small proof of concept of how we can solve the games, but we are not actually performing the hack yet. In order to do so, we need to impersonate the interface of OpenAI Gym. We can do that, but in this case I'm showing you something a bit more interesting: hacking not OpenAI Gym but OpenAI Baselines, which is the de facto standard for high-quality implementations of reinforcement learning algorithms. There, instead of using only one environment, they use several in parallel, so instead of getting one observation you get a batch of them. This time I'm using a pre-calculated database of Qbert games that I computed yesterday while visiting the castle. This basically allows you to load the games and replay high-quality samples without your algorithm noticing it. You can see here that when I call step, I just pass a string: it's completely ignored, and we only take the actions that our algorithm chose to take. A funny thing is that when you get 100,000 points, the counter restarts, but you can keep playing as long as you want. When using this kind of environment, you need to choose how long you want the pre-calculated chunks of games to be; in this case we are playing either until we lose one life (see, it just restarted the counter) or until we reach 35,000 points. Another curious, weird thing about Qbert that I don't really understand is that sometimes when you fall off the cliff you don't lose a life. I'm not sure if it's a bug or what, but our agent is clearly taking advantage of it. OK, enough for today; this could keep playing forever.

So we managed to implement the proof of concept of the hack that I introduced a few minutes ago. But instead of mimicking the environment interface, you can also generate data in several other ways. For example, we also have utilities for generating whole games, so you can save them and use them later for whatever you want; or, instead of whole games, examples in a more supervised-learning fashion. For example, here you can create a generator which outputs the current observation, the action that should be taken, the reward that you will get, what the next observation will look like, and the Boolean indicating if the game is over. And if you want a batch of examples instead of a single one, to feed it to Keras for example, you can use this function here.
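A sketch of what those two helpers could look like, assuming each saved game is a list of (observation, action, reward, next_observation, done) tuples; the names are hypothetical, not the repository's actual API:

```python
import numpy as np

def examples(games):
    """Yield one supervised-learning example per recorded transition."""
    for game in games:
        for obs, action, reward, next_obs, done in game:
            yield obs, action, reward, next_obs, done

def batches(games, batch_size=32):
    """Group examples into batches, e.g. to feed a Keras model."""
    buf = []
    for example in examples(games):
        buf.append(example)
        if len(buf) == batch_size:
            yield tuple(np.stack(col) for col in zip(*buf))
            buf = []
```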
So this is pretty much the demo that I wanted to show you, but we only played one game, so what can we expect if we try it on other kinds of environments? Generating data with our hack is actually faster and cheaper than having a human play any Atari game except Pitfall, and using a wave you can even beat the human record in 32 out of 50 games, which are the ones that are always tried in planning papers. In these two games here, Boxing and Ice Hockey, you get a reward of plus one if you manage to score a goal in hockey or hit your opponent in boxing, and a reward of minus one if it happens the other way around. This means that even though the rewards are relatively sparse, these kinds of environments can easily be solved. It is even possible to get relatively good performance on Montezuma's Revenge, which is a game with extremely sparse rewards: after starting a game, you only get rewarded once you are able to get the key, which requires a sequence of precise movements during which you won't be getting any reward at all. Meanwhile, the wave can handle it, because the swarm also uses the information contained in the observations. This means that when no reward is available, the algorithm just focuses on exploring new regions of the state space. Although it would be better to derive a specific algorithm for that kind of problem, it's nice to see that the Swarm Wave is also pretty robust.
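To give a flavor of how a swarm can keep exploring when there is no reward, here is one plausible sketch of that balancing step, based on the ideas just described rather than on the repository's actual code: each walker compares itself with a randomly chosen companion, and novelty (distance in observation space) multiplies the reward signal, so when rewards vanish, distance alone spreads the swarm over new regions.

```python
import numpy as np

def virtual_reward(rewards, observations, rng=np.random.default_rng()):
    """Blend score and novelty into one signal per walker (a sketch).

    rewards:      array of shape (n_walkers,)
    observations: array of shape (n_walkers, obs_dim)
    """
    companions = rng.permutation(len(rewards))
    # novelty: distance of each walker to a random companion
    dist = np.linalg.norm(observations - observations[companions], axis=1)
    # normalize both signals so neither dominates by scale alone
    r = (rewards - rewards.min()) / (np.ptp(rewards) + 1e-8)
    d = dist / (dist.max() + 1e-8)
    # walkers with low virtual reward would then clone to better ones
    return (r + 1e-8) * d
```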
But the Swarm Wave really shines when we don't have sparse rewards. In that case, Atari games are pretty easy to solve: with a swarm of only 15 walkers you can solve Asteroids, and solving Pac-Man takes about 5,000. A funny thing about those environments is that most games have a hard-coded maximum score: once you reach it, the score will either stop increasing or reset the counter to zero, for example in Pac-Man or Qbert. But Atari games are not the only kind of problem a Swarm Wave can solve. You can also try more challenging environments, like Sonic or other Sega games in the OpenAI Retro library, even with a relatively small swarm; for example, this Sonic GIF you can see here was calculated using a swarm of 200 walkers, and on this laptop it took less than two minutes. If you want to go for a more hardcore environment, you can install the DeepMind Control library, which allows you to try it on robots. In this case we used the same number of walkers that we used to solve Pac-Man, that is 5,000, to get the robot to walk pretty well, even though it looks a bit drunk. Taking into account that we were sampling a six-dimensional action vector completely at random, it could be worse.

Well, this is how the Swarm Wave performed; now let's take a look at Fractal Monte Carlo. Here we show the difference between using the method proposed in the original causal entropic forces paper versus using Fractal Monte Carlo when both are bound to the same number of samples. This means that using FMC to solve Atari games is complete overkill, so instead of the traditional environments you can find in Python, we will use the custom lunar lander to show how it performs. It turns out that this hardcore lunar lander is an environment which is very difficult for humans to solve. Flying the rocket alone is a difficult task on its own, but when you attach the hook and the rubber band it becomes nearly impossible to control. This is due to the bouncing of the hook, which creates a chaotic motion, and it turns out that humans are not really good at predicting that kind of chaotic movement. But, well, it looks like the rocket managed to solve it pretty easily, so why don't we make it harder and make the rocks much heavier, so that more than one rocket is needed to carry them around? This increases both the degrees of freedom and the difficulty of the problem, making it a much more challenging task. All the examples here were calculated using 100,000 samples per action, even though we only had one core available; this was about the hardest problem that we could solve, and creating videos a few minutes long took several hours.

We also tried increasing the number of agents in tasks which were easier to sample, to see how the algorithm scales with the degrees of freedom of the agents. Besides moving rocks around, for example, you can make several agents collect food, which is represented by these green bubbles distributed across the environment. In this case the agents are greedy: they just care about collecting food and not crashing against each other. But we can take this kind of task one step further and force the agents to maintain a formation while gathering food. For example, here the cars also want to drop food inside the pots, and the spaceships just want to keep their food level high, eating when needed. If you also tell them that they like to be close to each other, you end up getting these kinds of weird elastic motions. These are the results of trying to minimize several cost functions at the same time, and depending on how you define the personality of the agent, which is how the weights of the different objectives are balanced, you can influence how rigid the formation will be and how often it will break to gather food.

This is an example of how you can actually hack OpenAI Baselines. The only thing you need to do is comment out the make_atari function that the library provides and use the one that we make, and, when sampling the environment, just ignore the actions that you sampled and recover them from inside the info dictionary. So pretty much with three lines of code you can use any reinforcement learning algorithm.
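A sketch of what those three lines amount to: `make_atari` is the real helper in `baselines.common.atari_wrappers`, while `our_hack`, `our_make_atari`, and the `"injected_action"` info key are hypothetical names for our side of the hack. The last two lines live inside Baselines' sampling loop, where `env` is the vectorized environment and `actions` is the batch the policy produced:

```python
# 1. Swap the environment factory for our man-in-the-middle version:
# from baselines.common.atari_wrappers import make_atari      # original
from our_hack import our_make_atari as make_atari             # proxy env

# 2-3. When sampling the (vectorized) environment, ignore the actions
# the policy produced and recover the injected ones from the info dicts:
obs, rewards, dones, infos = env.step(actions)
actions = [info["injected_action"] for info in infos]
```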
So we finally managed to hack the environment and get data for free. And actually, using only the code presented in this talk, approaching any inverse reinforcement learning problem should be feasible. This means that we can treat reinforcement learning as a supervised task without any human intervention at all. Generating a dataset of one million games of Pac-Man is far cheaper than training an agent to play Dota or even StarCraft 2, and once you have that dataset, you can just overfit a deep network and see how it performs; I bet you can get pretty good performance if you do that. If you're bored enough, you could even use that trained model as a prior for generating another million games, which would have even higher scores because of what was already learned, and repeat the cycle until you get the performance that you want. But that's a topic for another story. For now, let's just have some fun trying out the new hacking techniques presented, because this is how the first volume of my tales about hacking reinforcement learning finishes.

Before moving to the Q&A, I'll talk a bit about myself. I am Guillem Duran Ballester, and I'm a PyData Mallorca organizer. I studied telecommunications engineering, and a few years ago I learned to code in Python so I could hack some AI research as a hobby. But I must confess that my dream is to become a proper scientist and make a living out of my passion, so if you happen to know some real AI researchers, please tell them that I would be super happy to work with them. So yeah, this is it. I hope you liked it, and feel free to take a look at the repository of the talk. It was a pleasure to be here. Thank you very much.

[Moderator] Thank you. We have five minutes for questions. If you have a question, can you queue on this side of the room, please? I'm not passing the microphone, so can you just come here and ask the question at the microphone? Thank you.

[Q] What is the reward you use for the Pac-Man game?

[A] The reward is the score that you can see on the screen. You get rewarded if you eat one of the cherries, like 50 points; if you eat one of the pills that allow you to eat the ghosts; and if you eat a ghost, you get different rewards depending on how many of them you eat. If you eat all four of them, you get a super huge reward.

[Q] Thanks for the talk. Are there any similarities between the Swarm Wave and particle filtering? Particle filtering is some kind of swarm particle optimization, more like a sequential Monte Carlo.

[A] Yeah, the idea is pretty much the same: it's about sampling a distribution which matches the true distribution of your state space. Let me show you here: basically you distribute a bunch of walkers randomly, but the way they are distributed doesn't match the actual reward distribution of the function, so what we do is move them around after each iteration, and you pretty much end up with a distribution of walkers which matches the distribution of the reward. All the code is commented and available in the GitHub repository, so you can take a look at it, and grab me in a coffee break so I can explain it in depth.

[Q] Nice talk, thanks. The benchmarks you did against deep learning in TensorFlow, did you run them with a GPU?

[A] No. Actually, I haven't done a proper benchmark yet, because it requires using 1,500 steps per sample and it's super expensive, so I have not calculated it, but I'm on it; I'm getting access to a budget. If you want to see how it works, I have benchmarked it against two papers, and in the repository there is also an Excel spreadsheet where you can find the performance of all the existing algorithms compared to what we are getting, both learning algorithms and planning algorithms. I think the only fair way to compare this is to benchmark it against the human records, because the difference in the scores is super huge.

[Q] What about the OpenAI Gym benchmarks?

[A] Yeah, but they're completely outdated; no one uses them anymore, and I think they were actually removed. I used to use the OpenAI Gym benchmarks, but nobody was using them, so I only took a few scores from there a few years ago. You can find in the Excel spreadsheet all the references from the papers that we used to create it, and yes, Twin Galaxies is where we got the table of human records.

[Moderator] OK, I want to thank Guillem for the talk; it was really nice to see the videos. Thank you again.

Thank you very much.