Hello, everyone. Good morning, good afternoon, and good evening. My name is Ray, and I'm from Uber AI. It's my great pleasure to present our recent work, an open-ended reinforcement learning algorithm called POET, and its follow-up, Enhanced POET, where we aim to create unbounded invention of learning challenges and their solutions in a single run of an AI algorithm. Before I dive into the content today, I want to briefly mention why I think machine learning at Uber is so interesting. As you probably know, Uber operates in the actual physical world. Many of our problems are in the spatial and temporal domain, where we run a very complex network involving real people and real economic impact. We move real people and real things in the real world, and that makes the machine learning problems at Uber uniquely challenging and very different from purely online applications. Uber has also dedicated a lot of effort to open source. If you go to opensource.uber.com, you will notice that we have open-sourced a lot of top-level projects. I list a few here, mostly in the AI and machine learning domain. For example, we have Pyro, in the space of large-scale probabilistic programming. We have Horovod, which is now an industry standard for distributed training across different frameworks. And recently we released Ludwig from Uber AI, a code-free deep learning toolbox. There are many more, so feel free to check out opensource.uber.com. We also extensively publish our research at top machine learning conferences, and you can find those publications at research.uber.com. Okay, let's dive into our topic today. Generally, people think of machine learning or AI algorithms as tools for solving given challenges.
For example, in image classification, people show an algorithm a picture and the algorithm tells whether it's a cat or a dog. In that kind of scenario, machine learning has been doing pretty well: in the famous ImageNet contest from Stanford, since about 2016 computer algorithms have performed better than human beings. For the task of playing video games, AI has been doing a really good job ever since 2015, when the Deep Q-Network was first invented, all the way up to 2018 and today. Machine learning has successfully played those very interesting Atari games and now achieves superhuman performance. Of course, I also want to mention the game of Go; you have probably heard of AlphaGo from DeepMind. In fact, since 2017, machine learning algorithms, with a lot of training and a lot of computational resources, can reliably beat the world's best human professional Go players, which the Go community considers a landmark achievement for artificial intelligence. Feel free to check out the movie about these achievements at the link I provide here. This all sounds great. It looks like machine learning and AI can do a lot of things if a human tells them what to do and shows them a lot of examples; they will be able to do whatever the human asks. But on the other hand, I want to ask a very interesting question, and I invite everyone to think about it: can machine learning and AI be truly creative? By creative, I mean: can a machine learning algorithm invent diverse and interesting problems by itself, and solve those problems by itself? To solve them, it would probably share and learn from the diverse experiences it accumulates while creating and solving those problems.
And hopefully, along the way, it will build its own curriculum, just like humans teach themselves a subject; the machine can hopefully teach itself through a curriculum it created by itself, and eventually solve something a naive algorithm cannot solve. Now, machine learning does already expose some kind of creativity in generating new media and designs. One of the most famous examples: ever since around 2014, people have used generative adversarial networks to generate pictures. For example, BigGAN, from one of the top institutions, recently produced very realistic pictures. These pictures all look very real, but they were actually generated entirely by computers; they are fake images. So it looks like machine learning algorithms have some creativity, creating things humans have never seen before. Not too different from what we have seen before, but still something new, and through this process machine learning algorithms expose some kind of creativity. But instead of diving into GAN algorithms, I want to point out a very interesting experiment that a group of computer scientists ran over a decade ago, called Picbreeder. They set up a website where a human can go and generate some random pictures at first. On the back end there is a kind of neural network called a CPPN, a compositional pattern-producing network, and it is a self-evolving neural network, which means it grows by itself: if you know about neural networks, they are all about nodes and connections, and this type of network starts from very simple ones, a few nodes with a few connections, and grows by adding nodes and weights. And this kind of neural network can also paint pictures.
So basically, humans generate some random pictures and send them to the back end; the CPPN takes a picture, paints it, and also evolves, becoming more complex and hopefully generating slightly more interesting pictures, which are sent back to the human. The same human, or someone else who picks up the task, selects the new pictures that seem interesting to them and sends them back into the system. It's a human-in-the-loop process for generating things together with the CPPN technology. And in fact, they generated some very interesting pictures, I would say. That was more than twelve years ago, but if you look at these pictures, they are really interesting. It's hard to imagine they actually came from completely random, noisy images; the CPPN, together with the human telling it which ones were interesting, is the only guidance the neural network got, and it eventually generated something really interesting. We got cars, we got apples, we got a human, we got a skull, we even got Jupiter. Very interesting. And by examining the results, because we know where each picture came from, all the way from some random picture to whatever the final product is, you can take a retrospective look at how each image arose, and you find that being creative is a very thought-provoking process, because you see a lot of stepping stones along the way. For example, this letter-G-like picture eventually becomes Jupiter, and this egg with a hat eventually becomes a teapot. There is no obvious relationship between these stepping stones and the final product, right? To get this teapot, you first needed to generate this egg with a hat. Stepping stones almost never resemble the final product. That's what people observed in this experiment, and it's very interesting.
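To make the CPPN idea above a little more concrete, here is a minimal sketch of a CPPN-style image generator: a tiny network that maps each pixel coordinate to an intensity. This is an illustration only; in Picbreeder the network topology itself evolves NEAT-style (nodes and connections are added over generations), whereas this sketch keeps a fixed topology and random weights standing in for an evolved genome.

```python
import numpy as np

def cppn_image(size=64, hidden=8, seed=0):
    """Render an image by evaluating a tiny fixed network at every pixel.

    Inputs per pixel: (x, y, distance-from-center); output: intensity in [0, 1].
    The weights are random here, standing in for an evolved CPPN genome.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=1.5, size=(3, hidden))
    w2 = rng.normal(scale=1.5, size=(hidden, 1))

    # Coordinate grid in [-1, 1], plus radial distance as an extra input.
    xs, ys = np.meshgrid(np.linspace(-1, 1, size), np.linspace(-1, 1, size))
    r = np.sqrt(xs**2 + ys**2)
    inputs = np.stack([xs, ys, r], axis=-1).reshape(-1, 3)

    # CPPNs typically mix activations (sine, gaussian, sigmoid); the periodic
    # and symmetric ones are what produce regular, organic-looking patterns.
    h = np.sin(inputs @ w1)
    out = 1 / (1 + np.exp(-(h @ w2)))   # sigmoid squashes to [0, 1]
    return out.reshape(size, size)

img = cppn_image()
print(img.shape)   # (64, 64)
```

Because the network is a smooth function of coordinates, mutating its weights (or, in the real system, growing its structure) deforms the whole image coherently, which is what makes the stepping-stone lineage of pictures possible.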
And eventually, people might have something in mind; they might try to find a teapot, or an egg, at first. Then they find an egg with a hat, and maybe one person gives up, another picks it up and continues to find something interesting based on what other people did before, and eventually they find the teapot. So basically, this kind of experiment, this kind of neural network, enables people to find things they were not looking for. We consider this kind of process very creative. And these stepping stones are actually very important, because they are how we solve hard problems. If you look at key innovations in the history of science and technology, you notice that to solve a hard problem, we often first have to solve some simpler problem, something with no obvious relationship, on the surface, to the final problem you're trying to solve. You're doing goal switching here: you solve some easier problems, or some other problems first, and hopefully you identify solutions you can transfer back to your original hard problem. For example, we have had the abacus for thousands of years. If people had just kept working on making the abacus better, you would probably have gotten a very long abacus with many more columns, maybe a very fancy abacus that people can operate. But the history of computation never took that path. People discovered electronics, originally for completely different applications, and from that, people invented computers, a fundamentally different technology from the abacus, to solve computing problems. This shows that the stepping stones to solving hard problems, to very advanced technology or capability, can come from something totally unexpected.
It's not the case that continually working on one problem will give you the final solution. Sometimes you have to do goal switching. Okay, so what is an open-ended algorithm? We've talked a lot about what machine learning can do: if you give it a problem, give it a challenge, it seems good at finding a solution. On the other hand, if we combine that with the capability of creating new challenges, we can hopefully build very powerful algorithms that do all of these things by themselves. Such an algorithm can start with some very easy challenges and solve them. The solutions to the easy challenges then become stepping stones, or give the algorithm hints, to create slightly more challenging problems and find solutions to those. And those in turn become new stepping stones for new challenges and new solutions. If we can create an algorithm that consistently generates new challenges of increasing difficulty and diversity, and at the same time finds solutions to them, we will have a very powerful machine learning algorithm. If you think this notion of an open-ended algorithm is too abstract, I want to name one such algorithm that already exists: natural evolution. Think about it: many billions of years ago, life on Earth was as humble as a single cell. But over the billions of years since, Earth has created tons of new challenges for life, and life, through evolution, found its way and grew into an explosion of forms, forming the whole tree of life. More importantly, it created us and our intelligence. So think of it this way: if we can create an open-ended algorithm that resembles what natural evolution does, it will probably take a lot of computation, but if we assume we can get that computation, we will probably create something very interesting. Will this algorithm expose endless creativity?
Will this endless creativity give us better AI? These are the very interesting questions that motivate our research on open-endedness. We formulate our open-ended algorithm in the context of reinforcement learning, so I want to spend the next two minutes on the basics of reinforcement learning; bear with me if you already know these concepts. Reinforcement learning is about solving a sequential decision problem. The agent here, the little robot, takes observations from the environment, the outside world. It gets some kind of state or observation from the environment, and based on that, it takes an action. The action goes back into the world and generates the next state, and at the same time the environment gives a reward signal to the agent. The agent's only hint about whether an action was right or wrong is this reward signal, and the agent's ultimate goal is to maximize the cumulative reward, over either an infinite or a finite horizon. For example, suppose this robot is navigating a maze, and the only reward it gets is how much closer it is to the target point. The robot sees its neighborhood, say one centimeter around it, so at every step it receives this surrounding information from the environment, then takes an action: go up, down, left, right, make a turn, or do nothing. Then it moves, and the world looks different to it because it has moved through the maze, and at the same time it gets a signal saying whether it is closer to the final target. How reinforcement learning actually works involves a lot of mathematics, but let's put it in a very simple context: suppose you are trying to play this game. Your agent is just a simple neural network, and the input to the neural network is the raw pixels.
The output of the neural network is how probable it is that you should move the paddle up, move it down, or stay put. A very simplified picture of how reinforcement learning works: you have a random agent playing in this game world, and you collect many sequences of states, actions, and rewards. Eventually you either get some reward or you lose the game; say red means you lose, green means you win. You record these sequences of actions, and then you tune the neural network weights, because this neural network decides which action to take. If you got a really good result at the end, you tune the weights so that all the actions you took along that sequence have a slightly higher probability next time. For the cases with a bad outcome, where you lose the game, you tune the weights so that all the actions along that chain have a slightly lower probability next time. Eventually, by trial and error, the agent hopefully learns to reach the green outcome more often and more reliably than the red one, as it tunes the weights of its neural network. Using reinforcement learning, people can solve robotics problems, for example this 2D bipedal walker moving through an obstacle course. People have also used reinforcement learning to solve even more challenging games, like OpenAI Five or StarCraft. In the real world, outside of games, people have used reinforcement learning for robotics; here are some results from Google Research and OpenAI using reinforcement learning to solve manipulation tasks and the Rubik's Cube problem.
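The trial-and-error loop just described is essentially a policy-gradient (REINFORCE-style) update: sample actions from the policy, then nudge the weights so that actions from winning episodes become more likely and actions from losing ones less likely. Here is a toy sketch of that update on a hypothetical one-step "game" where action 0 wins (+1) and action 1 loses (-1); real setups like Pong use a neural network over pixels instead of two raw logits, but the update has the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)            # the "weights" of a 2-action softmax policy
lr = 0.1                        # learning rate (illustration value)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for episode in range(200):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)          # act by sampling from the policy
    reward = 1.0 if a == 0 else -1.0    # the environment's verdict

    # Policy-gradient step: d/d(logits) of log pi(a) is (one_hot(a) - probs).
    grad_logp = -probs
    grad_logp[a] += 1.0
    # Scale by reward: reinforce actions from wins, suppress ones from losses.
    logits += lr * reward * grad_logp

print(softmax(logits))   # the winning action's probability is now high
```

Note how both branches push in the same direction here: a win makes action 0 more likely directly, and a loss makes action 1 less likely, which also favors action 0. Over many episodes the policy concentrates on the green-outcome behavior, exactly the intuition described above.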
So now let's get back to open-endedness with reinforcement learning. Why do we want to study open-ended reinforcement learning? Some people have said: we have deep learning and we have reinforcement learning; deep learning is really good at learning representations, and reinforcement learning is really good at making sequential decisions, so if we combine deep learning with reinforcement learning, we have an algorithm that can perceive the world well and then make good decisions, so we pretty much have AI. It's questionable whether this works, whether RL combined with deep learning really deserves to be called AI, and probably no one has the answer to what real AI is. But I would argue that we should at least add open-endedness into the equation, because if AI is to be even remotely similar to what humans can do, the algorithm has to have the capability to create new problems for itself, solve those new problems, build on those stepping stones, be creative in creating new challenges, and teach itself to solve them. We can't just keep handing tasks to an AI and expect it to become a genuine expert at everything. If you set your machine learning algorithm to do image classification, it can only become an expert at image classification; if it plays video games, it can only become good at playing video games. It's very hard for such an agent to grow the kind of intelligence that resembles a human's. So we believe open-endedness brings an important new ingredient into the equation. We also have practical reasons to study open-ended RL.
We think that if we can draw creativity out of these algorithms, they can help us create problems and solutions of increasing diversity and complexity to solve very challenging problems, and could help us identify corner cases for safety applications, where we can't always count on humans to identify them. And, as you can see here, we hope that an open-ended algorithm can build its own curriculum, just as we showed in the earlier experiment. With a curriculum, we have the hope of solving really challenging problems that cannot be solved directly, or even through a manually human-designed curriculum; we want the algorithm to create everything by itself, automatically. At Uber AI, we took a few steps in this direction: last year we published the POET algorithm, and this year the Enhanced POET algorithm, at leading machine learning conferences. We won a best paper award, and this year we are presenting at ICML. We are very happy with the outcomes of this research so far, and we have also gotten pretty good media coverage; people are really interested in this kind of algorithm and are starting to think about what it can do for them. I also want to take this opportunity to thank all my co-authors; without them this research could not have moved forward, and I really appreciate them. Now let's look at what this algorithm actually looks like. We released an algorithm called POET, so what does POET actually mean? It stands for Paired Open-Ended Trailblazer. The whole algorithm is about maintaining and growing a population of environment-agent pairs. Here, a challenge is an environment, and an agent tries to navigate, or get a good score, in that environment.
So a challenge in this context is an environment, and a solution in this context is an agent. We have challenges paired with solutions, and the whole algorithm is about maintaining and growing a population of these pairs. We start with one pair, then we create some new challenges and pair them with new agents, and later on we create more and more challenges, and solve them by pairing them with solutions. When an agent is paired with an environment, its only goal is to solve that environment. Here we adopt reinforcement learning, as I mentioned earlier; any off-the-shelf reinforcement learning algorithm can serve as the inner loop for an agent solving its environment. The outer algorithm is about generating new challenges. Now let's make it concrete. Our challenges, the environments, are these 2D bipedal-walker obstacle courses. We have an agent whose brain is a neural network, and it tries to navigate the obstacle course from left to right; a real physics simulator runs behind it, so at least in 2D everything obeys the physics. As for the environment, we encode it with a five-dimensional vector giving the ranges of stump height, gap width, and roughness; we can mix rough surfaces with gaps and stumps. To realize this encoding, you first initialize every entry to zero: that's a purely flat surface with no stumps, no gaps, and no roughness. When I say generating new environments, I mean mutating this encoding vector, that is, adding a little random noise to each entry. And so now you might get small stumps, or you might get a lot of high stumps, depending on the values you add.
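The mutation step just described can be sketched in a few lines. The exact layout of POET's five-dimensional vector is an assumption here; say [stump_height_min, stump_height_max, gap_width_min, gap_width_max, roughness], with all zeros meaning a perfectly flat course. Mutation adds small uniform noise to each entry, clipped so the ranges stay valid.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate(encoding, step=0.2):
    """Mutate an environment encoding by adding small random noise.

    `step` is an illustration value, not the paper's setting.
    """
    child = encoding + rng.uniform(-step, step, size=encoding.shape)
    child = np.clip(child, 0.0, None)      # no negative heights or widths
    child[1] = max(child[0], child[1])     # keep stump min <= max
    child[3] = max(child[2], child[3])     # keep gap min <= max
    return child

flat = np.zeros(5)         # the starting flat-surface environment
child = mutate(flat)
print(child)               # a slightly bumpy variant of flat ground
```

Repeatedly mutating children of children is what lets the obstacle courses drift from flat ground toward stumps, gaps, and rough terrain of increasing severity.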
Then you get a rough surface if you add some roughness, and you might get gaps if you add to the gap range. Overall you can also have a mix, because this encoding vector, once you fix the ranges and the roughness value, is handed to a generator with a random seed, which decides which sections are randomly stumps and what the height of each stump is, within the range you specified. Okay, so we want to grow the population of environment-agent pairs; this slide demonstrates how we actually do it. Starting from the existing environment-agent pairs, we take a few on which we are making good progress, and we mutate their environments as shown in the last slide. Now we have a bunch of new environments, and we pair each with its original, corresponding agent. Then we filter them by difficulty: given the current population of agents, we examine all these newly generated environments and get rid of those that are too easy, meaning they are clearly below the threshold of what the agents can already do, and those that are too hard, impossible to solve as of now. The seemingly impossible ones might become solvable later; we may rediscover them once the agents' capability has grown, and then we bring them back. But at this particular moment, when creating new tasks, we just discard the ones that are too hard or too easy. We also rank the newly generated environments by their novelty.
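The two filters applied to freshly mutated environments can be sketched as follows: a minimal criterion on difficulty (discard candidates the transferred agent finds too easy or, for now, too hard) and a novelty ranking (prefer candidates far, in L2 distance over the encoding vector, from environments already in the population). The score thresholds and the k for nearest neighbors are made-up illustration values, not the paper's settings.

```python
import numpy as np

def passes_minimal_criterion(score, low=50.0, high=300.0):
    """Keep only environments of intermediate difficulty.

    Too easy: the transferred agent already scores above `high`.
    Too hard: it scores below `low`.
    """
    return low <= score <= high

def novelty(candidate, archive, k=3):
    """Average L2 distance to the k nearest existing environment encodings."""
    dists = sorted(np.linalg.norm(candidate - e) for e in archive)
    return float(np.mean(dists[:min(k, len(dists))]))

# Hypothetical population and candidates: (encoding, transferred-agent score).
archive = [np.zeros(5), np.array([0.1, 0.3, 0.0, 0.0, 0.2])]
candidates = [
    (np.array([0.2, 0.5, 0.1, 0.2, 0.3]), 120.0),   # medium -> kept
    (np.array([0.0, 0.1, 0.0, 0.0, 0.1]), 310.0),   # too easy -> dropped
    (np.array([2.0, 3.0, 1.0, 2.0, 0.9]), 20.0),    # too hard -> dropped
]
kept = [(enc, novelty(enc, archive)) for enc, s in candidates
        if passes_minimal_criterion(s)]
kept.sort(key=lambda p: -p[1])      # most novel first
print(len(kept))   # only the medium-difficulty candidate survives
```

Filtering on difficulty keeps the frontier learnable, while sorting on novelty pushes the population to spread out rather than pile up on one kind of obstacle.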
Novelty is a very interesting concept. How novel is an environment? In one sentence, a novel environment is one that looks very different from the existing ones: for example, a very bumpy environment is very different from a flat surface, and an environment with a lot of gaps is very different from an environment with only stumps. To measure the novelty of an environment directly, we map all the environments into a common space, calculate distances, and look for the one with the farthest average distance from the existing environments. In this research we use the encoding vector, the five-dimensional vector I just showed, to calculate those distances. So we filter by difficulty, rank everything by novelty, and then pick the most novel ones and pair them with agents to add to the population. That is the key idea: when we randomly mutate the existing environments, we filter them by difficulty and rank them by novelty. We are looking for different, interesting new challenges; they are not necessarily very hard, because if the existing population already has a lot of challenging environments, we will simply look for something really new. Okay, so the whole system starts with very easy tasks. Say we have a flat surface and an agent that learns to walk on it, and then we start this process. It creates some simple challenges, maybe a few stumps or a mixed environment, and the agents prove capable of solving them. But if you just keep running the algorithm like this, you will find that it only handles the easy environments; the very challenging ones, with high stumps and wide gaps, never show up, because you are still missing one component, which I call goal switching.
I mentioned earlier that goal switching seems very important for identifying innovations, for humans and for natural evolution alike, so we introduce that mechanism into our algorithm as well. We use goal switching here by explicitly looking for stepping stones: in the algorithm, we proactively search for possible transfers of solutions from one environment to another. Basically, we double-check whether any agent paired with a different environment actually does better in a given target environment, and if it does, a clone of that agent replaces the existing paired agent. We do this because, as I mentioned before, stepping stones come from unexpected places, like the letter G becoming Jupiter. We also hope, for example, that an agent that learns to jump over a gap in one place develops a capability that helps it jump over a stump elsewhere. With all these pieces together, the following slides give you an overview of everything. You have an easy environment with an easy agent; then you create something medium-hard, transfer an agent, and get a new pair. You create something too hard, you back off; you create something medium-hard and solve it. All of these processes can go in parallel: you can go back, pick the easy one, generate another medium obstacle course, and transfer an agent to it, getting another medium-hard environment. Eventually you grow something really hard, and initially you don't have a good solution for it; however, the transferred solutions enable you to build a better medium-hard solution, which eventually becomes good enough to solve the very hard, challenging problem. So if you look at this very hard problem and its solution, and trace where this agent A6-prime comes from, you notice that this agent was actually trained along a different
implicit curriculum, an implicit path through other agents; those are the curricula this agent went through. In these experiments, we see agents that can navigate very challenging environments created by the algorithm, and things get interesting: we see mixed environments, and agents developing very agile behaviors to navigate all these challenges. Finally, look at an almost impossible environment; I call it the downstairs-heroes case. If you put your agent in this type of environment, it seems very hard to solve, but POET actually finds an agent capable of solving it. See how smoothly and nicely this POET-found agent navigates seemingly impossible challenges, like a kung fu master. You probably want to ask: what if I took some of these very challenging environments and tried to solve them from scratch, instead of inside an open-ended process? Can we solve those challenges from scratch? This shows that if you take some of the challenges POET can solve and try to solve them outside the POET process, then, because of the missing curriculum, your agent often gets stuck in a local optimum and is unable to solve them. We also tried taking a hard challenge and an easy challenge and manually creating a bunch of intermediate challenges that interpolate the difficulty between easy and hard, a linear curriculum created by hand, and then solving from the easiest one, one by one, toward the hardest. Can you always solve the most challenging environment that way? It turns out that some medium-hard ones can be solved with this linear, manually designed curriculum, but the very hard ones you just can't solve; we have empirical results in our paper showing this. The reason such a linear curriculum fails but POET can
solve these environments is that POET has goal switching, and goal switching creates a very powerful implicit curriculum for the agent; to solve the very hardest problems, you have to have this. I want to go a bit deeper into how goal switching can sometimes be very counter-intuitive. Say we have a flat surface and an agent training on it. You find that the agent never has any motivation to stand up, because it is stuck in a local optimum: it moves along with its knee on the ground, and 300 is a pretty good score, but it never gets a chance to stand up, because it never needs to stand up to move in this environment. Now you mutate the environment, creating a new one with a few stumps, and you transfer the agent to it. Of course, the agent crouching with its knee on the ground has trouble navigating, so its score will be low, because it moves much more slowly. Later, the agent in the stump environment learns to stand up, because it actually needs to stand up a little to get over the stumps. And at some moment, this agent actually beats the one crawling on the ground, so it transfers back to the flat environment, continues to optimize there, and arrives at a very efficient, upright way of navigating the flat environment. But if you skip this whole process and just let the original agent keep training for a very long time, you will still see it stuck in the local optimum; it never learns to stand up. I want to point out that this is a very counter-intuitive curriculum. People always think of a curriculum as: solve something easy, which enables you to solve something a little harder, which enables you to solve something very hard. But here, in the POET process, we identified many cases where we had to improve in a harder environment,
here in the stump environment, and then, by reusing those stepping stones, ultimately find the best solutions for the easier ones. It's very counter-intuitive. It also tells us that goal switching is really implicit, and people really cannot pre-compute those curriculum paths; the only way is to let the algorithm figure out the curriculum by itself. Okay, so we've talked a lot about generating new environments and solving them. You're probably not satisfied by an environment that only has stumps, gaps, and a little roughness; this all seems too simple, and you can imagine that you would quickly run out of variations. In our follow-up work to POET, Enhanced POET, we looked into how we can expose more open-ended discovery with these algorithms. We found that we really need an environmental encoding beyond the simple one, because the five-dimensional vector can only give you so much variation, and you quickly exhaust it. But what better way is there to create interesting, unexpected new environments? CPPNs. I mentioned CPPNs before: a CPPN can paint 2D pictures that completely surprise you. Here we use the same technology: a CPPN now paints the landscape, and because a CPPN is a neural network, and a neural network can give you an arbitrarily complicated 2D profile, you can create very interesting, very challenging environments far beyond what you can get from the simple five-dimensional encoding. But this creates a problem. Previously, we said we want to find novel environments, and we used the L2 distance between encodings to tell how far an environment is from its neighbors, to find the ones that are really different. Now we have a neural network that generates the environment; how can we
measure the distance between environments? Very briefly, we developed something called an environmental characterization that is no longer domain-dependent or encoding-dependent: it provides a way to measure the distance between environments completely independently of the domain. The general concept is this: when we get a new environment, we test an existing pool of agents in it, and we rank those agents by their performance in that new environment. The hypothesis, or intuition, is that if a new environment induces a completely different ranking of the existing agents, then it probably poses a qualitatively new and very different challenge. For example, say we have a flat surface and a bumpy surface. An agent that lifts its legs very high will rank very low on the flat surface, because it never needs to lift its legs that high and gets an energy penalty in its score; but in the bumpy environment that same agent ranks very high, because it really does need to raise its legs to get over those bumps. That is why we think a large rank difference indicates that two challenges are qualitatively different, and why it makes a very good measure of distance. So the technique we invented here captures the distinguishing nature of an environment purely from a performance ranking of agents, without any other domain-specific information; that is how measuring distance between environments becomes completely independent of the domain and of the encoding. I just want to quickly share some results. We were now able to generate all this diversity, where agents go through all these different landscapes, including some really big features that the agents are able to land on, and all kinds of interesting things. We were also able to generate something like a tree of life. I have flipped the
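A rough sketch of this ranking-based distance, in the spirit of Enhanced POET's characterization but with the exact normalization and tie-breaking assumed for illustration: each environment is characterized by how it ranks a shared pool of agents, and two environments are far apart when those rankings disagree.

```python
# Ranking-based environment distance (illustrative sketch).
# scores_a[i] and scores_b[i] are agent i's scores in environments
# A and B; only the induced rankings matter, never the raw scores,
# which is what makes the measure domain-independent.

def ranks(scores):
    # Rank agents by score (0 = best); ties broken by index for simplicity.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def env_distance(scores_a, scores_b):
    # Spearman-footrule-style disagreement between the two rankings,
    # normalized to roughly [0, 1].
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    return sum(abs(a - b) for a, b in zip(ra, rb)) / (n * n / 2)
```

So the flat and bumpy environments in the example above would score a large distance, because the high-stepping agent sits at opposite ends of the two rankings.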
tree, so starting from here, we connect each environment to its parent with a line, parent to children, and each of these little pictures is actually a landscape profile. As you can see, this new kind of family tree of environments really does, to some extent, resemble the tree of life: we see new, interesting, deep branches being invented that are also far away from each other and different from each other. OK, so to wrap up: we think the POET and Enhanced POET algorithms really unlock the endless creativity of machine learning and AI algorithms. They are a step toward truly open-ended algorithms, because in a single run they create problems and solutions of increasing diversity and complexity, and we have shown that they can discover capabilities that could not be learned any other way, such as by direct optimization or a manual curriculum. More importantly, in the Enhanced POET work we invented a domain-independent environmental characterization, which makes it easy to try POET in other domains. As for future work, we can hopefully apply POET in much more complicated environments in 2D or 3D, and we can also imagine real-world applications wherever there is a simulator. So feel free to explore uses of POET; to facilitate that, we have released our code publicly, and we welcome everyone to try it. That is all for my presentation, and thank you so much. I don't see any questions here, but feel free to ping me on the Slack channel; I will be happy to answer all of them, and to help you use our code to try POET on your own domain. Thank you.