So today we will talk about the paper introducing the algorithm MuZero. The title is "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model". It's a paper by Google DeepMind. I'm guessing that most of you know that DeepMind made a lot of splashy news because of its successes at these different games. It started with the Atari games initially, in 2012-2013, something like this. The Atari games are arcade games, and the AI was learning from the raw inputs of the screen, which was very impressive. It learned and eventually reached a superhuman level at some games and not at others; we'll discuss this later on. Then there were the breakthroughs in Go especially, AlphaGo in particular, with the famous match against Lee Sedol in 2016. Afterwards there was AlphaGo Zero: as opposed to AlphaGo, which started with a lot of data from human masters of the game of Go, AlphaGo Zero started from scratch. Then there was AlphaZero, which not only started from scratch but also had a general enough framework that it could play Go, chess and Shogi, three different but fairly similar strategy games. And now MuZero is the latest breakthrough from DeepMind, an algorithm that is able to play all of these games at once. Yeah, exactly. I thought it was actually not only four games. Yeah, actually it's more than four. And what's a bit surprising is that Shogi, Go and chess are very similar to each other, but they are very different from Atari games. The fact that the same algorithm can play well at Go, Shogi and so on, and at Atari games, is quite surprising. Yeah, I read this a few months ago, and I was quite surprised that you could manage this, because the models seem quite different. So that's the basic difference between so-called model-based reinforcement learning and model-free reinforcement learning.
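To make that distinction concrete, here is a small sketch. Nothing in it comes from the paper: the five-state chain environment, the tabular Q-learning agent (model-free) and the lookahead planner (model-based) are all made-up illustrations of the two approaches.

```python
import random

random.seed(0)
GAMMA = 0.9

def step(state, action):
    """Toy known environment: five states 0..4, reward 1 for reaching state 4."""
    next_state = max(0, min(4, state + action))
    return next_state, (1.0 if next_state == 4 else 0.0)

# Model-free: tabular Q-learning only ever sees sampled transitions;
# it never inspects the rules inside `step`.
Q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}
for _ in range(2000):
    s, a = random.randrange(5), random.choice((-1, 1))
    s2, r = step(s, a)
    Q[(s, a)] += 0.5 * (r + GAMMA * max(Q[(s2, b)] for b in (-1, 1)) - Q[(s, a)])

# Model-based: with the rules in hand, we can plan by searching ahead
# instead of learning values purely from experience.
def plan(state, depth=3):
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in (-1, 1):
        s2, r = step(state, a)
        value = r + GAMMA * plan(s2, depth - 1)[0]
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

# Both approaches agree that from state 3 you should move right, toward the reward.
```

The planner needs the model (it calls `step` during search); the Q-learner only needs logged experience. MuZero sits in between: it plans like the second approach, but over a model it has learned rather than one it was given.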
So in the case of Go, chess and Shogi, the classical models had a lot of structure, because the algorithm knew the rules of the game ahead of time. This is important because it allows you to construct the so-called search tree, which you can then explore using Monte Carlo methods. That is much harder to do in a model-free setting, where you don't have a prior model of the game. Yes, and model-free actually means algorithms that sometimes don't even try to learn a model of the environment. But MuZero is still a model-based approach; it just tries to learn the model of the world. The way it works is that the neural network used to predict the future reward and the policy (which move to make) also has an internal state whose role is to represent the environment, in such a way that, for each possible action taken, the network predicts how this internal state will change. It's not an exact representation of the future of the environment given certain actions; rather, it captures whatever information is most useful to correctly predict the future reward. As the MuZero algorithm is trained, this representation becomes more and more useful. After that, they still do Monte Carlo tree search, but applied to this hidden latent state of the neural network. Yeah, so maybe we can go back a little and explain the basics of reinforcement learning, since some viewers may be discovering these words. Reinforcement learning is very interesting because it's a very general framework: you have an algorithm, and the algorithm interacts with some environment by receiving inputs from it, which, when you're playing games, would typically be either the screen of the game or what the board of the game looks like.
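The interaction loop just described (observations in, actions out, with a reward arriving only occasionally) can be sketched like this. The toy environment, with its single win-or-lose reward at the end of an episode, is a made-up stand-in for a board game, not anything from the paper.

```python
import random

random.seed(1)

class SparseRewardEnv:
    """Placeholder environment: the agent observes a position each step and
    only receives a nonzero reward (+1 or -1) when the episode ends,
    the way a game of Go or chess only pays off at the final move."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation: here just a step counter

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        reward = random.choice((1.0, -1.0)) if done else 0.0
        return self.t, reward, done

env = SparseRewardEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice(("a", "b"))  # a real agent would plan here
    obs, reward, done = env.step(action)
    total_reward += reward
# Every reward before the last step was zero: that sparsity is exactly
# what makes these games hard for reinforcement learning.
```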
It's also going to be receiving, every now and then, a reward that can be plus one, zero, minus one, or other values. Typically, in the game of Go or chess or Shogi, this will be plus one if the algorithm wins the game and minus one if it loses, but what makes these games hard is that most of the time you don't receive any reward, and the same goes for Atari games. More generally, you have this interaction where the algorithm takes some actions (typically, in the game of Go, it would say "I'll put a stone on this position"), and it then receives observations of what the board looks like, and potentially rewards, at any given time. This is very interesting because it's a very general framework, in the sense that what we humans are doing is quite similar: we have all of these sensory inputs from our eyes, our nose and so on; at every instant we also have some reward, like sometimes we feel happy, sometimes sad, sometimes angry; and based on this we have some decision-making mechanism, mostly in our brain, that pushes us to undertake some action. What's interesting is that you can think of ourselves, even though we're not always doing it perfectly, as choosing actions so as to maximise some kind of reward we're going to get in our brains. It's the same for the algorithms: they try to find patterns in the way rewards are given, and typically, if they observe that whenever they do a certain thing they get a reward, they will learn that this kind of action is the kind that gives them rewards, so they will be more likely to repeat it. Yeah, and in reinforcement learning we can also expect algorithms to try to reach states in which there is more reward to gain. In chess, for example, there are states of the game that, even though they are far from a victory, are much closer to a victory than others, and this is what a reinforcement learning algorithm will learn. It's the same for us humans: even though a lot of the time we take decisions that make us happy in the moment (like eating right now is important for me, because I would feel good, it's time for lunch), a lot of the decisions I take are also pushed towards the future, and I'm doing this right now because I know it will lead to rewards in the future. This is the problem of reinforcement learning. Yeah, and so in the approach of MuZero, but very commonly in most of reinforcement learning, you have different key objects. One of them is the value function: if I'm in a given state, what is the expected reward I will get later on if I keep playing as I'm playing? That's the value function. This is critical in chess and other board games, because you don't get the reward immediately: whenever you play, you just try to imagine what the future states of the game will be like, but you cannot compute all the way to the final state. Yeah, so an action will have a high value either if it gives you a high reward right now, or if it brings you to a state of the world that will lead to higher rewards later. Yeah, and then there's another important object, which is the policy function. The policy function tells the agent what it's going to be doing at any given stage of the game. You might think at first that this is a bit silly, because the agent already knows what it's going to do; you could imagine this, but a lot of the problems have to do with approximation, because you can only know an approximation of what you're going to do next, partly because what you're going to do at the next stage of the game will depend on new computations you'll be doing at that stage. Maybe you
spend more computational power at this point and explore more branches of the tree, typically. So what you plan right now about what you're going to do later on is only an approximation of what you're actually going to do later on, which I think is interesting to think about, and it's also a key aspect of MuZero in particular, because MuZero is always trying to predict what it's going to do later on, and there's actually a loss function for this: at any given stage, if it predicts badly, it's going to change its parameters so that it makes better predictions later on. And then a third important object, in MuZero and in most of reinforcement learning, is the state, usually a vector representation, something like this. The state tries to capture everything that's relevant, at least in the case of MuZero, to do all the other computations. You can think of the state of MuZero as a compression of all the information that MuZero has received in the past, trying to describe what the game looks like. Yeah, and I think it's more than just a representation of the state of the world; it's also a model that describes how the world evolves when you take actions, because MuZero uses it to predict "if I follow this policy, what are my expected rewards in the future?" So it not only needs to know the state of the world right now, but what the state of the world will be given the sequences of actions it will take. Yeah, so somehow this state also encodes its future actions and how they will impact the world. And the key aspect of the learning part of all this is how you compute these different objects. Typically, the state is going to be computed based on the observations you make and the rewards you receive, so there's a function that transforms observations into a state of MuZero. This function is typically a neural network, and it's going to be parametrized by the weights of the neural network, and you want to learn the right parameters so that you get a good state representation. Then there's another function that takes the state and computes the action to be taken, for instance. Similarly, this function has parameters that you want to optimize; in fact you want to optimize everything, all the way through. So you have all these parameters that you're going to learn based on, typically, how close your predictions are to what you end up playing, how close your estimates of future rewards are to the rewards you actually received, and so on. It may seem a bit complicated, there are a lot of objects, but in the end it's just basic tools combined together. And I guess what's surprising is that it works so well. For a long time there seemed to be a gap, and it was often said that the game of Go, the game of chess or the game of Shogi are very specific settings for reinforcement learning, not the general reinforcement learning that you would want for, say, self-driving cars or recommender systems. But this paper shows that in more complex environments, like the Atari games, where you don't have a clear model of what the game is, you can actually make a lot of progress using this kind of learning. And I guess a question is: how far can you go along this way?
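The division of labour just described can be sketched roughly as follows. The shapes and the random-matrix "networks" are invented for illustration (the actual system uses trained deep networks), but the three roles match the discussion above: a representation function mapping observations to a hidden state, a dynamics function rolling that hidden state forward given an action, and a prediction function outputting a policy and a value from the hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HIDDEN_DIM, NUM_ACTIONS = 8, 4, 3

# Random matrices stand in for trained network weights.
W_repr = rng.normal(size=(HIDDEN_DIM, OBS_DIM))
W_dyn = rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM + NUM_ACTIONS))
W_policy = rng.normal(size=(NUM_ACTIONS, HIDDEN_DIM))
w_value = rng.normal(size=HIDDEN_DIM)

def represent(observation):
    """Turn a raw observation into a hidden state."""
    return np.tanh(W_repr @ observation)

def dynamics(state, action):
    """Predict the next hidden state after taking `action`; no game rules used."""
    one_hot = np.eye(NUM_ACTIONS)[action]
    return np.tanh(W_dyn @ np.concatenate([state, one_hot]))

def predict(state):
    """Output a policy (distribution over moves) and a value estimate."""
    logits = W_policy @ state
    policy = np.exp(logits) / np.exp(logits).sum()  # softmax
    return policy, float(w_value @ state)

# Planning unrolls entirely in the hidden state: encode the observation once,
# then imagine action sequences without ever consulting the real environment.
state = represent(rng.normal(size=OBS_DIM))
for action in (0, 2, 1):
    policy, value = predict(state)
    state = dynamics(state, action)
```

During training, all three functions are optimized jointly so that the values, policies and rewards predicted along these imagined rollouts match what is actually observed in real play.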
Yeah, so definitely the way this model is able to generalize to more types of games makes us wonder what else it can generalize to. Does it generalize to recommending videos on a recommender system, to get you to engage as much as possible with the platform? Yeah, but maybe before getting into this, we can discuss the limits of MuZero, of this reinforcement learning algorithm, because it performs at a superhuman level at many, many games, but, weirdly enough, it fails to reach human-level performance at some games. I don't remember all of them, but one of them is Montezuma's Revenge. Yeah, in Montezuma's Revenge, if I understand the game well, you need to navigate through a kind of maze, which also has a lot of traps to trick the player, dangers to avoid in the environment, and you need to go to one part of the maze, pick up a key, go to another part of the maze, open a door with the key you have collected, and find your way through this complex thing. I think I understood from our discussion yesterday that the reason MuZero does not learn to solve this game is that there is too much distance between picking up a key in one corner of the maze and going to the other corner to open the door. Before solving the game once and observing the first reward, it has too many possible sequences of moves, so before it gets any training signal from the game environment, it can't solve it. Yeah, I'm guessing that after a lot of iterations, maybe hundreds of thousands, even millions, it still has no positive reward. You can imagine an algorithm that never receives any reward: it just does not know what is good and what is bad. Well, I guess it knows that it shouldn't die, but it does not know how to progress in the game, and that may be one of the reasons why it performs so poorly. Yeah, and the reason humans are so good at this game is that when we see it for the first time, first of all we realize it's a kind of maze, and that we will need to keep in memory the structure of the environment we are moving in, and whether we will need to come back at some point. When we see a door, we keep in memory: "I remember there is a door here, I will need to come back at some point." And when we see something that has the shape of a key, we also very rapidly figure out that this is the key to open the door, and we can act on it right away. Somehow it comes from our common sense that we know how to solve this, and it's not something you can learn just by interacting with the game. When humans play this game, they know these things because they have been living in the world for a long time, and the game is in some ways an imitation of the world we live in. So yeah, we discussed that to solve this, algorithms would maybe need some module that provides a longer memory, like an LSTM, to remember key points of the environment that we will need to interact with in the future, but also common-sense things: if I see something shaped like a door, I need to look for something shaped like a key. And you mentioned, for example, that maybe something like what GPT-2 is learning could provide this sort of logic. Yeah, so on second thought, I'm not sure the memory part is necessarily that big of a problem, because the way MuZero is set up, its state can encode very long-term memory. I feel like the problem is more with not having a lot of rewards, if any; that can be very much of a problem. Whereas when a human plays the game, whenever they get a key they receive a kind of inner reward, something like this, and they will at least consider this an important step, much more beneficial, much more positive, than just running around and doing nothing on the screen. I would guess that if you have this sort of local input and
someone says "yeah, that's good", like a teacher that says "there's something good here" and gives you a bit of reward, then maybe MuZero would quickly get good at Montezuma's Revenge. And what's interesting is to think about why we humans, when we're playing these games, get a bit excited whenever we get a key: we feel like we've made some progress. All games are like this, and I guess our parents or our brothers or whoever surrounded us encouraged us whenever we did this kind of little thing. Also, there's a whole code in the way games are set up, a lot of conventions that video game players have learned. For instance, if you hit something and then the character blinks on the screen, it's probably not good. But how do you know this? Because in most games, that's what happens. And maybe if the MuZero algorithm had played other games with more immediate feedback, maybe if it had started with Mario or something like this, where the game is simpler, then the value function could learn that if you get this, the expected future reward is going to be slightly larger, because you got the key, or in Mario maybe a star or whatever. Then maybe it could use this in Montezuma's Revenge to say "I should search for keys" and feel excited about them. So, if collecting a key is encoded in the state of the game, in the environment model, isn't it already part of the model, such that it would realize that having this encoded leads to higher expected rewards? Yeah, but you would have learned this by playing another video game. I'm guessing that's what happens for humans: every video game we play makes us better at a lot of different video games. But we also discussed the fact that when we see a door and a key, we humans tend to think that maybe the key is useful to open the door, and this is common knowledge, common sense that we have, because we open doors every day; at least, I use my keys every day. So yeah, we have a lot of data about this, whereas MuZero had no data about this, so it was much harder for it to figure out. And maybe, if you want a reinforcement learning algorithm that's good and that quickly learns to play well at a lot of games, it will also need to have such common knowledge. It's even more important for YouTube: if it wants to recommend the most beneficial videos, it needs to understand what makes a video beneficial, what the signals are that it's not getting directly but could infer, the equivalent of getting a key. I don't know what a good user does while watching: maybe they comment, maybe they go on Wikipedia. I don't know if there's a way for YouTube to know that the user went to Wikipedia just afterwards; if you use Google Chrome, I guess so. So maybe this is a sign that there is something positive about the video, but this has to be learned in some way. And we talked about transformers, because transformers are interesting: they were invented by Google in 2017, something like this, and they are a model for natural language processing. Roughly, what a transformer tries to do when it reads a sentence is find the words that combine together to give meaning to other words. If you think about it, there are many homonyms in English, or in any language, and the way we understand the different words is by understanding the context in which they are used; that's sort of what transformers try to identify. And what's nice is that you can learn to identify better and better groupings of the different words in a sentence. Anyway, this is a very successful model for natural language processing, and we should probably do something about transformers at some point in this series. What's nice is that GPT-2 in particular, which was designed by OpenAI, was able
to then generate text: you can start a sentence, and there's actually a website called talktotransformer.com where you can write a first sentence and ask GPT-2 to complete it. What's very fascinating is that it does quite a good job. Not human-level, but it's quite good at completing things, and the way it completes the text reveals that it has some knowledge representation inside it. We actually did the test yesterday: we asked GPT-2 what a key is good for. It was funny, because the answer was about decryption keys being very good for decrypting, and cryptography, which is a bias, I guess, of the data. I guess GPT-2 was largely trained on text like Wikipedia, where keys are mostly related to cryptography, and very basic facts about keys, like "keys open doors", are not that commonly written down. It's interesting also because it shows that you probably need some prior knowledge to really understand Wikipedia: there's some information that we have that's missing, or hard to find, on Wikipedia, because it's so obvious that nobody writes it down. Yeah, this raises research questions for me, which I have thought about lately, but it's really not my field, so it would be very hard for me to make any contribution, and it's very experimental work; I'm more of a theorist. But it's about trying to understand what information is encoded in transformers. Can we use it? Is it safe to use? There are probably a lot of biases, a huge amount of biases, in transformers. How do you extract the information that is reliable out of a transformer, and can it be plugged into some reinforcement learning algorithm, for video recommendation for instance? As of today, I would say: just don't do it, don't plug it in. I didn't hear about any gender biases related to GPT-2, but for word2vec we right away heard about "doctor minus man plus woman equals nurse", and this was quite problematic in terms of gender bias. But I didn't hear that for GPT-2. I didn't hear it for GPT-2 either, but I didn't mean gender biases in particular, just biases in general: if you ask GPT-2 what a key is good for, the answer is biased; it's not the answer you would expect if you were talking to a human, at least, and maybe there are other problems that are more serious than this. And there are things like, for instance: suppose I ask you to answer, in one short sentence, the question "is nuclear energy good?", or "is nuclear energy safe?", or "should nuclear energy be used?". Whatever sentence you come up with, if it's very short, it's going to be very biased. Well, it depends, but because it's only one sentence, you have to give a very biased, or at least very simplified, version of all the complexity behind nuclear energy, and the way you do it can easily be biased. If I ask you "are vaccines 100% safe?", well, I'm not sure I would want to answer that question with one bit of information. Yeah, so that's why I think information in the form of language is very, very hard to make robustly beneficial, robustly reliable. I think it's an interesting challenge, and eventually, if you want algorithms to have a lot of good common-sense priors, I'd say at some point it's going to be important: if you want to make YouTube robustly beneficial, you need to understand a lot of the context that humans live in, I think based on natural language processing. That is, by the way, how we have a world model, at least academics like us: mostly by reading stuff, and maybe seeing a few graphs, but reading is by far, I think, the most informative medium. Maybe for complex things, but for simple common sense, I guess we learn it as babies, experimenting with the world. I'm sure you didn't learn that a key opens a door just by reading it, or even really by listening to someone; you learned it because you've used it. I guess. I don't know, it depends
which door: this door back here, sure, but cryptographic "doors", those I read about. I guess it depends on the kind of information. And for an algorithm, it's not clear what the best way is. Another way for algorithms to learn a lot about human common sense would be to watch videos, or look at images, but then you have the same problem: how do you infer reliable knowledge about the world from such data? I feel like it's a bit of an overwhelming question; I haven't really thought about it, but it sounds very, very hard. Yeah. So another thing we discussed yesterday is the distinction between reinforcement learning and supervised learning. Supervised learning is when, typically, you're trying to predict whether there's a cat in an image: you have a lot of images, some of them have cats and you're told that these are images of cats, others have no cats, and you use these labels to eventually learn what images of cats look like. By far most of machine learning these days is supervised learning, especially what's deployed. But this has led to a lot of debates, especially in the effective altruism movement, where there's a lot of discussion about very powerful algorithms, and it's typically framed in terms of reinforcement learning. There tends to be a sharp division between these two frameworks, reinforcement learning and supervised learning, and I guess one question is: do they pose the same problems in terms of safety and ethics? That would be an important question for us: when we're working on supervised learning, are we also working on the safety of supervised learning, and are we really making an important contribution to the future algorithms that will use reinforcement learning? I feel that for many problems, if we get them right for supervised learning, the answer is yes: there's a lot to be learned from supervised learning. And I think this paper is also interesting because it really shows this. In the end, the components used to do reinforcement learning, learning the policy, learning the value function, learning the predictions of future rewards, all of these functions are learned using supervised learning. They are learned using supervised learning, but the data used to train them has been generated using Monte Carlo tree search. Yeah, so it's not the entire algorithm, but these are key components of it, and if you can make supervised learning more robust, safer and so on, you're also making MuZero more robust and safer. Well, MuZero is not very dangerous, because it's only playing games, games not on the internet, so it's really safe. But if you transpose it to YouTube: I think the YouTube algorithm is doing a huge amount of supervised learning. Yeah, and about this, we discussed some time ago that we were actually quite worried that the YouTube algorithm might be doing reinforcement learning. It's worrying because the difference between reinforcement learning and plain supervised learning is that a reinforcement learning algorithm will try to reach states in which it gains more reward, and for YouTube, reaching a state in which it gets more reward means changing the user: transforming the user into one who is more addicted to YouTube and spends more time on YouTube. So it is quite worrisome that there could be this very powerful algorithm putting all of its energy into turning users into YouTube addicts. But at the end of that discussion we concluded that, at the moment, the algorithms mostly use supervised learning. Then again, from yesterday's discussion we also saw that there is not such a clear-cut distinction between supervised learning and reinforcement learning: for example, a supervised learning algorithm that measures its reward based on the next
two weeks of collected data, and is continuously retrained over time, would end up behaving very much as if a reinforcement learning algorithm were running behind it. Yeah, and as an anecdote: for my YouTube channel, I have an advisor from YouTube to help grow the channel, and I've been given a lot of advice to make it more popular. One thing I was told again and again is that I should make videos not only that people will click on and watch until the end, but that will make people stay on the platform longer, even come back the next day or the day after. Based on this, what I infer, with some significant probability, is that the YouTube algorithm is already doing this: trying to predict, at every point in time, how users are going to use YouTube over the next two days or so, and then using this data to improve its predictions. And even with only two days, I think you already observe some addictive behaviour, some addiction-enhancing strategies deployed by the algorithm. And this is only two days; I would guess they will try to expand the horizon more and more. Actually, Craig Boutilier, who works at Google AI, came to EPFL and gave a talk about a more reinforcement-learning-style algorithm for video recommendation, and he was talking about timeframes that were more in terms of weeks at least, even months. And the longer the planning horizon, and the more performant the planning, the more powerful the algorithm is in what it's doing, and the more urgent it becomes to make sure the algorithm is aligned, doing the kind of planning that we would actually want it to be doing. The thing stopping reinforcement learning for these recommender algorithms at the moment is that the space of actions to be taken is extremely large: there are millions of videos to recommend for each user. Also, the data you collect from users is very different from the data you collect from playing an Atari game, because an Atari game is extremely easy to predict. It's very predictable compared to users, who don't pay attention to everything they see on the screen, and even when they do see everything, they don't always click in the same place. Yeah, there's a lot of noise. You have these curves where you can see how much people click per recommended video, something like this, and these curves aren't clean over time; there can be a trend, but you have to remove the noise, which is very hard and makes the learning of the algorithm much slower. It's much harder to learn when you don't have clear signals. You can imagine doing your PhD with a supervisor who one day says "it's awesome" and the next day says "actually, no": it's better to have clear signals for learning, and that's not what the YouTube algorithm is receiving. Yeah. So how do you see the future of reinforcement learning? Yeah, so, we had this question yesterday of when Montezuma's Revenge will be beaten by algorithms, and, given the several ways to do it that we discussed, I expect that we will be quite surprised by when it comes, similarly to the way people were surprised that Go was beaten much sooner than experts were expecting. Yeah, in the history of reinforcement learning, and the history of AI as well, there have been winters and periods of excitement, and over the last decade there have been, quite inarguably, a lot of spectacular breakthroughs in reinforcement learning, especially AlphaGo. Nobody was expecting AlphaGo when it came out; people were predicting it would take decades to beat humans at the game of Go. So this is a reason to expect a lot of new breakthroughs. But on the other hand, I think there have not been that many real
world applications of reinforcement learning so far, so it's still questionable how fast we can go from these game-playing algorithms to things that are actually deployed. We also talked about self-driving cars in the previous podcast. The more I think about self-driving cars, the less excited I am about them, and I'm not sure I would be really excited about reinforcement learning being crucial for a self-driving car. I think, again, you can break down the problem, because I see reinforcement learning more as a framework than as an algorithm: it's not clear what "the" reinforcement learning algorithm is, but it's pretty clear what the reinforcement learning framework is, and arguably a self-driving car is in the reinforcement learning framework, as is the YouTube algorithm, as is each and every one of us. I can see algorithms that will learn and get better over time, and I really hope that for self-driving cars they will be deployed only once they are at their best, with the best method possible. I would not feel confident if you told me "out there, there are cars that will get better by the end of the year". Yeah, that's another discussion we had, about continuous learning. I guess what most people do these days is: they train the model, they run some tests, and if the algorithm seems good, it's deployed, but it stops learning once it's deployed. Well, there's one problem with this, which is the problem of evasion attacks: if the algorithm has a vulnerability, it can be exploited indefinitely once it's online, and people will definitely try. For any recommender system, there are people working as hard as they can to be recommended as much as possible; they're called YouTubers. If there is any possibility to somehow beat the algorithm and get recommended no matter what, people will do it. Yeah, so this approach of interrupted learning, I don't know what to call it, an algorithm that stops learning when it's deployed, it's not perfect, far from it, but I still feel it's safer, given what we know today, especially about poisoning attacks, than what I would call a continuous learning algorithm, even one that just keeps doing supervised learning with users' inputs, typically. I think that is actually very dangerous, because of poisoning attacks. So yeah, maybe continuous learning is a way to fight evasion attacks, but I'm not very keen on these algorithms being deployed. I think it's very important to keep in mind that algorithms that keep learning will change over time: they may have passed tests in the past, but this does not mean that in the future they would pass the same tests again if they were exposed to them. So yeah, I'm leaning towards a lot more safety; let's not rush the deployment of these algorithms. Okay, unfortunately it's not our choice to make right now. What I expect for the future: first of all, better techniques that will allow algorithms to beat more difficult games requiring some common sense, like Montezuma's Revenge. But I would also expect that, maybe within five to ten years, most recommender systems will get better at doing reinforcement learning, pushing even further the boundary of how far into the future the algorithm predicts its impact on the user in order to increase its reward. So yeah, I'm quite confident this will be deployed. I'm guessing algorithms are going to gain more and more of what the frontier is these days: a better world model and better planning capabilities. That's, I guess, a good thing, but it would be good if they were robust as well, and beneficial. Cool, so I hope you've enjoyed this video, and I hope to see you next time. Next time we're going to discuss a paper I wrote called "A Roadmap for Robust End-to-End Alignment", which I think is a good paper, but I'm a bit biased.