Yeah, hello. After a long day of lots and lots of technical talks, a breath of fresh air: no technical stuff here, okay? That's why I'll give you an intro about me first. I'm Naveen. I work at Bloomreach as a full stack dev, and I graduated from IIIT Bangalore. Fun fact: if you move to the banquet hall downstairs right now, my senior is presenting there, also from IIIT Bangalore. You can't get away from IIIT Bangalore for the next half an hour. I think everything else speaks for itself. Anyone who wants to tweet questions at me, now or later, love it or hate it, you can do it.

I see people walking in, so I'll just give you an overview of why you should be interested, if you should be. This is what we're going to be talking about over the next 15 or 20 minutes. We're going to be talking about Super Mario Bros., obviously, and saving the princess, obviously. We'll talk about reinforcement learning, and we had a really cool talk this morning, the first talk of the day. Show of hands: how many people were here for the talk this morning by Vikas? Awesome, that's really cool, because it means I don't have to go through a lot of the base setup that Vikas went through; I can jump straight into it. We'll talk about Q-learning, and we'll talk about deep Q-networks. This is going to be extremely light on math, and it's going to be a practical deep learning problem that we solve over the next 20 minutes.

Before I get into it, because there are a couple of hardcore researchers here: this is extremely, extremely derivative work. No original research here. Anyone expecting original research, please walk out now. Obviously, it's this paper. Anyone who talks about deep Q-networks will definitely point to this paper. If you have not read it, leave this talk and go read it. It's an amazing, amazing paper which laid the foundation for the work that's been happening over the last three years. The second talk today spoke about GANs, with 2016 and 2017 being the years of the GAN; people have just been talking about GANs throughout. In the same way, 2015 and 2016 were, to a large extent, the years of reinforcement learning. So there are lots of improvements over what I'm going to be talking about, and we'll touch on those as we go along as well.

So this is the first question that comes up when we're talking about deep Q-networks and saving princesses: what exactly is Q-learning? All I can say is that it's a type of reinforcement learning. People who attended the talk in the morning know what that is, right? Okay, maybe not, so I'll do a brief run through it. Reinforcement learning is basically how we grow up, right? When you're a kid and you get an A in an exam, what do you get? You get a reward. You get an F, and you get a timeout. Now, this seems like a very simple process: when you do something good, you get rewarded; when you do something bad, you get punished. But it's not always that simple. For example, you might get an A in an exam but then dirty your room, and you'd still get scolded. As a child, you might think you're being scolded because you got the A. That ends up producing extremely bad students, because you never really understood that getting an A is important.
The opposite also works: you might get an F but clean your room and get rewarded for it, and you end up in the same vicious cycle. This is called the problem of credit assignment: we take a sequence of actions and then try to assign the reward to the action that is actually responsible for it.

So what exactly is Q-learning? Q-learning answers the question that most of reinforcement learning answers: do we explore, or do we exploit? What does explore mean? Explore is simple: just take an action and see what happens. Exploit means go for a known reward. Reinforcement learning as a whole tries to answer this question, and there's always the question of when to explore, when to exploit, how much to explore, how much to exploit, and so on and so forth. There's a whole set of papers on this very question; a couple of links are listed below which answer it well. But we'll get directly to Q-learning.

The most important part of Q-learning is that it breaks this problem down into a function called the Q function, hence the name. It says:

Q(s, a) = r(s, a) + gamma * max over a' of Q(s', a')

To break that down into simple English: for a given state s, the Q value of an action a is the reward you get for performing that action in that state, plus some factor gamma times the maximum Q value over all actions a' that can be performed in the next state s'. So if I can move here, the value is the reward I get for coming here, plus the maximum reward of all the moves I can make from this point onwards.

Now, gamma is a very interesting thing. Say gamma is 0. In that case, all you get is Q(s, a) = r(s, a), which means I just go for the immediate reward at every point in time. If gamma is a larger number, the other term dominates and the algorithm becomes more far-sighted. I think this is what Vikas also spoke about this morning: myopic vision versus far-sighted vision. So it's all about playing around with this gamma to figure out when to explore, when to exploit, whether I should go for a reward right now or wait for a reward somewhere down the line.

So how would we actually do Q-learning? Q-learning started off long, long ago, as with most algorithms in this space, with simple tabulation, DP-style. In this case I have a simple graph where the start position is 2 and the goal is 5, and all I need to do is figure out a way to get from 2 to 5. From 2 I have just one possible path, which is to go to 3, so I try that. From 3 I have two possible paths. Say I go to 4; from 4 I can go to 0 or to 5. I try a lot of these variations, and all the while I maintain a table that stores a Q value which I update over iterations. Once I've explored enough, I end up with a table something like this, with the Q values listed over each of the graph edges you see here.

From 2 the max Q value is 64, so we go to 3. From 3 you can either go to 4, or go to 1, which is 80, so let's pick 1. From 1 you have two paths: back to 3, which is the 64 there, or on to 5, so you pick the max Q value again, 80, and you end up in your goal state. That's simply how Q-learning works.
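Just to make the tabular update concrete, here is a minimal sketch of that loop in Python. This is my illustration, not code from the talk: the reward table R and gamma = 0.8 are assumptions chosen to reproduce the slide's numbers (with a reward of 100 for reaching the goal, gamma 0.8 gives exactly the 80 and 64 above).

```python
import random

# Minimal tabular Q-learning sketch for the graph example (illustrative only,
# not the speaker's code). R[s][a] is the immediate reward for the edge s -> a,
# or None if there is no edge; in this graph, the action *is* the next state.
GAMMA = 0.8  # reproduces the 64/80 values on the slide if the goal reward is 100

def q_learn(R, goal, episodes=1000):
    n = len(R)
    Q = [[0.0] * n for _ in range(n)]  # Q[state][action], all zeros at first
    for _ in range(episodes):
        s = random.randrange(n)  # explore from a random start
        while s != goal:
            # pick any valid move at random (pure exploration)
            a = random.choice([a for a in range(n) if R[s][a] is not None])
            # Q(s, a) = r(s, a) + gamma * max over a' of Q(s', a')
            next_qs = [Q[a][a2] for a2 in range(n) if R[a][a2] is not None]
            Q[s][a] = R[s][a] + GAMMA * (max(next_qs) if next_qs else 0.0)
            s = a  # move to the next state
    return Q
```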
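As a rough PyTorch sketch of what that loss looks like for one batch (my illustration; q_net is a hypothetical network mapping a stack of frames to one Q value per action, and the original code the talk reuses is actually in Lua Torch):

```python
import torch
import torch.nn.functional as F

# Sketch of the DQN loss for one sampled batch (illustrative, not the talk's
# code). `q_net` is assumed to map a stack of frames to one Q value per action.
def dqn_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # left-hand side: Q(s, a) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # right-hand side: r + gamma * max over a' of Q(s', a'), held fixed
    with torch.no_grad():
        max_next_q = q_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1 - dones)  # no future term at episode end
    # mean squared error between the two sides, then minimize
    return F.mse_loss(q_sa, target)
```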
For obvious reasons, this is not very efficient: if you have many states and many actions, that table just keeps getting huger and huger. In come DeepMind and the papers of the last few years, which spoke about how the Q function itself can be a neural network instead of a tabular method like this. This is the definition they use: instead of the tabular Q function, have a neural network that approximates the Q value. And because it's modeled this way, I can simply look at it as a loss function: the term on the left has to approximate the term on the right, so take one over to the other side and use mean squared error, or some other simple error function, as the loss:

L = ( r + gamma * max over a' of Q(s', a') - Q(s, a) )^2

and then minimize it. That is basically what deep Q-networks do: they take the simple Q-learning paradigm, add a neural network into the mixture to approximate the Q function, and carry on from there.
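Next up, the interesting part of the talk: Super Mario Bros. How many people here have played Super Mario? In case you haven't, I'm super amazed. Super Mario came out in 1985, which I still find amazing because it makes me feel super old, and it basically took the world by storm. The reason this game is very interesting to me personally is that I've played it literally since I was, I think, four years old, and saving the princess was my life's motto. From a computer science perspective, the reason I find it interesting is that the game space, even though it's constrained, is pretty large, and you have a lot of delayed rewards; I'll talk about this a little more later. You also have multiple game characters, and there's a lot going on on the screen. When the deep Q-network paper came out, the main games they played were Atari titles like Pong. But what I realized is that Mario is similar in the sense that it's constrained, yet a much more open world. Hence I thought: why not use this algorithm to play Super Mario Bros.?

First things first, we need to frame Super Mario Bros. as a deep Q-network problem. How do we do that? For obvious reasons, Mario is our agent, the one performing the actions. An action is one of the possible actions you can take in the game: left, right, up, down, shoot, and so on. The state is the current state of the game; for this particular algorithm, the state is actually n consecutive frames of the game taken together. The reason for this is very simple. If you look at Mario right now, what do you think the action should be? Should he jump, or should he just stand there? Anyone? It's not really that obvious that he should jump, because what if both the Goombas are moving in the other direction? With a single frame you cannot really tell whether the Goombas are coming towards you or away from you.

So, because you want to capture that temporal nature, we use three to four separate frames; I've used four for Mario. The score is the reward we want to maximize, and the end goal is a Q function that approximates it. The environment is, obviously, the Mario world the agent interacts with. Just to make it clear once more: when we play this game, we do not tell Mario anything. We do not tell him that if you touch a Goomba you die. We do not tell him that if you fall in a hole you die. We do not tell him that if you take a mushroom you get a power-up. We do not tell him anything at all. It's over time that Mario himself learns to play the game.

This is the neural network I used. It's very, very similar to the one in the original paper that came out in 2013 and was published in Nature in 2015. The input is four 84-by-84 images, grayscaled (which you can't see there), which then go through a stack of convolutional layers. It's just a slightly larger network compared to the original paper, to account for the more complex nature of the Mario game. Then we have a fully connected layer with 12 possible outputs. Why 12 possible outputs, when there's only left, right, up, down? There's also the B button, which makes you run faster: you can hold B and go right or left, and run to jump higher as well. So those combinations are taken care of too, and obviously there's a no-op, which is do nothing. All of those together come to 12 outputs.

So what exactly did we do to build this? First off, maximum reuse: the main purpose of this talk is to let people who have not really dabbled much in neural networks know that it's super easy to get started. The base I used was the DeepMind DQN code. It's written in Torch, and it's pretty nicely written; you can actually read through it to some extent. There's a talk coming up on PyTorch as well, so I'm looking forward to that, to see if I can port it over. So a lot of stuff has been used directly from their code.

Next, handling custom bindings and combo moves. Because Nintendo has its own combo moves and bindings, like I mentioned previously, we made sure the different combinations are separate outputs. We don't have B as one output and left as another; we have B-plus-left as an output, B-plus-right as an output, left as an output, right as an output, and so on. At any one point we get exactly one output, and that will be left, B-plus-right, up, down, and so forth.

Then, experience replay. Experience replay is kind of the core of the deep Q-network algorithm. What it does is try to break the temporal nature of experiences. If I go forward at one point and then jump in the next move, we want to make sure that the jump is not tied to the going forward; we don't want every jump to become "go forward and jump" unless that's actually required. So experience replay stores tuples of (state, action, reward, next state) in memory, and then randomly samples from those experiences when it trains.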
Then we have experimentation, and this is where a lot of the time went. The games in the paper were played on the Arcade Learning Environment, which is a very nice, fully fledged environment for playing Atari games, with everything tuned for them. With Mario it ended up being a little more complex, because first of all, finding a Nintendo emulator that works reasonably well is itself difficult. I used ZSNES at first, then shifted to FCEUX and the like, and along with that spent time figuring out what exactly works and what does not.

And finally, we train our model. Training was super simple: I had the lamest machine out there, a g2 AWS instance, which I used for training. It trained for, I think, around a week, with me checking in during the initial days to see what was going on. Again, there are a couple of links down there for everything, and I think we'll head to the demo. You can sing along to the Mario track if you want to.

So this is actually, I think, a little sped up. You see Mario, and again, just to stress: Mario has never been told what the environment is. I like the cheeky thing it did there, where it took a couple of steps and then jumped. It kills a couple of Goombas, and it finishes the level as well. Now it gets into the underworld level, which, for people who have played, I'm pretty sure it gets stuck in for a while, and I'll talk a little more about why I feel this happens. We'll just give it ten more seconds and hopefully it works... and that's it. So it managed to get through the first level, but could not really do too well in the second one.

Okay, so let me come to the hurdles that I faced, and a couple of the problems we saw in the video as well.

The first one: delayed rewards everywhere. I think this is something human beings intuitively understand very clearly. When you play Mario, what's the first thing you do when you start? You take the mushroom right at the start, right? But why do we take that mushroom? Because we realize that down the line, if you get hit somewhere, not having that mushroom ends the game, whereas having it allows you to continue. But that payoff can come much, much later in the game. Because of that delayed reward, it's very difficult to get the credit assignment right.

Next, Mario refuses to move. This is a bit of cheating that I did. If you notice when you're playing the game, initially, even if Mario moves, the score does not change at all; it takes a certain amount of movement for the score to increase or decrease. That's why there was a slight bias I added right at the start to get it to move to the right. Over time that bias reduces, but there was a slight bias to move right at the start.

The second bit was tracking progress. Tracking progress in a game like this is difficult, because with the explore-and-exploit nature of this entire thing, you know it's going to be zigzagging up and down. There were times when, after a day of training, I checked and it had made certain progress, but after three days it seemed to be at the same level, just slightly better, and then three hours more and it was suddenly performing better. So I kept track of the average Q value it was generating, and that actually helped quite a bit.

Finally, to answer the question that came up in the previous demo as well: is this overfitting? I think to some extent, yes, it is. Overworld training does not generalize too well to the underworld, and most of this training happened on an overworld level. When it plays on another overworld level, it does much, much better; it doesn't just go and die at the first or an early opportunity it gets. So overworld to overworld is much better, but overworld training to underworld is pretty bad.

Talking about associated work: there's MarI/O, which is Super Mario Bros. using genetic algorithms, a nice approach and kind of an inspiration for this, by a guy named SethBling. There's Deep Learning Flappy Bird; Flappy Bird is a way simpler game for reinforcement learning, because it's just up, down, up, down, and it can actually play pretty well. CNN 2048 was playing 2048 with a CNN. And just at lunch, someone told me that somebody has actually done Super Mario as well, with a similar approach to mine, so I'll have to check that out; I've not added it here.

Let me talk about further work on this. Since the DQN paper came out, multiple research papers have been published on the same topic, and these are a couple of the improvements they've listed. The first is the double DQN, where you have one network choosing the action while the other generates the target Q value for it. Then there's the dueling DQN, where we break the Q function into two values that are joined together at the last moment: the advantage and the value. The advantage is the advantage of taking an action over all other actions, and the value is the value of the state you're in.

And finally, prioritized experience replay. When I spoke about experience replay, I said we store the (state, action, reward, next state) tuples in memory and then pick from them at random. But we know for a fact, from human behavior, that there are certain experiences we learn way more from than others. Prioritized experience replay does exactly that: it ranks experiences in a certain order and then picks from the top, instead of picking from a random place. As such, it trains much better and gives better results as well.
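A minimal sketch of such a buffer (again my illustration in Python, not the Torch code the talk actually reuses):

```python
import random
from collections import deque

# Minimal experience replay buffer (illustrative sketch). Sampling uniformly
# at random breaks the temporal correlation between consecutive experiences
# described above.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall off

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```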
And with that, I'm done with my talk. Thank you so much for your time.

Thank you very much, that's probably one of the most entertaining talks we've had here. It is time for a break: coffee, tea, and feedback forms, please. We'll be back here at 4.30 for our last talk of the day.