Welcome to this NPTEL-IBM workshop. We have a series of very interesting events planned today. We are kicking things off with a high-level introduction to deep reinforcement learning from one of the experts in the area, Professor Ravindran from the Computer Science department at IIT Madras. He has been working in reinforcement learning for about 20 years now. He did his PhD at the University of Massachusetts Amherst, where Andrew Barto, one of the inventors of this area, was his advisor. So he knows the field from the source. He is going to give what he calls a gentle introduction, a very, very gentle introduction, to deep reinforcement learning, and it should be an enjoyable experience for us. After that, Mr Vishal Sakhal from IBM Research Labs in Bangalore, who is on his way here, will talk about the main topic of this workshop once he arrives. That is the plan for the day ahead of you. Once again, a very warm welcome to all of you; I hope you enjoy the whole experience. I think there are quite a few IBM folks here, and also a few students from outside IBM. Can you show by hands how many of you are from outside IBM? And how many of you are from IBM? Not bad, a really decent mix. I'm not sure what kind of classroom you are used to, silent or not, but I can assure you that Professor Ravindran will be very happy with any interruptions or questions; he is quite used to interactive classes. So please do stop him if you have any questions, or you can wait until the end and we will see what time allows. Okay, thanks a lot once again, and over to Professor Ravindran.

Thank you. Good morning. As I was saying, it's a long weekend, and I'm happy that some of you have still turned up. I'm going to talk about deep reinforcement learning, which is one of the big reasons for the current excitement around AI. And I'll keep this talk very, very light; that's why you will see mostly pictures and videos. I'm not going to have a single equation on any of these slides. If you have watched my NPTEL videos, you know the math is not that difficult, but I'm going to keep this as light as possible. If you have any questions, any points you want to raise, just put up your hands, interrupt me, and ask. I have a lot of slides and I probably won't finish in an hour, and if you wait for me to finish before asking, there might not be enough time; the moderator will say, please take all the questions offline. So if you have questions, just ask me right away.

So there's a lot of excitement about AI, right? Let me read the one at the top: "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself, and be conscious of its existence." There is a lot of excitement about AI and all of this, right? In fact, these articles are from 1958. They are not from now, but they sound like they could have appeared in the papers today. So why is all this happening again? AI goes back a long way. In 1950, Alan Turing asked the question: can machines think? And he proposed what is now called the Turing test, a test of intelligence.
Then in 1956 came one of the first official uses of the term artificial intelligence: the Dartmouth AI project. It started in 1956, people declared artificial intelligence a new field of study, and a lot of work happened then. Look at the timeline: 1950 was the Turing test, 1956 is when AI was born as a field. Then you get the first industrial robot. Then you have a chatbot called ELIZA. Look at the dates here. People talk about robotics and chatbots and all those things now, but we already went through this cycle once. And then Shakey was billed as the first electronic person; it became a celebrity, much like the humanoid robots that have been going around giving talks at various events recently. The same kind of thing was happening back in the 1960s: Shakey came out of the SRI lab. And then what happened? There were many false starts, many dead ends. Around the same time that Shakey came on the scene, people gave up on automatic machine translation, the first attempt at machine translation using AI. They said it is really not possible for an AI to truly understand language and capture all of its nuances. Then in the 1970s, people gave up on the first generation of what were called neural networks; they said, oh, this is not going to work. All of this culminated in 1988 with the US government saying it would not fund any more research in anything called AI. It came to that point. Unimaginable now, when everybody is falling over themselves to fund anything that has to do with AI. So that started the first of many AI winters, and it had severe, lasting effects. Look at that statement from 2006: some believe the word robotics actually carries a stigma that hurts a company's chances at funding. And here all of you are, attending a workshop on AI and robotics; so clearly things have changed in the recent past.

So why all this excitement about AI again? We saw a very similar hype cycle in the 60s and 70s, and then things plummeted; now it seems to be going through a hype cycle again. What caused this? One phrase: deep learning. Deep learning has been around in some form or the other since the 80s, but we have gotten things to work much better now. The first big-ticket application of deep learning was this kind of image categorization: I give you different pictures and you have to figure out what is in each picture. Is it an airplane, an automobile, a bird, a cat? I give you a picture, you have to assign a label to it. And you can see they are getting to very, very low error rates. A 3.6% error means that out of 100 pictures, they make mistakes on about three and a half pictures on average. Humans make mistakes too, to start with; that is why all those visual illusions are so popular, because human vision can be fooled. In fact, deep learning has managed to achieve what you would call superhuman performance: it can go head to head against humans, and these days it can even be better than humans at this kind of visual categorization. So that is one of the reasons for a lot of the hoopla. So what really changed since 2006?
There are three things that I would say happened. First, there were new methods and new algorithms. We got a better understanding of what was causing things to fail earlier. People had thought there was one very specific reason: essentially, we were trying to train too many parameters, too many moving parts in the neural network, and we were not able to train them appropriately. It turns out that was not really the reason; there were several other things causing the failures. Once that understanding came about, we were able to come up with better algorithms that are more stable and can train millions of parameters. The second thing, of course, is more data. In the modern day everything is digitized, so you can gather data at a much, much greater pace than you could earlier, and that also helps you get better and better solutions. And the last thing, hand in hand with the development of algorithms and the availability of data, is better machines. You had faster machines that let you run your algorithms much, much faster than was possible even a decade or two ago.

And so there are a lot of success stories; I'm pretty sure we will hear some from IBM. There are a lot of human-like tasks that AI is now able to perform: speech recognition, object recognition, face recognition, machine translation, understanding text, and so on. And these are not just things done in academic labs; they are now developed by companies, and there are products out there that use all of these. IBM is at the forefront of many of these areas, so you will hear about the IBM side of things.

This is largely the work of a few individuals. This is Geoff Hinton, who was at the University of Toronto and is now at Google Brain. Then Yoshua Bengio, who is at the University of Montreal and has stayed there as they build a huge AI ecosystem around the university; it's an amazing story. Then Yann LeCun, who was at NYU but now also heads Facebook AI Research. And of course I have to add another person: Jürgen Schmidhuber, in Switzerland, who has also contributed significantly to this revival of neural networks. In fact, it's amazing that some of the modern architectures we use in neural networks have been around since the 80s; we just figured out how to use them correctly now.

So you can do amazing things with neural networks. Here is one example. Look at that: somebody gives you a picture and a style, and a neural network is able to redraw the picture in that style. I give you a sketch, and the neural network fills it in, and oh my God, that looks like a Monet. So it can do impressive things, because it is basically trained on impressive things. Given a picture like this, it can give you that, or it can do this kind of colorization. So a whole bunch of things that you would normally associate with creative activity. And there are other things. Look at that: this is a caption generated by an automatic system. "A person riding a motorcycle on a dirt road." A very clear description.
Of course, it still makes some errors. "There's a little girl in a pink hat and she's blowing bubbles," according to the caption, anyway. So it can do amazing things. Here is a picture of how AI agents have been doing at automatic speech recognition. You can see that around 2010 the system understands only about 60% of what you say. You would probably think it is from a different generation, like you talking to your parents, or parents talking to their children: only 60% of the things get understood. But then look: once deep learning got on board, we reached something like 95% or even better. That blue line up there is human accuracy, so right now we are able to match or even beat human accuracy. And this has given rise to all kinds of personal assistants. No more robotic voices: the systems can understand speech well and can also generate very natural-sounding speech. And apparently, I was not aware of this until yesterday when we started putting these slides together, the fraction of voice searches is going up significantly. I don't really trust my phone yet to do voice search, but apparently it is good: something like 50 billion voice searches are happening now. That's the level of confidence people have, and perhaps the next generation will grow up with the assurance that machines will understand them all the time.

But let's keep moving. More recently, there has been a lot of excitement about an agent called AlphaGo. How many of you have heard of AlphaGo? Okay, a good fraction. So let me tell you a little bit about the game of Go. Go is actually a pretty ancient game; people have been playing it for centuries, and it has very, very simple rules. The tagline for Go is "a minute to learn and a lifetime to master," because that's how simple the rules are, but as gameplay progresses you see a lot of different patterns evolve. It has been considered one of the final frontiers of AI: if you can build an agent that can play Go, then you can say you are making progress in AI. People had been trying for a long time, and until the mid-2000s there was no real success, even in getting an okay Go player. Then some advances allowed people to get a decent Go player, not necessarily one competitive with humans, but at least one you could train against when beginning to learn the game. And then, in 2015, the company DeepMind came up with this agent called AlphaGo, and AlphaGo started playing really well. Initially it beat the European champion 5-0. And then they said, okay, let us go up against one of the best-known players of Go. Go is one of those games with an almost religious following; there are actual records of gameplay going back decades and centuries. And Lee Sedol, the player they chose, is considered one of the strongest players of Go ever, supposed to be one of those genius-level types. So they decided to go up against him. So how popular is Go?
When AlphaGo had its match against Lee Sedol in Seoul, they were actually live-casting the games at street corners, putting up huge LCD screens, and people would stand there and watch. That's how popular the game of Go is. And the initial press was all on the lines of "we hope we don't make fools of ourselves," "we hope AlphaGo can hold its own against Lee Sedol," and so on. By the time the fourth game came around, people were actually surprised that Lee Sedol managed to win it, because AlphaGo had played so strongly in the first three games that people had given up hope of Lee Sedol beating it at all. But he did win the fourth game. And then they promptly went back, did some further training, came back, and defeated him in the fifth game. So AlphaGo ended up winning 4-1. And there was a lot of buzz about it. There was one person covering the match, writing commentary and analyzing the games afterwards, who said that for the first time in his life he felt he was watching an alien intelligence in action, because he could not make any sense of the moves while the game was progressing; only later, when he went back and analyzed them, could he see what was happening. That's the kind of impact AlphaGo had. Of course, one unfortunate side effect of all this is that it also feeds the hype cycle, with more and more people jumping on board. And to be fair to Lee Sedol, I believe he is the last human being to have actually defeated AlphaGo at a game of Go. Since then, in all the human match-ups they have had, AlphaGo has been winning every game. Obviously, if you took the current version of AlphaGo and played it against Lee Sedol, I don't know what would happen; it would probably be 5-0 as well.

Close behind that, there is this strategy game called Defense of the Ancients. It is a lot more complicated, in terms of understanding the rules, than Go itself: there are a lot of different components, you have to do long-term strategizing, things interact in very different ways, and so on. The dynamics are much more complex than a board game. But OpenAI, which is an organization funded by Elon Musk, built an agent that managed to beat humans in tournament play and win the tournament. Of course, this was a simpler version than the full Defense of the Ancients game, because in the full game you can form teams and coalitions and fight together; they made it a single-player version. Still, it is a non-trivial achievement. So all of this caused a lot of excitement about which boundaries you can push.

I'll tell you why these two applications were different from all the things we saw earlier; that's what this talk is about. What is it about them that caused so much excitement? People here are familiar with machine learning; many of you have suffered through my classes or have seen some version of them on NPTEL, so you have some familiarity with machine learning. I'll talk about that in a minute. But this talk is all about learning to control.
The familiar models of machine learning are all about learning from data. What do I mean by that? I give you some past data, and you have to learn some function from the input to the output. The past data could be credit card transactions and whether they were legitimate or not, or customer data and whether the customers bought equipment from you or not. This kind of training data is given to you, and your goal is to learn to predict on unseen data. If a new customer comes in, can you predict whether they will buy or not? If a new credit card transaction is posted, can you tell me whether it is fraudulent or not? So the goal is: given this past data, which we call training data, figure out how to make predictions on unseen data. Essentially, you want to uncover whatever patterns exist in the data and use them to make these predictions.

Here is an example. I have an email and I want to classify it as spam or not spam. This was considered a reasonably hard problem about 15 years back, but machine learning has pretty much solved it. You rarely get spam landing in your inbox now; if anything, it errs on the other side, and some legitimate mail goes into the spam folder, but mostly it's fine. So how do you build such a system? All I want is a model that is going to tell me spam or not spam. What I do is take a bunch of my own emails that have already been labeled spam or not spam. In the older days, you would have moved your emails into spam folders yourself, marking this as spam and that as not spam and so on; you were essentially producing this kind of training data when you did that. So you get this labeled data, feed it into a learning algorithm, and it returns a model. What does the learning algorithm do? It tries to find a mapping that takes the input, which we call x, and gives you a plus one or a minus one depending on whether it is spam or not spam. And the goal is to find a mapping such that the fraction of examples on which the model makes a mistake is very small. I'm not going to get into the mechanics of how you train it, but that is the goal. The crucial thing here is that I'm given training data which consists of what we call supervision, or instruction. Very detailed supervision: this is a mail, and it is spam; this is a mail, and it is not spam. For every input you are going to see, I tell you what the output should be. We call this detailed supervision. This is what I mean by learning from data: you already have this past data with all the labels, and you learn from that.
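To make that framing concrete, here is a minimal sketch in Python of the spam, not-spam setup: find a mapping from an email x to plus one or minus one that makes few mistakes. The four toy emails, the bag-of-words features, and the choice of scikit-learn's logistic regression are all illustrative assumptions of mine, not something shown in the talk.

```python
# Minimal sketch of the supervised spam / not-spam framing.
# The tiny "dataset" below is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

emails = [
    "win a free prize now",           # spam
    "meeting agenda for monday",      # not spam
    "claim your free lottery prize",  # spam
    "lunch at noon tomorrow?",        # not spam
]
labels = [1, -1, 1, -1]               # +1 = spam, -1 = not spam

# Turn each email x into a feature vector, then learn a mapping
# f(x) -> {+1, -1} that makes few mistakes on the training data.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = LogisticRegression().fit(X, labels)

# The whole point: predict on unseen data.
print(model.predict(vectorizer.transform(["claim a free prize"])))  # [1]
```

The detailed supervision is the labels list: for every single training input, the learner is told exactly what the output should be.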
So now let me ask you this question: how did you learn to cycle? Was it from past data? I can show you videos on YouTube, say 3 million videos of people cycling, and maybe another 15 million videos of people falling off cycles and getting hurt in all kinds of funny ways. After watching all these videos, can you get on a cycle and ride? No. You need to do some kind of trial and error; no amount of labeled data is going to help you. Or go back to how we did the spam, not-spam classification. The actual labeling for cycling would have to be of the following type: my cycle is tilted at 27 degrees to the vertical and I'm moving forward at so many metres per second, so with what pressure should I push down on my right pedal? That is the kind of supervision you would need if you wanted to pose cycling in exactly that framework, and that is not the kind of supervision you are going to get. Usually, when I ask whether you used supervised learning for learning to cycle, there are a few people in the audience who say, yes, I was supervised by my parent, or by my uncle, and so on. But that is not really supervision. It is supervision in the English sense, but not in the machine learning sense. What could the uncle or parent actually do? They probably clapped when you cycled properly, or when you tilted too much they kept you from falling down, something like that. So what you were getting was evaluation. You do something, and somebody tells you: hey, that was good, or that was bad. Nobody tells you what to do. You have to try things, do this kind of trial and error, and figure out what to do. Of course you get feedback: falling down hurts, or somebody claps and it feels good. You get this kind of feedback and you learn from it.

And it's not just cycling; it's a lot more. I chose the example of cycling because it's more likely that people remember it; they don't remember how they learned to walk, unless they have babies and watch them walk. Babies do all kinds of trial and error. Nobody teaches them how to walk, just as nobody taught you how to talk: you produce a lot of vocalizations, and then somebody says, yeah, good, good, good. Remember, if it were supervised learning, I would have to tell you how to move your vocal tract and where to keep your diaphragm, things like that. Nobody does that. You try different things, you finally produce an output, and somebody says, okay, yeah, that's good. You're not getting instructions; you're getting evaluations.

If you go back and think about why people got excited about AlphaGo, it is because it was trained in exactly this manner. Nobody gave it detailed instructions. It plays games, it wins, it loses, and it learns through this kind of trial-and-error process. Same thing with the Dota agent, and there are a lot of other applications; I'll talk about those as we go along. So the goal of reinforcement learning is to do this trial-and-error learning. One way of thinking about reinforcement learning is as a mathematical formalism for this kind of trial-and-error learning. You have rewards and punishments, and you learn to maximize some notion of long-term performance: the rewards you accumulate over time are what you want to optimize. And the key is that you learn about a system through interaction with the system itself. In code, the most basic version of that interaction loop looks something like the sketch below.
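This is a toy of mine, not anything from the talk: a two-armed bandit "environment" and an epsilon-greedy agent, with made-up payoff probabilities and a 10% exploration rate.

```python
import random

# The basic RL loop: act, get evaluated, adapt.

def environment(action):
    """A two-armed bandit: arm 1 pays off more often than arm 0."""
    return 1.0 if random.random() < (0.3, 0.7)[action] else 0.0

value = [0.0, 0.0]     # the agent's running estimate of each arm's payoff
count = [0, 0]
total = 0.0
for t in range(10_000):
    if random.random() < 0.1:                 # explore: try something new
        action = random.randrange(2)
    else:                                     # exploit: use what you know
        action = 0 if value[0] > value[1] else 1
    reward = environment(action)              # nobody says what to do;
    count[action] += 1                        # you only get a score
    value[action] += (reward - value[action]) / count[action]
    total += reward

print("average reward:", total / 10_000)      # climbs toward 0.7
```

Notice there is no labeled "correct arm" anywhere in that loop. The agent only ever sees an evaluation of the action it actually tried, which is exactly the instruction-versus-evaluation distinction.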
Reinforcement learning has a very rich history; it uses ideas from a variety of disciplines. It uses ideas from operations research, it uses ideas from optimal control, but the initial reinforcement learning explorations were done in a field called behavioral psychology. People have heard of Pavlov's dog, right? Normally, when you give a dog food, it salivates; it is preparing to digest the food, getting ready to process it right in its mouth. But if you just ring a bell at a dog, it is not going to salivate; it will probably just bark back at you. What Pavlov did was, whenever he presented food to the dog, he also rang a bell. After a few such presentations, what happens? You ring the bell alone, and the dog salivates. What has happened is that through this experience of seeing the bell and the food together, the dog has learned to associate the bell with a reward. The reward in this case is the food: I hear the bell, so I should get ready to process the reward. It learns these kinds of associations through interacting with the world. Here it is a very simple association, because the dog did not have to do anything; it just had to process the input it was getting. But this is a very similar mechanism to how we train a lot of animals. You give them small rewards whenever they do something good, and they eventually learn to do what you want them to do. This kind of training is called conditioning, behavioral conditioning, and the first reinforcement learning models were actually proposed to mathematically explain this kind of behavioral conditioning.

I'm going to use a simple example to illustrate one of the key concepts behind reinforcement learning: tic-tac-toe. People are familiar with tic-tac-toe; I presume a lot of you have played it, and I'm pretty sure many of you have played it during classes, and not only the unruly ones. I'm going to say that if you win you get plus one, if you lose you get minus one, and if you draw you get zero, but I'm not going to tell you how to play the game. You have to learn to play it. It's a simple enough game that you could write down the optimal rules, but it is just for illustration purposes. You start off with a blank board; there are nine possible moves you can make, then for each of those the opponent has more, and so on. This is what is called a game tree, where you lay out all possible continuations of the game.

So let's start with a very simple approach to learning to play tic-tac-toe. This was proposed by Michie in the early 1960s; he built something called MENACE. What does MENACE stand for? It's not very menacing: it stands for Matchbox Educable Noughts And Crosses Engine. Noughts and crosses is what the British call tic-tac-toe. So: Matchbox Educable Noughts And Crosses Engine, MENACE. The idea was that they had this set of matchboxes and some colored beads, and they would do live demonstrations of the set of matchboxes learning to play tic-tac-toe. The way they did this is as follows. Each matchbox is labeled with some board position. You can see this one here: there are two X's and two O's, and X is supposed to make the next move. Whenever you come to a specific board position, you go and open the matchbox corresponding to that position. It is going to have different colored beads inside, and each colored bead corresponds to one move that you could make. For example, here there are five open positions, so there should be five different colors of beads in that matchbox.
And you have many, many copies of each bead, each color. So what you do is: you open the matchbox and pull out a bead at random. Don't look at the color; just pull one out at random, see what it is, and make the move corresponding to that color. Then you leave the matchbox open and remember the bead that you took out, and continue playing. There is an opponent that responds to your moves, and you continue playing until the end. And then at the end: if you win, you put back two beads of the color you drew into each open matchbox; if you lose, you throw the bead away; if you draw, you just put the bead back. Then you close the matchboxes. So what is happening here? Whenever you win, we have increased the number of beads of the color that led you to the win, so the next time you pull a bead out at random, the chance that you pull out that same color has gone up slightly. In other words, whenever you win, you increase the probability of the moves that led to the win. Whenever you lose, because you threw away the bead, you decrease the probability of the moves that led to the loss. Whenever you draw, you leave the probabilities unchanged. A very, very simple idea, right? They did this with just matchboxes and beads, and people were completely blown away: they had just watched an intelligent set of matchboxes, because it learned how to play the game. In fact, this became so popular that Michie had to take it around on touring exhibitions. They would go to different cities, set up the matchboxes, and demonstrate how it learns to play, with scheduled showings: you can come at 6 p.m. and watch the matchboxes. You can of course go online and look for the MENACE page, and you should find some video demonstrations. And it's easy enough: you can write ten lines of code, set up the MENACE engine yourself, and watch it learn. It is roughly the sketch that follows.
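The sketch below is a bit longer than ten lines, and the random opponent, the board encoding, and the bead-count details are my own assumptions, but it is the same scheme as described: draw a bead at random, then adjust the beads when the game ends.

```python
import random

# All the ways to make three in a row on a 3x3 board.
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for i, j, k in LINES:
        if board[i] != " " and board[i] == board[j] == board[k]:
            return board[i]
    return None

boxes = {}  # one "matchbox" (a list of beads) per board position seen

def beads(board):
    # A fresh matchbox starts with one bead per legal move.
    if board not in boxes:
        boxes[board] = [m for m in range(9) if board[m] == " "]
    return boxes[board]

def play_one_game():
    board, history, player = " " * 9, [], "X"
    while winner(board) is None and " " in board:
        if player == "X":                       # MENACE's turn: draw a bead
            move = random.choice(beads(board))
            history.append((board, move))
        else:                                   # a random opponent
            move = random.choice([m for m in range(9) if board[m] == " "])
        board = board[:move] + player + board[move + 1:]
        player = "O" if player == "X" else "X"
    result = winner(board)
    for position, move in history:
        if result == "X":
            boxes[position].append(move)        # win: the bead goes back
                                                # plus an extra of its color
        elif result == "O" and boxes[position].count(move) > 1:
            boxes[position].remove(move)        # loss: bead is confiscated
                                                # (never empty a color out)
        # draw: just put the bead back, i.e. change nothing
    return result

wins = sum(play_one_game() == "X" for _ in range(20_000))
print("X won", wins, "of 20000 games")
```

Run it and you should see X's win rate climb as the beads accumulate on the moves that lead to wins.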
So let's go back and think about what we were doing here. This is our game tree. What did I do? I started off at the beginning, with a blank board, and I played one game, say all the way to the end. Then I looked at the outcome: was it a win, a loss, or a draw? Based on the outcome, I went back and updated the probabilities of the various moves I took along the way. That is essentially what happens: I wait until the end. But suppose I encounter a position like this somewhere along the way. What can you tell me from here? Never mind, I'm playing X: is this going to be a win or a loss? It's going to be a win. At this point, I don't even have to wait for the game to finish; I know it's going to be a win. So why do I need to wait until the end of the game? If I reach a position like this, I can already say, hey, it's a win, and start updating the probabilities of all the moves that got me here. In fact, I don't even have to wait for a position where I'm certain about winning. I can just look at a position and ask: how likely am I to win from here, given all the games I have already played? And then I can use that to update my evaluation of the previous position. I don't have to wait all the way until the end, because as I'm playing, I know more and more about whether I'm likely to win. One step further ahead, I know more accurately. That's the key idea: if I'm one step further down, I have a better idea of whether I'm going to win or lose, of what the eventual outcome will be. So I can use the predictions I can make one step down to update the predictions I made now. This idea of using predictions that are further along in time to update the predictions that come earlier in time is called temporal difference learning.

It turns out that this simple idea actually explains a lot of the complex behavioral experiments that people were running: humans and animals seem to have this kind of temporal difference learning in our heads, and we tend to use future predictions to update our current models. Temporal difference learning was proposed by Barto, Sutton, and Anderson in 1983, in what is considered the key paper marking the modern study of reinforcement learning. Sutton and Barto have since written the basic, defining textbook on RL, and very recently, just last week, they put the second edition online. So if you want to learn more, it is actually a very easy read. And not just in AI: this temporal difference idea has had a profound impact in behavioral psychology, in neuroscience, in operations research, and in control theory. To make the idea concrete, here is what a one-step temporal difference update looks like in code.
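What was just described is usually written as the one-step TD update V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)): nudge the prediction at the current state toward the reward plus the prediction one step further down. Below is a minimal sketch on a toy random walk; the chain, the step size, and the discount are my own illustrative choices, not from the talk.

```python
import random
from collections import defaultdict

# TD(0) on a toy 5-state random walk: states 0..4, and stepping right
# from state 4 ends the episode with reward +1; every other step pays 0.
alpha, gamma, n_states = 0.1, 0.9, 5
V = defaultdict(float)       # V[s]: current guess of how good state s is

for episode in range(5_000):
    s = 0
    while s < n_states:
        s_next = max(s + random.choice([-1, 1]), 0)  # wander left or right
        r = 1.0 if s_next == n_states else 0.0       # reward on finishing
        v_next = 0.0 if s_next == n_states else V[s_next]
        # The TD update: pull the prediction at s toward the reward plus
        # the (discounted) prediction one step further down the line.
        V[s] += alpha * (r + gamma * v_next - V[s])
        s = s_next

print([round(V[s], 2) for s in range(n_states)])  # rises toward the goal
```

No episode's value estimate has to wait for the end of the game to improve: every single step, the earlier prediction is updated from the later one.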
So this idea has been used in a variety of domains. I'll talk about a few in more detail, and in fact there is a place where you experience RL on a daily basis. Whenever advertisements pop up on Google, there is some amount of reinforcement learning behind them. It's very hard to do this in a purely supervised way, because different kinds of people come to a web page every day, millions of different keywords get typed in, and the set of available advertisements keeps changing on a very regular basis, with different advertisements coming in every half hour or so. So there is some amount of this trial-and-error component in the learning. If I show you an advertisement and you click on it, then I did something good. If you don't click on it, that could be for a variety of reasons; you might just be busy at the time, and it's not necessarily that you don't like the ad. And if you say "don't show me this ad" (I don't know how many of you have noticed that Google now has this option), the system takes that as a negative reinforcement. So that allows it to train better. That is essentially where reinforcement learning sits in that system.

One of the applications of RL that created a lot of buzz was an autonomous helicopter. People who are familiar with online courses will recognize that person: Andrew Ng. He did this as part of his PhD work and then continued it when he moved to Stanford as faculty. The idea here was amazing: they were training a pilot for this helicopter through reinforcement. There are many, many control actions you can take; if you cause the helicopter to crash, you get a penalty, and if you keep it balanced, you keep flying. That's what makes it amazing stuff. And of course, it was not trained on a real helicopter, so that it didn't actually crash one. It was trained on a simulator, a very, very detailed simulator, and once they were confident it was going to work, they put it on a real helicopter. The real helicopter had a fail-safe mechanism: if at any point the learning agent caused it to tip over dangerously, a safety controller would cut in. That would be treated like a crash, so the agent would get punished, but the controller made sure the helicopter didn't actually crash. They got it to a point where it could do all kinds of amazing things, even inverted flight. Helicopters can't normally fly upside down; they are not designed for it, they have a stable configuration, but they still managed to get it to fly upside down. Amazing, amazing work.

Here is another one. That's Peter Stone from UT Austin. One of the things I like about this demo is that they used reinforcement learning to solve a specific sub-problem within a very large, complex domain. The domain is training humanoid robots to play soccer. A lot of work has gone into getting humanoids to walk, into strategizing for a soccer game, and so on; it's not that you want to use reinforcement learning end-to-end in these situations. What they did was look at what was hard to solve using conventional techniques, and they figured out that kicking the ball was hard. So they trained a reinforcement learning agent to kick the ball, and you can see how well it learns to kick. In fact, it became embarrassing: quite often it would just kick the ball into the goal from long distances. It started winning football games with scores like 14-0 and 20-0, scoring 80-odd goals across a tournament; it won by embarrassingly large margins. At some point in the full video, which you can find on the UT Austin Austin Villa page, it actually kicks from near the halfway mark into the goal; it became that good. But the point is this: when they tried to write rules, or build it using traditional methods, they found it hard to get the robot to kick and control the ball, so they set it up as a reinforcement learning problem. What is the reinforcement learning problem here? If you kick the ball in the direction I want you to kick it, you get a high reward; the farther away you are from that direction, the smaller your reward becomes. If you want to maximize the reward, you have to kick it in the right direction. That's basically it, as the small sketch below shows.
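A shaped reward of that sort can be as simple as this. The cosine form here is an illustrative guess of mine; the talk does not say what formula UT Austin actually used.

```python
import math

def kick_reward(ball_direction_deg, target_direction_deg):
    """Shaped reward for a kick: highest when the ball travels in the
    desired direction, smoothly smaller the farther off it goes.
    The cosine form is an illustrative assumption, not UT Austin's formula."""
    error = math.radians(ball_direction_deg - target_direction_deg)
    return max(math.cos(error), 0.0)

print(kick_reward(0, 0))     # dead on target -> 1.0
print(kick_reward(45, 0))    # 45 degrees off -> about 0.71
print(kick_reward(120, 0))   # way off        -> 0.0
```

The agent never gets told how to move its joints; it just gets a bigger number when the ball goes where it should, and trial and error does the rest.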
And I'm going to talk about one more application here: a game-playing agent, for a game called backgammon. One thing you should have realized, if you have studied AI, is that AI folks like games a lot. Games are nice because they are easily simulatable, and they have enough complexity that you can think about solving problems that are not trivial. So we end up talking a lot about games in AI. But the learnings you get from games are what get transferred into controlling humanoids, or controlling helicopters, and so on. Games are places where not only does your agent learn, but you also learn about the whole process of learning.

The game of backgammon was, again, considered a very hard game for computers to solve. For people who don't know backgammon, think of it as two-player Ludo. Everyone knows Ludo, right? In Ludo you have four players, you race your coins around the board, you can cut your opponent's pieces, and your goal is finally to get your pieces off the board. Backgammon is like two-player Ludo, but there are a lot more pieces per side, so you can play all kinds of strategies. And you can't cut a piece if there is more than one piece on the same point, so you can protect your pieces like that; you can do all kinds of strategic things. But the main complication is that you throw dice during the play of the game. You saw the game tree we talked about for tic-tac-toe: it had a branching factor of nine at the root, and the branching gets smaller as you go down, seven, five, and so on. But in backgammon, the branching factor is on the order of 400, because of the dice throws and the number of different pieces you can move. So it becomes a very hard game for computers to play.

Gerald Tesauro, who was at IBM, and is still at IBM, came up with a player called TD-Gammon, which uses the temporal difference rule to learn to play backgammon. And it beat the best human players back in the 90s. But what was very cool about TD-Gammon was that it learned completely by what is called self-play. What do I mean by self-play? The agent played against another copy of itself. There was no human being involved, except of course in setting the whole thing up. It just kept playing game after game after game against itself. And it got to such a level that it started playing moves that were not considered good moves by humans, but that actually resulted in better chances of winning than the moves humans recommended: new moves that had not been recorded by humans in centuries of backgammon play. And backgammon is a very popular game; it's just that you may not play it much. It's very popular in the Middle East, there is even a world championship of backgammon, and people have written books about backgammon strategy. Even there, these moves had not been recorded; that is the main point.

I'll come back to backgammon in a bit, but I should have a slide on DeepMind here to keep in mind, because they almost single-handedly, as a company, revived interest in RL. In 2014, they had this live video demonstration at one of the conferences. A question from the audience: in the case of self-play, how did they ensure that both agents don't stay at the same level, since they are learning from the same situations?
Well, for one thing, one agent is playing white and the other is playing black, so they can't literally make the same moves. But more to the point, there is no way to ensure they stay apart, and it doesn't matter. If they keep making the same moves, what's going to happen? Neither wins by a huge margin. So the agent says: okay, I played that move last time and it didn't lead to a big win; let me try playing something else. It starts exploring. That's why I said the trial-and-error part is very important. If you don't have that exploration component, you will get stuck in something suboptimal. The second thing is that even though I said you play against another copy of yourself, both agents don't learn at the same time. What happens is: you take your current agent, make it the opponent, and keep it fixed for, say, 100 games or 1000 games, while the learning player alone updates; the opponent is frozen. After a while, you take the current version of the learner, replace the opponent with it, and carry on. So the opponent progressively becomes more and more adept at playing the game, but it is not changing while you are playing against it. That way, the learning player can do its exploration and find all these moves that are useful. In fact, there is no theory that says this should work, but it turns out to work really well. And AlphaZero, the newest version from DeepMind, also learns entirely by self-play; not the AlphaGo I showed earlier, but the latest version. Here is roughly what that frozen-opponent scheme looks like.
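A skeleton of the scheme just described. The Agent class and the play_game stub are placeholders of mine; in TD-Gammon's place you would have the real game and a neural-network update instead of the single `skill` number.

```python
import copy
import random

class Agent:
    """Stand-in learner; a real one would wrap a policy or value network."""
    def __init__(self, skill=0.0):
        self.skill = skill
    def update(self, outcome):
        self.skill += 0.01 * outcome        # stand-in for a learning step

def play_game(learner, opponent):
    """Stub for a full game: returns +1 if the learner wins, else -1."""
    p_win = min(max(0.5 + 0.1 * (learner.skill - opponent.skill), 0.05), 0.95)
    return 1 if random.random() < p_win else -1

learner = Agent()
opponent = copy.deepcopy(learner)           # frozen snapshot of the learner
for game in range(10_000):
    outcome = play_game(learner, opponent)
    learner.update(outcome)                 # only the learner changes...
    if (game + 1) % 1000 == 0:
        opponent = copy.deepcopy(learner)   # ...until the snapshot refreshes

print("final skill:", round(learner.skill, 2))
```

Freezing the opponent between refreshes is the design choice that matters: the learner explores against a stationary target, and the target only ratchets up once the learner has improved.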
Anyway, going back: DeepMind had this demonstration of playing Atari games. Atari games are simple, if you think about it. But what they did was train agents to play these Atari games from scratch: they just gave the agent raw pixels as input, and they expected the agent to give joystick commands as output. That was pretty tricky. And then, of course, they built the AlphaGo player and made the whole world go crazy. So let's talk a little bit more about the Atari work. It is learning from raw video input. Just think about it: you have at least a couple of decades of experience processing visual input. You look at the screen and say, okay, those are numbers there, those seem to be colored bars, these are some guys moving around, that's a ball. As soon as you see it move, you know, okay, that's a ball, I've seen enough of these; those look like aliens and I'm firing at them, and so on. You bring this huge history to bear; you look at the screen, move things around for a few minutes, and you understand the dynamics. Now imagine an agent learning this from scratch. It has no idea that these white pixels correspond to numbers. It has no idea that when it moves the joystick, the thing it is moving is this green thingy; this green thingy, not the orange one, not the white one. It doesn't even know what is responding to its joystick. We are very good at figuring that out because we know what normal dynamics look like; here, all of it has to be learned from scratch. All the agent gets is raw pixels, that's it. That's why it is such a hard learning problem when you think about it. What they did was fall back on deep neural networks: they used a convolutional neural network and then used reinforcement learning on top of it.

At the time, it was considered one of the hardest learning problems that had been handled, because it was done with almost zero knowledge about what was being learned. Other systems had a lot of knowledge built in; this one worked with essentially zero built-in knowledge. But more importantly, unlike the backgammon player or the helicopter examples we saw earlier, this was widely reproducible, because the Atari game simulators had been available for a long time; they were built long before the DeepMind folks came on board. And the algorithm was very simple to reproduce. They actually published the algorithm; in fact, they released the original code. So all of you could just download the simulator and the code, and if you have a decent GPU, you could train this and have it play. It also gave you a complex enough domain to try out different things and see where they take you. That was one of the big reasons the interest got revived.

So let's look at what happened here. Remember, the agent starts with no idea that it should even move the joystick; it just starts somewhere and keeps learning. At first it is quite bad, and it took a long time. But then it starts learning, and eventually it is playing really well; this is after hundreds of thousands of frames. And likewise it can learn a bunch of different games. Okay, here is Breakout. People who play Breakout well know the trick: you dig a hole through the bricks on one side, get the ball through to the top, and then it just keeps bouncing back and forth up there, knocking bricks off, and you don't have to move anything at the bottom. You'll see that happen now: it went up and started knocking off the blocks there. At the time this video was made, this was called a lucky run; but nowadays, with more training, it is no longer a lucky run. It actually learns to do this every time.

Here is a graphic from the original 2015 paper. Forget about what the exact statistics mean; just pay attention to this line. These are all the different games they trained the agent to play, and this line indicates human performance: on this side, the agent was better than humans; on that side, humans were better. This was in 2015, and now the line has moved over here. There is still one game on which humans are better, but the boundaries keep getting pushed.

So what is the secret sauce? What lets us succeed now, when we couldn't earlier? Going back to the backgammon player: what Gerald Tesauro used was a neural network, one that was considered complex at the time, though by modern standards it no longer is. He learned how to play the game with this neural network; forget the actual mechanics, because as I said, I'm not going to get into the details. The point is the input: he used a 198-dimensional vector. Each position of the board was described using 198 different features. What were these 198 features? They were certainly not just the raw board position.
Just "how many pieces are at each location"? That is certainly not it, because you can encode that with about 30 features; he had 198. So more thought had gone into it. Tesauro himself is a very good backgammon player, so he put a lot of thought into it, designed all these very carefully crafted features, and used them as the input to the network. This is very crucial: one of the reasons we are not able to fully reproduce TD-Gammon's performance is that we don't have access to those features. IBM kept them proprietary, so we don't have access to the 198 features that Tesauro used. But then remember, I said the Atari agent learns to play from raw video, from scratch. So what about the features there? Where do those features come from? They come from the deep neural network. This is a much more complex model: the deep Q-network has millions of parameters, while the TD-Gammon network had hundreds. But you can conceptually break the deep Q-network into one portion that learns the features from scratch, and another portion that maps those features to the values of the actions. This feature learning is the power that deep learning brings to RL, so that you no longer need somebody to sit and very patiently hand-design features. The same thing had happened in the helicopter case, and the same thing in the humanoid robot case: somebody had to sit and design the representation. Now you have a lot of applications where the features are learned automatically by these deep learning approaches, and you can solve more and more complex problems. Here is a rough sketch of that split, in code.
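The layer sizes below are roughly in the spirit of the early Atari deep Q-network papers, but treat the whole thing as an illustration of the feature-extractor-plus-head split, not a faithful reimplementation of DeepMind's network.

```python
import torch
import torch.nn as nn

# Sketch of the deep Q-network idea: one part of the network learns
# features from raw pixels, another part maps those features to a
# value per joystick action.
n_actions = 6  # e.g. the joystick commands for one Atari game

feature_extractor = nn.Sequential(        # learned, not hand-designed
    nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
    nn.Flatten(),
)
q_head = nn.Sequential(                   # features -> action values
    nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
    nn.Linear(256, n_actions),
)

frames = torch.zeros(1, 4, 84, 84)        # a stack of 4 raw 84x84 frames
q_values = q_head(feature_extractor(frames))
print(q_values.shape)                     # torch.Size([1, 6])
```

The convolutional layers play the role Tesauro's 198 hand-crafted features played in TD-Gammon, except that here their weights are learned from raw pixels along with everything else.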
So one of the hottest things now is AlphaZero, which DeepMind claims is a general AI algorithm. There is a quote from Viswanathan Anand about it. Stockfish is a chess program that grandmasters train against; in fact, Stockfish wins against grandmasters, it is that strong a program. To watch such a strong program as Stockfish, against which most top players would be happy to win even one game out of a hundred, be completely taken apart, is certainly something. When AlphaZero played Stockfish, it just basically took it apart and won by huge margins. And remember I said that AlphaGo beat Lee Sedol? AlphaZero beat AlphaGo 100-0. DeepMind has gotten to a point where they don't test their algorithms against humans anymore; they just test against other programs. And it is a single algorithm that learns to play not only chess and Go, but also the Japanese board game shogi, which is again a strategy game.

And of course, what next? We could do things like autonomous vehicles; well, we are almost too late to get on board there. Waymo is starting out with something like 200,000 rides in a few months, and Arizona is going to be crazy. But there are a whole bunch of other things we can use RL for. You could build systems for better use of water in irrigating fields; in fact, there is a company in Israel that actually uses reinforcement learning to decide on watering schedules for crops. Or we could use it for smarter energy: Google actually uses reinforcement learning for controlling the power consumption in their data centers.

So there is a lot of excitement in the AI community, and also a sense of responsibility, because we truly believe we can screw things up now. So there are a lot of regulations that people think need to come up. If it were just an exotic game of chess, there is not much to screw up; but now you could really screw up and mess up the world. We believe we are closer than ever to functional AI; of course, we have believed this before. And we also believe that AI can go out and share the workspace with humans, and that gives rise to things like this: what's next for AI? Here is a robot called Flippy, and they have actually started pushing it out; there is at least one place where Flippy actually makes your burger. So you can't even say, if I'm useless at everything else, I can at least go work in that burger place. Even that you can't say anymore. So better figure out what you're going to do with your life.

But then, of course, you don't have to be scared; I don't believe in a dystopian future like this. Look at that: "AI is potentially more dangerous than nukes." Why does he fund OpenAI, then? Anyway, I don't like that. I would like to see AI come up in the news more like this: "AI helps old lady cross the street, plays around with kids, and cooks food." And if you want to know more, if you want more pain, you can go to this URL, which has links to all my NPTEL videos on RL and ML and other things.