So, good afternoon. My name is Oisín Boydell, from CeADAR at UCD. I'm the principal data scientist and I lead the applied research group. I'm going to be talking today about, and giving an overview and introduction to, a really interesting aspect of machine learning called reinforcement learning.

We're all familiar with this loose term "AI". It's not very well defined, I think; it's fuzzy what it means, and if you look up definitions online they tend to be quite restrictive compared to the broad extent of what AI actually covers. But something that is definitely associated with AI is machine learning; in fact, at the moment machine learning is often simply equated with AI. It's a very important area, and one in which there's a lot of interest and activity currently. Within AI I must also mention other aspects: symbolic reasoning, expert systems, causal inference, and many, many others that sit under this umbrella of AI technologies.

Within machine learning itself, we're used to things like supervised learning, which covers a lot of the machine learning we'd hear about and be familiar with, where we have labeled training examples and we learn based on those. We also have unsupervised learning, which covers things such as clustering, where the data itself is really driving the learning and the outputs. And then finally there's the paradigm of machine learning called reinforcement learning, which is a little bit less well understood, I think. Research has been going on in reinforcement learning for quite a number of years, but it's not something that is really being deployed or applied in many real-world scenarios at the moment. As we'll see, though, this is very much changing, and there have been a lot of innovations in reinforcement learning over the last number of years.

So here's a closer look at these three paradigms of machine learning. In supervised learning you have labeled data and direct feedback, and it's about predicting outcomes based on those labeled examples. In unsupervised learning the data is not labeled: we're not giving the algorithm a labeled example and telling it what to make of it; the algorithm itself looks at the data and derives its own clusters or groupings. And reinforcement learning is much more about decision processes: an algorithm making decisions to take certain actions and understanding what those actions mean in its environment.

To give a bit of background: with supervised learning we have a labeled training set. For example, you might have a collection of documents that you provide to the algorithm, and some of those you've labeled: these are sports articles, this set is current affairs, this other set is scientific publications. You want your machine learning algorithm to learn the patterns within this labeled training set so that, for new examples, it's able to infer the correct category. So typically a machine learning algorithm works over your labeled training set, finds the complicated patterns in the data, and produces a trained model. You can then give this trained model new, unlabeled examples, and it will output what it thinks is the correct label for that data. This is the typical supervised learning paradigm that we're all mostly very familiar with.
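Just to make that pipeline concrete, here's a minimal sketch in Python. This is my own illustration rather than anything from the slides: I'm assuming scikit-learn, and the documents and labels are made up.

```python
# A minimal supervised learning pipeline: train on labeled documents,
# then predict the category of unseen text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled training set: documents with known categories (all made up).
docs = ["United win the league", "Election results announced",
        "New study on protein folding", "Cup final goes to penalties"]
labels = ["sports", "current affairs", "science", "sports"]

# The algorithm works over the labeled training set and produces a model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

# Inference: the trained model labels a new, unlabeled example.
print(model.predict(["Cup match ends in penalties"]))  # expected: ['sports']
```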
Supervised learning is used all across different types of applications, from images and documents to things like business processes, to make predictions and recommendations from labeled input data.

However, if we look at what reinforcement learning is, it's quite a different way of working. Here's a high-level view. The key concepts in reinforcement learning are the agent, a software agent, and the environment that the agent is acting in. The agent can take various actions, and each action causes a change in the environment. The environment triggers some reward, which the agent gets as feedback on the action it took. The agent also observes the outcome: given that I took this action, what has changed, and what is the new state?

It's always easier to put these abstract concepts against a more comprehensible real-world example. So think of the challenge of a mouse learning its way through a maze. The actor, or agent, in this case is the mouse, which is making the decisions and learning. The environment is the maze, and the reward relates to the objective, which is to find the cheese at the center of the maze. The actions are things like go forward, go backwards, take the left turn, take the right turn, and so on. In this feedback loop, when the agent takes a specific action, its state changes: the mouse is now in a different part of the maze based on the action it took. And the reward could be something like: have we got any closer to the cheese? Are we closer to the end of the maze? Based on this feedback circle, the reinforcement learning agent can learn which actions, in which states, give the higher reward.

These concepts are very much related to learning from experience. We can think of reinforcement learning as learning from experience, whereas supervised learning is much more learning by example: you're showing the system the correct answers for a set of examples. In reinforcement learning, the agent is learning by doing, by trying different things.

Some key aspects of this. First, the reward: the agent's motivation is to try to optimize this reward at each step. In the maze example, it takes an action and asks: have I got closer to that end goal, to that cheese? It's also very much focused on trial and error: the agent learns by trying actions given the current state it can see, observing the result and the reward it gets, so it learns that in this scenario, in this state, this action led to this outcome. And a key aspect related to that is iteration: the agent learns by trying out many of these actions in the environment and seeing what the outcomes are.
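Here's a rough sketch, in Python, of that agent-environment feedback loop. Everything here (the MazeEnv class, the states and rewards) is a made-up stand-in to show the shape of the loop, not a real library API.

```python
import random

class MazeEnv:
    """Toy stand-in for the maze: responds to each action with
    (new_state, reward)."""
    def step(self, state, action):
        new_state = (state + action) % 10        # pretend state transition
        reward = 1.0 if new_state == 0 else 0.0  # the cheese sits at state 0
        return new_state, reward

env, state = MazeEnv(), 5
for t in range(20):
    action = random.choice([-1, +1])         # the agent picks an action
    state, reward = env.step(state, action)  # the environment responds
    # here the agent would update what it has learned from the new
    # state and the reward it just received
```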
We can think of this as conceptually very similar to how we understand humans learning, and how animals learn; that's why I used the mouse example. In an unknown scenario, what we might do is try different things, observe what happens, and learn from our past experience how those actions work out and what they mean for some particular end goal or objective we're trying to achieve. Reinforcement learning maps very much onto this intuitive way of how we learn.

From this there's another key concept: exploration versus exploitation, or explore versus exploit. Imagine the agent at each step. Should it try something new, taking a random action, a random movement in the maze, to observe the reward and learn from that action? Should it take a certain corner in the maze just to see what happens, and whether it gets any closer to the cheese? Or should it do what it knows produced the highest reward so far? It might know that a particular direction produced a good reward last time, so should it just repeat that? In practice there's very much a mixture of both: a trade-off between exploring the environment, where there are many unknowns, and exploiting the knowledge the agent has already gained about the environment. This trade-off is a key part of many reinforcement learning algorithms.

A useful strategy is very often to favour exploration early on, when less is known about the task, the environment or the particular scenario, and then let the agent increase its exploitation behaviour as it builds up more knowledge and learning about the particular task it's trying to do. And in very dynamic environments where things change all the time (imagine it's not a fixed maze: certain corners open up, and new ways through the maze appear at different times), acting purely on past experience is not going to be an optimal strategy. It's very important to continue a certain level of exploration behaviour to discover new optimal routes in a changing environment.

In reinforcement learning nomenclature, the Greek letter epsilon is often used to represent this ratio of exploration to exploitation. Here's a very typical graph we'd see when a reinforcement learning algorithm is trained. Early on, towards the left, at a low number of iterations or episodes, the epsilon value is at one, which means we're in full exploration mode: 100% of the time we're taking random actions, because there's no pre-built knowledge of what the best action is in a particular scenario, and we're exploring and testing out different scenarios and what states and rewards they lead to. Then, as the agent learns more and more through different iterations and episodes, we gradually decay this epsilon value, gradually replacing the random exploration behaviour with exploitation of the knowledge it's gaining as it goes along. This is very much a key concept of reinforcement learning. And to bring it back to human and animal learning: in a new situation, we might take random actions, just to try things out, but once we've learned what works and what doesn't, we rely much more on the experience we're building up.
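This is roughly what that epsilon-greedy strategy looks like in code; a minimal sketch, with the action names, decay rate and episode count all invented for illustration.

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # learned value of each (state, action), zero to start
actions = ["left", "right", "forward", "back"]

def choose_action(state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore: random action
    return max(actions, key=lambda a: Q[(state, a)])  # exploit: best known action

epsilon, min_epsilon, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    a = choose_action(state=0, epsilon=epsilon)  # would be used inside each episode
    epsilon = max(min_epsilon, epsilon * decay)  # gradually shift explore -> exploit
```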
Another key concept is longer-term objectives. In many scenarios, we don't have intermediate rewards at each step. Even in the mouse and maze example, the mouse isn't actually going to know whether each step has brought it closer to the cheese, because it doesn't know where the cheese is; there's just a final objective. Many scenarios also require accepting short-term losses to achieve better long-term rewards in the end. This is the concept of delayed gratification. The image here is from the Stanford marshmallow experiment, where researchers put a marshmallow in front of children of a range of ages and told them: you can take this marshmallow right now and get that one reward, or, if you wait 15 minutes in front of it, you'll get two at the end. They found that children under about five years old nearly always took the marshmallow straight away; this delayed gratification is very hard to learn. And I think even as adults we still find delayed gratification behaviours a real challenge. It's no different for reinforcement learning algorithms: it's very difficult for a lot of them to model this kind of behaviour, but a lot of the algorithms do attempt to trade off short-term losses for longer-term gains.

I'm going to run through an example of a reinforcement learning algorithm called Q-learning, one of the commonly used algorithms, because I think it helps to show in a real, if very simple, scenario how this actually happens. This is somewhat similar to the mouse and maze example. On the upper right we have a car that can take different routes to different states, and each state has a number. The ultimate objective is to get home, at the bottom left, where there is a reward. In certain states you fall down a manhole, so obviously you don't want the agent to manoeuvre the car into those. The agent needs to learn how to move through this grid to reach that end state. It's a very trivial example, but I think it helps to show the concepts.

With Q-learning, the algorithm maintains a Q-table: for each state and each action, it maintains the Q-value of being in that state and taking that action. A Q-value is very similar to a reward, but you can think of it as the final reward backtracked, or propagated back, to the point the agent is currently at, reflecting the actions it will need to take from there to reach the goal. When the agent first starts learning, it doesn't know anything about the environment it's in, so the Q-table has zero for every entry. But after a few episodes, a few trial-and-error moves where the agent moves the car through subsequent states, it starts to build up knowledge about the environment it's operating in, and some of these state-action pairs have numbers representing the benefit of taking that action in that state. Eventually, after it's been trained for a while, it has built up a much more detailed picture, in which every state has a particular action that is the optimal one to take to get to the end reward. You can then use the Q-table by taking the current state as the input and reading off the action with the highest Q-value, which tells the agent: in this state, this is the action to take. The core of the algorithm is the update rule that gradually fills in this table, sketched below.
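For reference, this is the standard tabular Q-learning update, as a minimal sketch. The learning rate alpha and discount factor gamma are standard hyperparameters I've added; they weren't named in the talk.

```python
from collections import defaultdict

Q = defaultdict(float)   # the Q-table: zero for every (state, action) at the start
alpha, gamma = 0.1, 0.9  # learning rate; discount on future reward

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning step: nudge Q(state, action) towards the observed
    reward plus the discounted best value of the next state. gamma < 1 is
    the 'delayed gratification' knob, trading immediate reward against
    the long-term return backed up from the goal."""
    best_next = max(Q[(next_state, a)] for a in next_actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# e.g. after one move of the car (all values hypothetical):
q_update(state=2, action="right", reward=0.0,
         next_state=3, next_actions=["left", "right"])
```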
Now, you might have heard about deep Q-learning. Deep learning has been applied to everything in machine learning these days, and reinforcement learning is no exception. The issue is that this Q-table can be massive: there could be thousands of different states and many, many possible actions for each of those states, and training the agent through all these states explicitly might be impossible, especially if the states are changing all the time. What deep Q-learning does is replace the Q-table with a deep neural network, trained in a supervised fashion. Essentially, that model approximates the table, which means that for new, unknown scenarios that the agent hasn't actually encountered before, but which are quite similar to ones it has, the deep neural network can make a good approximation of what a good move or action would be in that unseen state. This is how we can get reinforcement learning approaches to do much more complex tasks than the simple maze or car-moving examples I've shown so far. I'll show a small sketch of this deep Q-learning idea in code after the next example.

So here are a few examples of where reinforcement learning has been used so far. A lot of these, you'll see, are toy examples rather than real-world, high-value solutions. That's because reinforcement learning is at quite an early stage of being applied to complex, real-world scenarios; it's really a cutting-edge approach in machine learning, one that's very important and holds a lot of promise.

A typical setting in which models are tested, and algorithms are trained and evaluated to see what works well, is video games, particularly old-fashioned 8-bit video games. These have very specific environments with a very simple mapping to a reward function: in the game you're trying to increase your score and trying not to lose a life or get killed, for example, so they map quite well onto the reinforcement learning setup. We can now produce reinforcement learning algorithms that, trained on hundreds and hundreds of hours of gameplay, without any supervised examples from humans playing, actually reach human-level performance and above on these games. On the bottom right there is Doom, which is a lot more complicated in terms of gameplay, rewards and objectives than something like Breakout on the top left, so these algorithms can get much better at the kinds of long-term, deferred-reward objectives I was talking about earlier.
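Coming back to the deep Q-learning idea for a moment, here's a minimal sketch of what replacing the Q-table with a network looks like. I'm assuming PyTorch and inventing the sizes; a real deep Q-learning setup would add pieces such as experience replay and a target network, which I've left out.

```python
import torch
import torch.nn as nn

n_state_features, n_actions = 8, 4  # made-up sizes for illustration

# The network stands in for the Q-table: it maps a state vector to one
# Q-value per action, so it can generalise to states never seen in training.
q_net = nn.Sequential(
    nn.Linear(n_state_features, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

state = torch.randn(n_state_features)      # some state, possibly unseen
q_values = q_net(state)                    # approximate Q-values for each action
best_action = int(torch.argmax(q_values))  # greedy choice for this state
```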
AlphaGo Zero is another impressive example of reinforcement learning. The original AlphaGo, Google DeepMind's system, played a very historic match in 2016 against Lee Sedol, one of the world's top players of the ancient game of Go, a game that is vastly more complicated for computers to play than chess: there are many more potential moves that can be made at each turn. The original AlphaGo beat Lee Sedol in that historic match. That original version was mostly based on supervised learning, though: many, many human matches were shown to the algorithm, and it essentially learned how humans play this game, and from that it became better than the current top level of human play. That was pretty amazing. But what came a couple of years later was AlphaGo Zero, the next version of AlphaGo, developed using purely reinforcement learning. AlphaGo Zero was designed purely around the rules of the game and what the objectives in the game were, and it was then given many hours and days of learning in which it learned to play by playing against itself. After 36 hours of training, it reached the level of AlphaGo Lee, the supervised-learning-based version. After 72 hours, it could beat that version 100 games to zero. And after 40 days, AlphaGo Zero surpassed all previous versions and became the best Go player in the world. The really interesting thing about this is that some of its moves and the way it plays are simply unknown to human players. Go has been studied over thousands of years, and the methods of playing and the ways people set up moves are quite well known in terms of the general approaches, but AlphaGo Zero came up with whole new approaches, because it wasn't based on seeing how humans played: it actually worked out how to play optimally by itself.

Other, more physical, real-world examples are in robotics, where using reinforcement learning to control robots is a big area of research. On the top left we have different robot arms trying to do a specific task, assembling different blocks together. These robots go through the trial-and-error process of taking actions, seeing what happened, and trying to improve their actions to complete the task over time. On the top right one is trying to flip a pancake; there must have been a lot of iterations, and a lot of pancakes on the ground, I think, to train that. That's why reinforcement learning would usually be done in simulated scenarios, like these video games, or learning the game of Go, where you can simulate everything fully in a machine and run literally thousands or millions of iterations. In the physical world this is obviously a lot harder to do. On the bottom left are physics simulations of learning to balance poles and move arms under real physics: it's much easier to simulate these things than to train on real-life robot arms.

A nice example comes from DeepMind again; they're very much at the forefront, the cutting edge, of reinforcement learning. They were able to reduce the cooling bill in Google's data centers by 40% by using a reinforcement learning approach to optimize the cooling and energy management in those data centers.

Reinforcement learning is also used a lot in financial trading, because the trading scenario is very much about taking specific actions: do you buy a certain stock, when do you hold that position, when do you sell it, and so on? So there's a lot of interest in agents that can actually be out trading in the financial markets and learning from the behaviour of those markets; there's a lot of interest in finance as well.
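As a hedged illustration (nothing here reflects our actual demonstrator; the class, actions and reward scheme are all made up), the trading task might be cast into the environment, state, action and reward mould like this:

```python
BUY, HOLD, SELL = 0, 1, 2  # the action space

class TradingEnv:
    """Toy trading environment: the state is recent prices plus the current
    position, and the reward is the profit or loss over each step."""
    def __init__(self, prices):
        self.prices, self.t, self.position = prices, 0, 0

    def step(self, action):
        price = self.prices[self.t]
        if action == BUY:
            self.position += 1
        elif action == SELL and self.position > 0:
            self.position -= 1
        self.t += 1
        # reward: change in value of the held position over this step
        reward = self.position * (self.prices[self.t] - price)
        state = (self.prices[max(0, self.t - 5):self.t], self.position)
        done = self.t >= len(self.prices) - 1
        return state, reward, done
```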
And just finally, to wrap up with some of the advantages and benefits, and also the challenges, of reinforcement learning. Firstly, a key advantage is end-to-end learning: the agent learns the best actions to take given the environment. It's not like supervised learning, where you get an outcome from the model and then take an action externally, outside the machine learning; with reinforcement learning, the agent itself is learning those optimal actions. It's much more automated than having a supervised learning component in a pipeline. It also doesn't require labeled training data, which is a huge plus: if you can model that particular environment, and model the reward functions and the objectives correctly, it can work purely through its own exploration of that environment. And it can discover new, more optimal ways of doing something, as in the Go example.

The challenges. It requires tasks to be representable in this very specific form: an environment, states, rewards, iterations. And designing reward functions that are aligned with the objectives is challenging: you can get a reinforcement learning agent to optimize those rewards perfectly in some cases, but it might not actually be doing what you wanted it to do, what your objective for the system is. You also need the ability to run many training iterations, and the algorithms are often very complex and hard to tune.

At CeADAR, we're doing a lot of research into applying reinforcement learning to real-world problems. We've recently completed a Lighthouse project in reinforcement learning, and we have a demonstrator application that enables you to experiment with deep Q-learning in a financial trading scenario, where you can supply your own time series data. So if anyone's interested in reinforcement learning, in learning more about it and maybe how to apply it, please come and contact CeADAR. Thanks very much.