I guess we'll just kick it off. Obviously, on a Sunday morning we might have fewer attendees trekking in, so the assistants can just make sure it goes smoothly. This is the Artificial Intelligence Revolution session by Kaising. I got to learn that his background is from Lazada, he went to Dublin High, and he has quite a bit to share on machine learning and his data science approach. We'll go for about 25 minutes, and we'll be recording the session. So, Kaising, thank you.

Thank you. Thank you all for coming today, on a Sunday morning no less. I'll try to talk softer. Today I'm going to talk about AI and its evolution, and part of the point is this: a lot of the talks we've been to, let me step back a bit, focus heavily on the mathematics and the coding, but very few focus on the logic and the thinking process behind AI. I've been doing machine learning and recommendation systems for a couple of years, and honestly, anything built on glorified lookup tables, where we pre-generate the recommendations and the relationships between people, has no intelligence in it. The model is just looking up whatever we generated beforehand. The reason today's talk focuses on reinforcement learning is that reinforcement learning is the closest thing we have to what we've seen in movies: models that actually think, models that can learn.

I think I need to deal with the mic, the feedback is pretty bad. Sorry. Can I just shout? I mean, it's a small crowd. But I think the mic, yeah, let me see. Yeah, we'll do the mic. Can you hear me? Yeah, okay, that's much better. Finally.

Okay, let's talk about reinforcement learning. Today's technical talk focuses on game playing in OpenAI Gym. You can see two games here. On the left is my implementation of Q-learning with experience replay and so on; we'll go through those concepts later. On the right is A3C. One thing to understand about both is that the environment keeps changing, and there is no physics model given. When I show the rocket landing, people ask whether physics or other information is provided. It isn't. The agent learns to play the game purely from reward feedback. All we tell the agent is: if you crash, you lose points; if you land between the flags, you gain points. We don't even give it the concept of a flag. Landing anywhere in the area yields a variable amount of reward depending on how close it is. Think of it as a feedback system: we never explicitly train the model on any particular scenario, we just give it feedback that doing this earns that reward. That's the foundation of reinforcement learning.

Similarly, for the walker, the entire map is generated randomly, so the agent won't know whether there's a block coming or a hole in the ground. It has to scan the environment, react, and walk as best it can. That's actually a very difficult problem; very few people managed to solve it until A3C came out, and we'll talk more about that later.

Now I want to sidetrack a bit into the philosophical side. When you watch a training run, there is always an exploration-exploitation cycle.
Initially you see a lot of negative scores. The way to read this is: when there's a whole bunch of negative scores, the rocket is crashing all the time; it's figuring out the world. We train it from a blank slate. Think of it like a baby that just got started, bumping around and trying all sorts of random things. I call it the "so cute and so dumb" stage: you just see it doing really stupid things.

Eventually it starts exploiting its knowledge, once it has built up sufficient understanding of the environment. That's where you see the cutoff, the tipping point, where it starts exploiting the environment and the score shoots up. It has learned enough about the environment to start executing on what it knows: you want to land as smoothly as possible between the two flags. It figured that out by itself; we never told it that. And eventually it goes beyond human level. You can play the game yourself, and it's very hard: the rocket starts in some random state and you have to work the left thruster, right thruster, and main thruster, and it crashes even with a human at the controls. But once a machine goes beyond human level, it never forgets. It goes through many generations of learning and builds up a stability of performance that humans can never match. We get tired, we can't focus all the time; machines never do.

Wait But Why has an article about the AI revolution. The thing is, right now AI is in a state where it's stupider than an ant. I have a friend here, Raymond; I asked him about RL and the current state of AI, and he says it's stupider than most insects. Which I agree with; insects have their own kind of intelligence in the way they work, ants building structures and so on. The current state of AI is really stupid; we're very far away. But once you get past the cute-and-stupid stage, where it starts exploiting knowledge, and once the first human-equivalent AI is invented, it takes a very short time for it to train itself up to Albert Einstein level. And once you have a room with millions of Albert Einsteins teaching each other mathematics and teaching each other about the world, it takes a very short time for them to absorb everything humans have come up with and go far beyond it. That's what the article is about, and that's where the fears of superintelligence come into play. Having said that, like I said, we are very far away; we are at the starting stage, nowhere near the end.

So let's start from the basics. Why play games? Games are good because they're fun; you saw the walker, people get excited. At the same time, a game is a very contained problem: you don't have a lot of outside factors affecting it, so we can have a simulated environment and play entirely inside it. The complaint, of course, is that it's not generalizable: you play one game, and the results are useless to carry across. That is true of the raw code; it's not true of the thinking behind the raw code. A lot of what I'll share today comes from AlphaGo, the team that played Go.
And we are transferring these ideas into other domains, because the improvements they made to the agent carry over. The knowledge, the thinking, is transferable, and that's what we should focus on today.

OpenAI is the effort Elon Musk and gang started to build this environment. To understand its importance: before it existed, people doing RL and AI spent a lot of time just generating environments to test in. We actually spent more time building those than running the models, and that's a waste; it takes days or weeks depending on how complex your environment is. Now they have Atari games, robotics tasks, and 3D games, which gives you a very simple start. Look at this, it's about six lines of code. All that's happening is I'm telling it to take 1,000 random actions. You're supposed to balance the pole, but because the actions are all random, it just spins around. (A minimal sketch of this demo appears at the end of this passage.) It's not hard; anyone who can write Python can import the gym and start coding straight away. The more complex environments have various stages, of course, but most people can learn it within half a day. And even if you're not interested in coding, there are plenty of resources on the OpenAI site discussing the latest developments in AI and environments, so it's still useful reading to understand the industry better.

So let's talk about a simple zombie game. You live in a terrible world where the only gift in life is ice cream, and you drive through hordes of zombies to get your gift in life, ice cream. Obviously, if you drive into them, you get eaten, so that's a bad idea. The simplest setup you can come up with is a four-square grid. There's a zombie on the left. You can't go diagonally; you can move in four directions. Go forward and there's nothing there. Your goal is to reach the ice cream. So you already have the three ingredients: state, action, and reward. The start state is your initial location. Actions: you can go forward; go down and you crash into the wall; go right and you crash into the wall; go left and you crash into the zombie. And there are various rewards: driving forward seems to give the best reward, and going left gets you eaten. So what will you do? Class, what will you do? Yeah, drive into the zombie. No, no, you go forward, okay? You move forward. I mean, you could feed yourself to the zombie if you're charitable or something. A logical, reward-maximizing agent will drive forward, to maximize long-term reward; we'll talk about long-term versus short-term later. So we drive forward. Now the ice cream is on the left, which gives a good score. You could drive backwards for a small positive score, I mean, you're not losing much, or crash into the walls and get nothing. A logical agent, again, will maximize the reward and just drive into the ice cream. And that's all you need to know about the fundamentals of reinforcement learning.
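For reference, here's roughly what that six-line random-action demo could look like. This is a minimal sketch, assuming the classic pre-0.26 gym API and the CartPole environment; the talk doesn't name the exact environment or version:

```python
import gym

# Take 1,000 random actions; no learning, the pole just spins around.
env = gym.make("CartPole-v1")
obs = env.reset()
for _ in range(1000):
    env.render()                        # watch it flail
    action = env.action_space.sample()  # a random action each step
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()               # fell over: start a new episode
env.close()
```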
To be more precise, you'll see equations and other scary-looking stuff, but look at it closely: it's the same thing. State, action, reward. In the Q-learning world, and we'll talk about what comes after it later, it's always state, action, reward, in various combinations. Here it's the Bellman equation, the most basic of them all: state, action, reward, plus a learning rate. That translates into the Q-learning update, and the code is a literal translation of the equation (a sketch appears at the end of this passage): a loop over states and actions in which the new value of an action is the old value plus the learning rate times the information gained, with alpha and gamma adjusting the learning rate and how much decay we apply to future reward, and a max taken over the next state's actions. The code itself isn't complicated; the thinking is the real work.

It's a recursive calculation. What this means is we go around this four-square grid, say a thousand times, over and over until the whole thing completes; there are different ways of ending it, either the values stabilize or you stop after a thousand loops. At the end we know the total score of each square. Obviously the ice cream square scores highest and the zombie square lowest, and we can use those scores to decide what to do from the starting square. But that's a very stupid way of doing things: you have to repeat it a thousand times, and if the state space is big, you're looping through everything. It's extremely inefficient, which is why other techniques come into play. I'll go through them over the rest of the time, and you can see these papers were released by the AlphaGo people; well, not just them, it's a combination of a lot of smart people. Over time this built up into the knowledge we have today.

With all the Go and AlphaGo excitement of the last few years, one thing people say a lot is that the state space, the number of possible games, exceeds the number of stars in the galaxy, and so on. Whatever the hyperbole on the astronomy side, the point stands: you cannot do what we just did and loop through every combination on Earth over and over, because it doesn't scale. One query could take a very long time, and training could take years. So instead there's a technique called function approximation. We turn it into a supervised learning problem, which we're very familiar with: we want the best approximation, minimizing a loss function. Using what data? As the agent moves around, like when we drove around the map just now, it builds up an understanding of the map. Can we use that to approximate what would happen in spaces we've never explored? That's the main idea. And the bottom part of the slide is where it comes in: minimizing the loss is the standard machine learning setup. Here I use momentum; you could use plain gradient descent or whatever. It's a very familiar concept for machine learning people. And machine learning people love neural networks, so here I use a simple neural network; you could use deep learning or any fancier thing. So this is the machine learning part: you treat it as a machine learning sub-problem inside your reinforcement learning problem. That's how the two help each other.

But even when you treat it as a machine learning problem, we come back to exploration versus exploitation. This isn't just plain supervised or unsupervised learning.
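In symbols, the update being described is the standard Q-learning rule: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max over a' of Q(s',a') - Q(s,a)), where alpha is the learning rate and gamma the decay applied to future reward. Below is a minimal tabular sketch on a toy two-by-two version of the zombie grid; the layout, reward numbers, and epsilon schedule are made up for illustration, not taken from the talk:

```python
import numpy as np

# Toy 2x2 grid, loosely modeled on the zombie/ice-cream example.
# State ids:   2 = ice cream   3 = empty
#              0 = start       1 = zombie
N_STATES, N_ACTIONS = 4, 4
ICE_CREAM, ZOMBIE = 2, 1
MOVES = {0: +2, 1: -2, 2: -1, 3: +1}    # up, down, left, right on the flat grid

def step(state, action):
    """Return (next_state, reward, done) for one move."""
    nxt = state + MOVES[action]
    off_grid = nxt < 0 or nxt > 3
    wrapped = action in (2, 3) and not off_grid and nxt // 2 != state // 2
    if off_grid or wrapped:
        return state, -1.0, False       # bumped into a wall
    if nxt == ICE_CREAM:
        return nxt, +10.0, True         # the gift in life
    if nxt == ZOMBIE:
        return nxt, -10.0, True         # eaten
    return nxt, 0.0, False

rng = np.random.default_rng(0)
alpha, gamma = 0.1, 0.9                 # learning rate, future-reward decay
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995  # start 100% random, decay over time
Q = np.zeros((N_STATES, N_ACTIONS))

for episode in range(1000):             # the "loop a thousand times" part
    s, done = 0, False
    while not done:
        if rng.random() < epsilon:
            a = int(rng.integers(N_ACTIONS))   # explore: random action
        else:
            a = int(Q[s].argmax())             # exploit: best known action
        s2, r, done = step(s, a)
        # New value = old value + learning rate * information gain.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
    epsilon = max(eps_min, epsilon * eps_decay)

print(Q.round(2))   # the "up" move from the start square ends up highest
```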
You need a tipping point where you tell the agent to stop trying random things. Initially it just explores randomly to build up knowledge, and then your supervised-learning neural network can work with that. But when do you switch over to exploiting what you know? That's where the idea called epsilon comes in, familiar from things like simulated annealing: we have a decay factor, and we set epsilon very high initially so the agent takes mostly random actions. Over time the factor decays and the agent takes fewer and fewer random actions. Imagine starting at 100% random; by the end it's 5%, 1%, 0% random, full exploitation mode, just like the decaying epsilon in the sketch above. In my case I used a Boltzmann-style scheme, but the main ingredients are a temperature and a probability; we use that as an epsilon that changes as the loop goes on. Again, it's not a groundbreaking concept or groundbreaking code; it's about bringing the concepts together. And building on top of that, once we have function approximation to handle the state space and an epsilon that decays over time, we can talk about long-term versus short-term optimization.

This idea is very good and very powerful. Look at the ice cream game again and suppose there's randomness: say the zombies move, or you're driving on ice so the car doesn't steer exactly where you point it; sometimes you skid too far forward, sometimes in the wrong direction. Is it to your long-term benefit, in that case, to always take the shortest path? Notice the shortest path here runs right between two zombies. What if a random skid bumps you into one? So there's a value function for the long term, and the short-term piece, the gap between a particular move and that long-term value, is called the advantage. We want to optimize the value, the long-term survival, the long-term ice cream eating, so we might want to drive around the zombies even when the advantage, the short-term gain, looks worse. You may take a lot more moves to reach the ice cream, but it's a lot safer in the long run, assuming you live to eat the ice cream. That's the concept behind dueling, and the code is simple: you literally split the model into two networks, optimize them, and add them back up at the end, as in the sketch below.
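To make the split concrete, here's a minimal sketch of a dueling head, assuming tf.keras (TensorFlow 2.x) and made-up layer sizes; the talk doesn't give the actual architecture. One stream estimates the long-term value of the state, the other the short-term advantage of each action, and they're added back together at the end:

```python
import tensorflow as tf
from tensorflow.keras import layers

N_STATES, N_ACTIONS = 8, 4   # hypothetical sizes, not from the talk

def build_dueling_net():
    inp = layers.Input(shape=(N_STATES,))
    x = layers.Dense(64, activation="relu")(inp)
    value = layers.Dense(1)(x)              # V(s): long-term worth of the state
    advantage = layers.Dense(N_ACTIONS)(x)  # A(s, a): short-term edge of each move
    # Merge: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
    # Subtracting the mean keeps the two streams identifiable.
    q = layers.Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([value, advantage])
    return tf.keras.Model(inp, q)

build_dueling_net().summary()
```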
It's not rocket science, but the concept of dueling lets us think about what happens when we break a problem apart into long-term and short-term, and more generally into layers. Why stop at two? We can have multiple layers, each doing a different optimization, and that's where the separate target network and the double Q-network come in. From this point onwards things get less obvious, I would say; the previous ideas are pretty straightforward, but from here we get into more conceptual territory, where people started thinking more deeply, the way humans do. Humans don't decide in one shot; we break a problem into components before arriving at a decision. That's what these designs do, with multiple similar networks doing different things.

So: separate target network, stored weights, and the double Q-network. Let me make it simple. Instead of a single network that both decides what we do and evaluates what we get from it, we break it up. The reason is the feedback loop, similar to what happened with the mic just now: I talk into the mic, the sound feeds back in, gets amplified, and loops continuously. If I train a model on data, then use data generated by the model to train the model again, a thousand times, a million times, it creates a self-fulfilling prophecy; it just keeps generating data to feed itself. That's a huge danger, and it leads to suboptimal solutions. So we break it apart and deliberately introduce noise. It sounds unintuitive, but in machine learning we already do something similar when we split data into separate training sets. So we purposely use a separate target network, we don't use all the data, and we keep two Q-networks: one to choose actions on the data and one to evaluate their value, and then we merge the results back together. That brings us to the deep learning side, breaking things up at the convolutional layers; that's the computer vision part, where we take pixels and many other attributes and merge them together. For all of this I have links to the thinking behind it; you can click the resources, we'll share the slides later, and you can check it out further.

The code looks like this, and it starts to get a bit more interesting from a programming perspective: I use a base network, I drop some of the nodes randomly to create noise, I have the two Q-networks calling each other, and a gamma. It's really just breaking the problem into many layers (a sketch in this spirit follows below). You can digest it later.
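Here's a minimal sketch of that "break the feedback loop" machinery, again with made-up sizes and a tf.keras assumption: an online network chooses actions, a separate, periodically re-synced target network values them (the Double DQN recipe), and dropout injects noise on purpose:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

N_STATES, N_ACTIONS, GAMMA = 8, 4, 0.99   # hypothetical sizes, not from the talk

def build_net():
    inp = layers.Input(shape=(N_STATES,))
    x = layers.Dense(64, activation="relu")(inp)
    x = layers.Dropout(0.1)(x)    # randomly drop nodes: deliberate noise
    return tf.keras.Model(inp, layers.Dense(N_ACTIONS)(x))

online, target = build_net(), build_net()
target.set_weights(online.get_weights())   # the separate target network

def double_q_targets(rewards, next_states, dones):
    """Double DQN: the online net CHOOSES the next action,
    the target net VALUES it; two networks, merged at the end."""
    chosen = np.argmax(online.predict(next_states, verbose=0), axis=1)
    q_next = target.predict(next_states, verbose=0)
    return rewards + GAMMA * (1.0 - dones) * q_next[np.arange(len(chosen)), chosen]

# Every so many training steps, re-sync the stored weights:
# target.set_weights(online.get_weights())
```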
Moving on: all of that is the old-school way of doing reinforcement learning, material that would normally take a semester, so I'm condensing a lot. Now we get to A3C, which moves beyond Q-learning. It's roughly the last two years of work from the AlphaGo guys, and it basically changed the whole industry. Before it, all the Q-learning stuff was cool but not scalable, because we only had one agent, very cute, training away over time. A3C is like MapReduce: for people into Hadoop, parallelism, and big data, think of a puppet master with many puppets below, each puppet doing its own thing and the puppet master coordinating their efforts. That makes it scalable almost without limit; you can run enormous numbers of training runs and have a puppet master coordinate everything. It sounds simple in theory, but combining the results is extremely hard; that's the hardest part of A3C, combining everything in a logical way and building up the knowledge over time.

There are, of course, three parts to A3C. Asynchronous: each environment, think of the rocket from earlier, runs independently and asynchronously by itself. Actor-critic: the critic is the puppet master. The actors, the puppets, give feedback to the critic, and the critic says, this is a terrible idea, you're stupid, learn from the other guy, try this, try that. Over time the critic steers the actors toward a common conclusion, maybe, maybe not, depending on how you build it, optimizing their payoff from the grand perspective of the puppet master. Advantage: this comes from dueling, the short-term gains we just talked about. The critic takes care of the long term, and each actor takes care of the short term: just optimize your own little environment and get your ice cream, while the puppet master handles the zombies, the randomness, and whatever other problems. That's the gist of it; there's a quote and a fuller explanation on the slide, but this is the easier way to say it. (A toy skeleton of the worker/coordinator structure appears at the end of this section.)

So we've gone through an entire revolution. Reinforcement learning has existed for probably 20 years, and the past 5 years have been super exciting; a lot of what you see is from the past 5 years. There's Keras-RL, which covers the things I keep talking about: deep Q-learning, double DQN, dueling, A3C. You can just call a library. But again, the whole point of me being here is to tell you: don't be a code monkey. Don't just call a library, say "A3C, yeah," and fill in the blanks. You need to understand the power, the reason it works, and a lot of the theory behind it. The best way is to combine some of these concepts together, which is quite hard, because A3C by itself is a pain to optimize and code up, never mind combining it with the rest. It's difficult, but that's where understanding the logic comes in helpful. There's TensorFlow RL as well; these libraries are all in development, and that one sees less activity, but it still has most of what we need. The fun thing now is A3C on GPUs: NVIDIA's labs have an entire project on A3C with GPU clusters, and NVIDIA these days offers cloud-based GPUs as well, or you can always spin up AWS to do the same thing, plus all the usual GPU deep learning. If you want to play Super Mario Bros, you can contribute to the Super Mario Bros project. All of this is available out there.

At the end of the day, it's the famous call: if not now, when? And the reason it's now is that there are a lot of tools out there to help you. It's no longer like 20 years ago, when all of this was conceptual. There's an environment, OpenAI Gym, for you to play with; there are libraries for you to call; and if you need the mathematical understanding, there's a lot of support in forums and elsewhere. It's probably the best time, with AWS and the hardware available.
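To make the puppet-master picture concrete before the questions, here's the toy skeleton mentioned above. Only the asynchronous, shared-parameter structure is real; the actual actor-critic gradient math is stubbed out with random noise, so treat it as a shape, not an implementation:

```python
import threading
import numpy as np

# Several workers ("puppets") explore independently and asynchronously push
# updates into one shared parameter vector (the "puppet master").
shared_params = np.zeros(4)
lock = threading.Lock()

def worker(worker_id, steps=100):
    rng = np.random.default_rng(worker_id)
    local = shared_params.copy()                # each worker trains a local copy
    for _ in range(steps):
        fake_grad = rng.normal(size=4) * 0.01   # stand-in for real A3C gradients
        with lock:                              # push the update to the master...
            shared_params[:] = shared_params + fake_grad
            local = shared_params.copy()        # ...and re-sync the local copy

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_params)   # the combined result of all workers' updates
```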
There is no longer an excuse for us not to try it out at home. Try it for a bit, at least understand what this thing is; it's understandable at the basic level. If you want to go deeper, of course, it's a lot harder. It's the best time to be alive to try out AI on your own laptop or spin up a cluster. And with that, I'd like to end my talk. Thank you so much for coming on a Sunday morning.

Hey, we've got time for questions. Yes, question? Any questions? Yes?

(Audience question, inaudible.) Yes, it's the same thing; I do it on my local PC. The reason is partly that I'm a cheapo, and partly that it doesn't matter until you touch this part. This part obviously matters: my Mac being stupid, the GPU is useless, so for this part you do need the cluster. Everything above it you can do on a local PC. Even then, it depends on the challenge you're taking. The rocket is simple to medium. Say you want to do the walking one: there's a softcore version and a hardcore version, and for the hardcore version you do need a cluster, because it's completely random and unlimited; that's what's difficult about it. The softcore one ends after some time; the hardcore one just keeps running, an endless dungeon with unlimited terrain, and there are other, more difficult games you can play. For those, you need a cluster. Yeah. Yes?

(Audience question, inaudible.) Okay, that's an interesting question. The closest I came to this was back when I was still at Lazada, talking with some friends there about using reinforcement learning in recommendation systems. To answer it: at this stage, I think RL is fun to play with, but the reward for business isn't there. It takes a lot of time to set up, and you need a very good feedback cycle. Playing games is one thing; there you have a very clear feedback cycle: do this, eat the ice cream, get a reward. In the business world, people use it for advertising, for investment; you click on an advertisement and there's a reward. But you can do the same with traditional supervised and unsupervised methods at a much cheaper price. So at this point in time, I would say maybe not, at least not in the commercial space at the moment. In the future, we may see more of these things when we really need them. Even in e-commerce recommendation, "stupid" methods that are very fast, using graphs and pre-computed relationships, are good enough. You don't need a full-time reinforcement agent, which is very expensive to run, maintain, and build, just to recommend new stuff. You can pre-compute everything, cache it with all the hardware, and pump the results out. That's good enough to make a lot of money for the company. So at the moment, not yet; this is for fun and laughter. But give it a few more years and it will be different; right now the industry doesn't really need something this expensive.

Thank you so much.