Thanks, everyone, and thanks for coming to the talk. This paper is what I just call the Q3A CTF paper, or the Quake 3 CTF paper, because the actual name, "Human-level performance in first-person multiplayer games with population-based deep reinforcement learning", takes a bit too long to say. So I call it the Q3A CTF paper; that's also what I named it in my Dropbox so I could find it, because I will never remember the full title. My name is Dave, Dave Ingra, and that's my email address. By day I run a software company that I founded, and by night I learn about machine learning. I'm not a professional in the field, but for the past three months I've been trying to learn quite a bit about machine learning when I have time, and before that I was learning about artificial intelligence in general. At this point I think I have enough knowledge to do the paper justice, but I probably won't be able to answer all the hard technical questions, not even close. Hopefully I can still give you a good sense of the paper. Good. So, who's played Quake 3 Arena CTF? All right, a few of us. Cool. For those who haven't, there's a quick embedded video that gives you an idea. Before I play it, let me say what the paper is really about. It's about letting agents teach themselves to play this game, a team game where you compete against another team and you have teammates, with training driven entirely by on-screen pixels and rewards. The game was released in the 90s and shipped with bots, but those bots hooked into the internal game logic: they knew exactly what was happening in the game, they had perfect knowledge, and in theory they could even see through walls, so they could outsmart you in various ways. The bots trained in this paper can only see what a human player can see. The game doesn't normally look like this. What they did is take the Quake 3 Arena engine, same engine and same mechanics, and simplify a few things. The maps are procedurally generated and a bit simpler, and they didn't train on just one map, because then the agents would learn that one map and not be able to generalize. Normally the maps look much better; here the structure is quite simple, as you can see on screen, and the two halves should be symmetric. The graphics themselves have been simplified quite a bit, there's only one weapon, the rail gun, and there are no ammunition or health pickups. So it's a bit simpler, but it's still the actual game. I also read on their site a few days ago that they're now trying this on the full-scale original Quake 3 Arena maps, so you'll probably hear more about this in the next year or two. So here's the basic intro to the game: you run to the opponent's base, which is marked in the opposite color to yours, red in this case.
There's your teammate. Sorry, so the blue player right now is run by the machine? Yes, we're seeing the first-person perspective of a blue agent. Blue then has to carry the red flag back to its own base to score a point. If the opponent has your flag, you shoot them, they drop it, and it returns to your base. Those are the basic mechanics: you score a point every time you capture the opponent's flag, and after five minutes the team with the most captures wins. So it's really a team thing. Like any team sport, your individual performance, while important, isn't what ultimately matters, and that's one of the things that makes the paper interesting: a sort of emergent cooperation between agents without any direct communication. It's all implicit. You see that your teammate is heading off, and in effect you reason probabilistically that maybe you should stay and defend the base, and all of that emerges over the course of training. The other thing I find interesting about the paper is that first-person play is a very good area to develop for machine learning in general, because hopefully machines will eventually learn about the natural environment the way we do, with a camera where we have eyes. On that note, here's another version. There are two kinds of maps: the indoor one you just saw, and this outdoor one, which is slightly harder because it has more visual clutter and some elevation changes. Let me show you that one. You might also notice that the agent's motion is slightly jittery; it doesn't move as smoothly as a human player would. You'll see in a bit that that's because of the way they structured the controls: the agent wasn't given continuous movement, and I can only assume that's a limitation of the training setup. The agent has to choose between turning a little or a lot; it can't pick exactly how much it wants to move. That's another interesting thing to keep in mind. Okay, so this is what the actual game looks like normally, and you can see how much they simplified it. The teammates and opponents we just saw were roughly axially symmetric, so if they were facing the other way they looked about the same, and that's easier for a neural network to recognize than these detailed player models. If this player were facing the opposite way, or to the left or the right, it would take more training to recognize them than the simplified player models. So that's another simplification. Okay, so does everyone know what reinforcement learning is? I'll just spend a minute or two on the reinforcement learning aspect. The agent doesn't really have a model of how the environment works: it's given states and has to make decisions. That part isn't specific to reinforcement learning.
But what happens in reinforcement learning is that the agent doesn't know how its actions will be rewarded until it actually receives the reward. The hello world of reinforcement learning is the cart-pole problem: you have a pole balanced on a cart, and if it falls too far to the side, you lose. You get one reward point for every time step you keep it upright, and if it tips too far the episode ends, so if you held it up for three seconds you get three points, and the goal is to collect as many points as you can. It's not a difficult problem, you can apply the simplest algorithms to it, but it illustrates the concept. The agent eventually learns to balance the thing without knowing anything about physics; it just gets inputs like the angle of the pole and can push the cart left or right. If you used a neural network here, what you'd be optimizing is how well you predict the rewards. There's a small sketch of this loop right after this bit. Any questions so far? Okay. Then I want to quickly talk about reward shaping, because it's one of the things they mention in the paper and one of the things they build on. Winning this game is a team outcome. It's like a soccer match, where you only win at the end of both halves, and nobody cares how many goals you personally scored if your team didn't win; the end result is what matters. The catch is that these games are five minutes long. In the cart-pole problem the feedback is quite immediate: after three seconds, if you couldn't balance it, you know right away and you can try again and again. Here you have a sparse reward problem. After five minutes you find out whether you won or lost, and now you have to look back, so to speak, over those five minutes and figure out which of your actions contributed. Because all those actions build on each other in a long chain, that's incredibly difficult, and they actually tried it in this paper: when the only reward was the final match outcome, the agents could not learn anything. So they use reward shaping, and for the final agent, the For The Win or FTW agent, they use a more advanced variation, essentially a learned internal reward function (I can't remember their exact term right now). The simpler baseline agents they compare against in the ablations use plain reward shaping, while the final FTW agent gets the more advanced learned version. Either way the shaping is essential, because the agents couldn't learn without it, and I'll explain now how the reward shaping actually worked. This next slide shows it.
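Here's that cart-pole loop as a minimal sketch, my own illustration rather than anything from the paper, using the Gymnasium library and a purely random policy, just to show where the per-step reward comes in:

```python
# Minimal cart-pole loop (illustration only): a random policy collecting reward.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                    # push the cart left or right at random
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                                # +1 for every step the pole stays up
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```

An actual reinforcement learning agent would replace that random choice with a policy trained to maximize the episode return.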
On the right you can see the points that the game itself hands out by default. Players do get points, just like individual stats in soccer, but they don't affect the outcome of the game; it's still about who captured the most flags. For example, you get six points for capturing the flag. With reward shaping, those points become rewards, so the agent can reinforce whatever behavior produced them, and that's how it can learn. So the baseline-type agents get the pixels on the screen plus these default game points. For their agent they tweaked it slightly and introduced one, two, three, four, five, six, seven events that aren't actually rewarded in Quake: being tagged without the flag (which can obviously be a negative reward), a teammate picking up the flag or returning the flag, a negative reward when the opponent captures your flag, and so on. So it's slightly more information than what's in actual Quake, but not a lot. Then in their agent, the listed point values are the base ones, and each agent's weightings over them are sampled from a log-uniform distribution. What the agent is supposed to do is learn to weight these events differently. It's not just straight reward shaping but learning which interim rewards to place more value on, and it keeps adjusting those weightings as it trains. For instance, it can learn that being tagged while carrying the flag is worse than being tagged without it. There's a small sketch of this weighting idea after this section. Any questions on that? So the weightings were pre-programmed into the agent? The base points are actually what you see in Quake if you press Tab and bring up the scoreboard of teams and players; the numbers there are the accumulation of these. But did the agents know what the points meant, how to rate everything, or did they just receive points when they performed an action? They didn't know the meaning of the points. The points weren't attached to any meaning, and nobody told the agent "you're getting this point because you captured the flag." So they just get a reward, no idea why. Exactly. So they go about doing their stuff and then they just get whatever points. Exactly. And then it's up to them to figure out what that point implies. Yes: when they get a point they have to work out how to reinforce the action, or the set of actions, that led to the reward, regardless of what the reward actually means. That's it. All right. Okay, then population-based training. That's in the title of the paper, so I thought we should cover it. It's one of those ideas that's actually quite simple once you see it, and I read a fair chunk of the population-based training paper.
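Going back to the reward shaping for a second, here's a rough sketch of that internal-reward weighting. This is my own illustration with made-up event names and base values, not the paper's exact table or scheme:

```python
# Internal reward as a weighted sum of game-point events (illustration only).
import numpy as np

rng = np.random.default_rng(0)

# Placeholder game-point events with placeholder base point values.
base_points = {
    "flag_capture": 6,
    "tag_opponent_with_flag": 2,
    "tagged_without_flag": -1,     # one of the extra events they added; can be negative
    "teammate_flag_pickup": 1,
    "opponent_flag_capture": -2,
}

# Each agent starts with its own per-event multipliers, drawn log-uniformly;
# population-based training then copies and perturbs them over the course of training.
weights = {event: float(np.exp(rng.uniform(np.log(0.1), np.log(10.0))))
           for event in base_points}

def internal_reward(events):
    """events: dict mapping event name -> how many times it happened this step."""
    return sum(weights[e] * base_points[e] * count for e, count in events.items())

print(internal_reward({"flag_capture": 1, "tagged_without_flag": 1}))
```

The point is just that the agent is rewarded per event, and the relative weighting of those events is itself something that gets tuned during training rather than fixed by hand.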
I'm not sure I read even half of it, but I read enough to form a decent working understanding of it. Obviously you need to throw a lot of computing power at this, and you want to run a lot of training iterations. The first thing you could do is sequential search, but that doesn't really scale; you can't do Google-scale machine learning by training one configuration, seeing whether it works, throwing it away, and trying again. Maybe you hit some local minimum while minimizing your loss function, you figure that out, and then you start again with new hyperparameters and re-initialized network weights and try once more. A different approach is parallel random search, where you run a bunch of those sequential experiments simultaneously, throwing a lot of compute at it, and at the end you pick the one that performed best. Population-based training also runs a lot of training in parallel, but it adds the notion that at any point an underperforming agent can be replaced by something better, and that doesn't have to happen at a fixed time; it's asynchronous. Whereas in the plain parallel case you would train everything and only compare at the end, with population-based training you can compare members at any point. Every member exposes how well it's doing, meaning how many games it has won, as well as the weights of its neural network and the hyperparameters it's using. Periodically, each agent looks at another randomly chosen agent and estimates its win probability against it. If that probability is below 30%, the agent gets replaced: it copies the better agent's weights, its hyperparameters, and its internal rewards, which are the learned reward weightings we were just discussing, the more advanced form of reward shaping. There's a rough sketch of this exploit-and-explore step below. But this is still about improving learning, right? This is not inter-agent communication; it's still training a single agent at the end of the day. Because you said there are several players per team, so how do they communicate with each other? No, that's not this part; I'll get to how they actually train it now. There are 30 agents being trained simultaneously. Let me skip ahead a few slides to the training details, and I'll come back to the two-tier optimization. So there are 1,920 game processes running simultaneously. Those are games, so there are that many games running, and every game has four players, two per side.
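Going back to the population-based training comparison for a moment, here's that exploit-and-explore step as a rough sketch. The 30% threshold is the one from the talk; the mutation scheme and the member attributes are my own assumptions, not DeepMind's implementation:

```python
# Population-based training: one periodic comparison step (illustration only).
import copy
import random

WIN_PROB_THRESHOLD = 0.30   # replace me if I'd beat the chosen peer less than 30% of the time
MUTATION_SCALE = 0.2        # assumed perturbation size for the "explore" step

def pbt_step(me, peer, estimated_win_prob):
    """me / peer: population members with .weights, .hyperparams and .reward_weights."""
    if estimated_win_prob < WIN_PROB_THRESHOLD:
        # Exploit: copy everything the stronger agent has learned so far.
        me.weights = copy.deepcopy(peer.weights)
        me.hyperparams = dict(peer.hyperparams)
        me.reward_weights = dict(peer.reward_weights)
        # Explore: perturb the copied hyperparameters and internal-reward weights a little.
        for k in me.hyperparams:
            me.hyperparams[k] *= random.uniform(1 - MUTATION_SCALE, 1 + MUTATION_SCALE)
        for k in me.reward_weights:
            me.reward_weights[k] *= random.uniform(1 - MUTATION_SCALE, 1 + MUTATION_SCALE)
    return me
```

Each of the 30 population members keeps training in the meantime; this comparison just happens asynchronously on top of the ordinary learning.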
They trained with two players per team, but the agents also seemed to generalize to three per team; I don't think they retrained them for that, it just worked. Like I said before, every game is five minutes. The agents are given 15 observations per second and can take 15 actions per second. The game itself can render at 60 frames per second, but I don't think the internal logic runs that fast; the visuals can run at 60 Hz, though. If you work it out, there are only 30 models being trained, but 1,920 game processes, so 1,920 times four is the number of agent slots actually playing, and dividing that by 30 gives you 256. So each model is taking part in 256 games at once, and the results of all those games are streamed back to that one model, which uses them for its backpropagation and all the other things you need to do for training. There's a quick check of this arithmetic a bit further down. I'm just curious what hardware they used to run this. They didn't say specifically in the paper, to be honest, but I do know that some of what's happening here, like the LSTMs, is a bit harder to run efficiently on the Google TPUs, so they may well have used a pile of GPUs instead; I can't tell you how many. I'm guessing someone is googling it right now to see if they can find it. Was this university research? No, this was DeepMind, and most of the cited papers are also from DeepMind; you'll see when I get to those. Now, I used the phrase "wall-clock playing time" earlier, but I really mean total in-game time. The agents were trained for 450,000 games, which by a rough calculation is about 4.3 years of playing time, and the FTW (For The Win) agents as well as the lower baseline agents were all trained for that long. They reached average human performance after about 120,000 games (I read that off the graph), which is roughly 1.14 years of game time, and strong human level after about 1.71 years of game time. That's game time, so if you have good GPUs and enough of them you can obviously compress it into a reasonable amount of wall-clock time, but I'm still not sure how long it actually took. I don't think there's anything else on this slide, so let's go back to the previous slides. What was that about the observations per second? Sorry, per second, yes, fifteen of them. And there's something else: there's an auxiliary network for reward prediction, and we probably won't go into the details because I'm also not too clear on all of them, but basically it does experience replay. It stores a bunch of past experiences that can be reused for extra batches of gradient descent.
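Just to double-check the arithmetic from a minute ago (my own back-of-the-envelope numbers, using the figures from the talk):

```python
# Games per model, and game time in years (rounded the same way as on the slides).
games_running    = 1920     # concurrent game processes
players_per_game = 4        # two per side
population_size  = 30       # models being trained

print(games_running * players_per_game / population_size)   # 256.0 concurrent games per model

minutes_per_game = 5
for games in (450_000, 120_000):
    years = games * minutes_per_game / (60 * 24 * 365)
    print(f"{games:,} games is about {years:.2f} years of game time")
# 450,000 games is about 4.28 years; 120,000 games is about 1.14 years
```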
Okay, so the two-tier optimization. I didn't make a slide for it, so it's an empty slide, but I just want to talk through it. The bottom tier is the agent learning to optimize its own rewards. The higher tier is simply that if your team doesn't win, if your teams keep losing, you will be replaced by something better. So it's two tiers. So you can't just play to win for yourself; you have to be a team player? Yes, and that team behavior isn't explicitly programmed; it evolves over time. As those parameters mutate, the agents that do the cooperative things are more likely to survive the selection. Right, okay. So here's the progress during training. It gives you an idea of how the rankings change over time, and it shows what I described just now: where the FTW agent, the agent in the paper, reaches average human and then strong human level. This line is the one I mentioned earlier, where the agent trained with no reward shaping at all: with only the sparse reward of the match outcome at the end of the five-minute game, there's no learning. Here is a slightly higher baseline where they did add reward shaping, and the only difference between those two curves is the reward shaping. So the shaping by itself actually does a lot; it can't get you to strong human level, but it gets you pretty close. There are two more curves missing that they didn't graph here, which are the slightly better versions of that baseline. They took the one in red and added the population-based training we just described, and that one does better. The next level up is the FTW agent without one of its key components, which we might discuss in a bit, and then the full FTW agent. These next plots are just looking at the hyperparameters. The learning rate going down over time is normally a good thing; you'd expect to see it. I won't really discuss the other ones, except this one, which is roughly how much history the agent is considering. There's a sort of hierarchical LSTM, and this shows how it tweaks how far back it looks for patterns; that's my intuitive explanation of it, and it's being learned as a hyperparameter thanks to the population-based training. Those hyperparameter curves are just for the FTW agent in blue; they're not applicable to the lesser agents. And this is what the agents actually see. The videos I showed you earlier are not what they were trained on; those were rendered at a higher resolution for demonstration, so it looks nicer. The agents are trained on 84 by 84 pixels with three color channels, and this is also what they used in previous work: the Atari experiments used exactly the same 84 by 84 input. That's actually quite interesting. I'm not sure how much more computing power it would take to use more pixels, but clearly it wasn't worth it; it was more worthwhile to add complexity to the structure of the neural network than to increase the resolution. Did they mention how they came up with 84 by 84? No, but like I said, they've used it before; the Atari paper was from 2015.
They probably just tried a few things and saw that this worked best; that's usually what happens. It might also be that, because it's a convolutional neural network that looks for small local patterns, learning to spot edges and textures and things, it can do a reasonable job even at a low resolution. So I'm not sure about the actual reason, but it's normally just trying different things until something works, and I'm guessing they tried different sizes here as well and then settled on what they've always used. Okay. So this is what the agent can do: it can turn left or right by 10 degrees or by 60 degrees, look up or down by 5 degrees, strafe left or right, move forward or backward, fire its rail gun (which is the tagging), or jump. Like I said before, the game can run at 60 frames per second, but the agent only acts at 15 Hz. And it can do these things simultaneously, so it can turn right while jumping and firing and moving backwards. That's why there are 540 composite actions; I'll put a quick enumeration of those a bit further down. Okay. So this is what I was referring to just now. This was the 2015 paper, also from DeepMind, and I'm showing it so you can see how things have progressed. This was just learning to play Breakout, with the same 84 by 84 resolution. It's a much simpler game, and not only because it's two-dimensional: the other reason is that you have perfect knowledge of the world in this sort of game, whereas in Quake 3 Arena you don't know everything that's happening. Someone could be behind a wall, someone could be behind you, and you can't tell that from the on-screen state. So here it's learning, just doing random things at first, not having much success, but clearly learning something. That's at 120 minutes, I think on a single GPU, not a whole cluster. So it's getting better, and here's the point where it learns to do even better: it learns that it should get the ball in behind the bricks. Brilliant. So in roughly four years, I'm not sure if it's exactly four, they went from this to the Quake work. That clip is from Two Minute Papers. Yes, that's Two Minute Papers. I came across this a couple of years back; someone actually gave a talk on it, I think on the same thing. I can't remember exactly what sort of learning it goes through, but somehow it became so good at playing that I think it's very difficult for a human to replicate. Yes. And they tried that with many games, and some didn't do so well. Games like Montezuma's Revenge, where you're on one screen, you go to the next, you pick up a key, you go back to the other screen: the agent doesn't know it's in a different state, because it just sees the same screen again and thinks nothing has changed, so it can't tell.
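Going back to the action space for a second, here's a quick enumeration showing where the 540 comes from. The exact discretization values are as I read them off the paper, so treat them as my reading rather than gospel:

```python
# Composite action space: every combination of the discrete choices (illustration).
from itertools import product

turn   = (-60, -10, 0, 10, 60)   # yaw change in degrees
look   = (-5, 0, 5)              # pitch change in degrees
strafe = (-1, 0, 1)              # left / none / right
move   = (-1, 0, 1)              # back / none / forward
fire   = (0, 1)
jump   = (0, 1)

composite_actions = list(product(turn, look, strafe, move, fire, jump))
print(len(composite_actions))    # 5 * 3 * 3 * 3 * 2 * 2 = 540
```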
Anyway, here's the network they used in that 2015 paper, which was published in Nature. It's simple enough that I can understand it, and I'm not a professional in this; it's very simple. On the right I've just listed what the convolutional layers are. The first thing to know is that it receives four frames at a time, and the reason is simply so it can tell what the velocity of the ball is, or the acceleration, or anything else you can derive from four frames. From a single frame you can't tell anything about velocity, so with four frames it has an essentially complete state representation and doesn't need any memory of what came before; it can derive what it needs directly. Then it's a very simple set of convolutional layers. There's an 8 by 8 filter that gets swept across the image, and through backpropagation it learns what it should have been looking for in that 8 by 8 tile, and eventually it gets some meaning from that. The same goes for the 4 by 4 and 3 by 3 filters; they might pick up edges or some texture or something. And there's a bunch of them: 32 of the 8 by 8 filters being slid across, then 64 of the 4 by 4 and 64 of the 3 by 3, and then some fully connected layers at the end. Very simple, much simpler than the neural networks used for advanced image classification, which are much deeper, with a lot more going on. This one is actually super simple when you look at it. There's a rough sketch of this network just below.
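Here's that 2015-style Atari network as a rough sketch in PyTorch, my own reconstruction of the layer sizes just described rather than code from either paper:

```python
# The small Atari convnet: 4 stacked 84x84 frames in, one value per action out.
import torch
import torch.nn as nn

class AtariNet(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 32 filters of 8x8 swept across the frames
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 64 filters of 4x4
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 64 filters of 3x3
            nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512),                  # fully connected layers at the end
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(frames))

net = AtariNet(n_actions=4)
print(net(torch.zeros(1, 4, 84, 84)).shape)              # torch.Size([1, 4])
```

The four stacked frames are exactly the trick mentioned above: they give the network enough to infer velocity without any recurrent memory.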
But then we get to where we are now. Don't be scared by the legend on the right, but the complexity has gone to a whole new level. Panel A at the top left, and I'm not sure everyone can see it, shows the overall architecture of the agent: visual processing, recurrent processing, the auxiliary networks, and then the policy, all color coded. This blue visual embedding corresponds to panel D here, and that component is almost exactly what we just saw, almost exactly what was used in the Atari work. So that part is still used, but they had to add a bunch of other things to get it to do this, and it took a few years of extra research to get that working. One thing they put in, on the baseline agents, is an LSTM, which is just a recurrent neural network that allows the agent to reason over time by carrying some hidden state. But then they also did something more advanced, a temporal hierarchy, which is two LSTMs connected together and running at different time scales. I'm not going to go into all the details, partly because it's too much and partly because I don't know all the details myself. And then they did something else, which is that little hexagon there, the DNC memory: they plugged a DNC into the LSTMs, and that hexagon is this whole other thing. So the architecture is actually quite a lot more complicated than it first looks. The DNC is the differentiable neural computer, a 2016 paper by DeepMind, and it learns to save pieces of information in memory so that it can retrieve them again later. It's far more advanced than just an LSTM. It has a write head that it learns to position, it puts some data there, and then it can position a read head and read the data back when it wants to; after reading it can either keep the data there or erase it. I'm not going to go into the detail of this because I didn't read the whole paper, but it shows you how much more complicated things have gotten in four years, which is really the point I want to get across in this talk. That hexagon is plugged in there on the recurrent processing with the temporal hierarchy; you can see it in the variational unit at panel G. Okay, and that's the architecture comparison. They also took a few things from the UNREAL agent, a 2016 DeepMind paper, namely those auxiliary networks for pixel control and reward prediction, and those just result in faster learning and improved performance. They could also run them on Atari and get much faster learning, and they help on this sort of 3D task too. I've already covered the temporal hierarchy and the differentiable neural computer, so hopefully that gives you some sense of what's going on in terms of architecture. I'm not really going to go into the training methods used for this, the training itself, because when you look at the architecture diagram you think "this is what's happening, this is the neural network", but there's actually a whole separate algorithm for how you train it: how you pick the right actions, evaluate them correctly, and credit the right values to the right states. A few days ago, I think on Thursday or Friday when this was published in Science, they released a bit of code. It's a zip file containing two files: one is called pseudocode.pdf and the other is called pseudocode.py, so you can choose whether you want your pseudocode in Python or not. And then just one short note on this: they found that part of why the agents do well could be explained by the fact that they're simply faster than humans, with better reaction times. So they asked, what if we handicap the agent a little and give it a 267 millisecond delay? You can see the results of that here. What you want to do is compare strong human performance against the delayed agent opponent, and you'll see it still performs quite well: strong humans could still only win about 20% of the time against the delayed agents. So the performance can't be explained purely by faster reaction times. There's a small sketch of that delay idea at the end of this bit. And that's pretty much it. If you want, I can quickly show you part of the video they released to accompany the paper. I already showed you that part. Let's see if there's anything else. Here's an illustration of how they're training everything concurrently; it's just to give you an idea, for presentation's sake. The graphics are just graphics. Yes, they must have someone, or a few people, working full-time just on these visuals.
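On that reaction-time handicap: at 15 actions per second each step is about 67 ms, so 267 ms works out to roughly a four-step delay. Here's one simple way to picture it, my own sketch rather than how DeepMind actually implemented it:

```python
# Delaying an agent's actions by a fixed number of steps (illustration only).
from collections import deque

class DelayedAgent:
    def __init__(self, agent, delay_steps=4, noop_action=None):
        self.agent = agent
        # Pre-fill with no-ops so the first real decisions only take effect later.
        self.pending = deque([noop_action] * delay_steps)

    def act(self, observation):
        self.pending.append(self.agent.act(observation))   # decide now...
        return self.pending.popleft()                       # ...but act four steps late
```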
This visualization shows how the population evolves: those dots are the 30 agents, and you can see how they're progressing over time. That's at 20,000 games, and that's how they perform. Let me find something more interesting. Here they try to analyze the reward signal and decode what's happening inside the agent, and these give you a sort of neural response map. They're basically trying to show that there are sections of the network responsible for particular behaviors. You'd expect that, but they're showing it graphically to give you an idea. The rest of the video is just some games being played, which gives you an idea of the behaviors that emerge, so you can watch that yourself. And that's it; that's all I've got. That last architecture diagram you showed us, is it applicable to all kinds of first-person visual games? I think they've refined it to the point where it should be usable for most games, but I'm not sure they've applied it more widely. When they did the Atari work, they applied the same network to a whole bunch of Atari games; with this one I'm not sure, but what they have done is mostly on the same engine. They built their own games inside the Quake engine, something called Labyrinth (DeepMind Lab), where the agent has to go and find things, which is not a team game like this. But as far as I can see they haven't adapted it to other first-person games, though it could work. Cool. And the training, is it all on different layouts, different maps? Yes, every time they generate new maps, and, I didn't cover this, there are also different map sizes, some a bit bigger, and they kept some maps aside for validation purposes, so they could be sure the evaluation maps were ones the agents hadn't seen before when they wanted to validate. I think just navigating through the map is not easy. Right; knowing how to get from your base to the opponent's base, and knowing that you must reverse that route to get back, is quite impressive. That's the temporal hierarchy at work, I suppose. But on your question about how much of this can be reused: the first part of it is exactly the Atari network. They took that as a front end and put all these other things behind it. So I guess one way to think of it is that even if it doesn't work for all kinds of 3D shooter games, it could potentially be a starting point that you tweak from. Yes, although the back end of the diagram is complicated, and it looks very specialized for certain aspects of certain games. This is the first time I've seen something from DeepMind with an obvious business opportunity to derive from what they've done. For example, war games for the military; they always run war games. What is the best way to move my soldiers and tanks? That's one opportunity.
Another opportunity is that, using maybe a similar engine, they could figure out the best way to get people working in a factory: what's the best setup to get the most efficient work, the most productivity? But you'd need example video to drive it, unless you mounted it on a trailer with a camera and let it move around. Yeah, and remember, with reinforcement learning you do need a reward. There are lots of machine learning algorithms, but for this specific one you need some sort of reward signal. Not only that, as you said earlier, it can't be a reward that you only get at the end; you have to have some mechanism to get the reward continuously through the process, right? Yes, it can't be too sparse. For the factory stuff, though, I think there are a lot of people working on that already, and there are probably better ways of doing it. I've actually seen a talk about the routing of robots within an automated warehouse, Amazon-style warehouse picking: navigation and picking algorithms for robots that go through a warehouse and pick things. That's a very different solution from this, and not as complicated as what they've done here. It doesn't need to be this complicated. Exactly, probably a much simpler solution. I mean, if you wanted something to learn your warehouse layout from vision, then you would need this sort of visual network; this is a very vision-based neural network, and the first part of it is the same visual convolutional network. But for other problems, maybe you wouldn't apply this if you could get a model of the world, say a schematic of your warehouse. Or you could have some mechanism to bring in location data, your agent's 3D location within the schematic. Yes, or location data, rather than having to infer it from vision. But I still think vision can probably be applied to many problems. Once we have a few more years of research, and some of this work is also about making these algorithms more efficient, maybe we'll be able to apply vision to more problems than we thought. That would be cool. I just thought of something: maybe you could take this and apply it to basketball, in terms of strategy. Certain players are strong in certain things, so how do I put together a strategy to win? You could use something like this to come up with strategies that no human would think of. Yeah. There have been some start-ups, especially in sports, trying to use AI to help teams win games. There are also those robot dogs, the Aibo dogs I think, where little physical robot dogs play soccer as an AI challenge; I think it's called RoboCup, robot soccer, and I believe it also uses reinforcement learning. That's sport, in a sense. A more complex version of that AWS thing.
They have this thing nowadays that they do at every one of their conference events, where they have these little robot cars and you watch them race around a track. Oh, okay. The DeepRacer challenge or something; I saw it. Isn't it a Jetson challenge? No, with this one you just train the network and then the inference runs on the car. Freescale is somehow involved, I think; maybe they build them for them. All right, thanks very much, Dave. Thank you, and thanks for coming. I learned a lot today. They've actually shared some of the code they've done so far? Yes, they recently put out a bit of code. You can find it if you just Google DeepMind; DeepMind has been sharing quite a bit of things.