I'm excited to get to talk about two of my favorite topics at once: machine intelligence and robots. They go together pretty well, but they're not the same thing; you can definitely have one without the other. First, some disclaimers. I'm not going to give you the answer to human-level intelligence. I would if I had it, but I don't. Next, these are my own personal opinions. They're definitely not those of any current or former employer, and they don't reflect those of many experts in the field. Take them with a huge grain of salt. If they are useful, you're welcome to them, and if they're not, please discard them. Also, the story that I'm going to tell you is not rigorous. It doesn't have any equations, it's conceptual, and I just intend it to start a discussion and foster ideas. Throughout this presentation, we'll be talking about intelligence. The working definition that I propose is that it's a combination of how much one can do and how well one can do it. Notionally, you could be extremely good at only one thing and not be that intelligent. You could also do many things, but do them all very poorly, and likewise not be that intelligent. Intelligence is the combination of being able to do many things and to do them well. This is a functional definition of intelligence. There are many other potential definitions; some of them can be measured experimentally and some can't. This particular definition has the advantage that if we wanted to reduce it down to a measurable set of tasks, we could. Because it is, at least in theory, observable, it allows us to have a scientific discussion about machine intelligence. It lets us form hypotheses that we could potentially falsify, and it lets us compare the relative intelligence of two separate agents. In short, it's practical and useful. Undoubtedly, some will find it philosophically unsatisfying. That's a separate conversation, but one that I would be happy to have in another forum. Using fake math, we can say that intelligence equals performance times generality. It's only fake because we haven't defined performance or generality yet, but assuming we do, you can imagine plotting them on a set of axes like this. Although we haven't defined it quantitatively, what I would like to propose is that human performance is the level at which a human expert can do something. This can be measured in lots of different ways: error rates, the time required to execute a task, the time required to learn a task, the number of demonstrations required before one learns a task, the amount of energy expended when performing a task, the subjective judgment of a panel of human judges, the speed at which someone does something. There are many aspects of performance, and I'm not going to try to specify or quantify them all here. I only list them to illustrate that we are considering performance in a broad sense, rather than in a narrow machine-learning accuracy-leaderboard sense. If we consider human-level performance to be something of a baseline, we can place it on our x-axis and then chop up the rest of the axis into equal increments. We'll make this a logarithmic scale to enable us to compare a very wide range of performances; equal steps along this axis represent equal multiplicative factors of change. Human-level generality is the set of all the tasks that humans can do and have undertaken.
These include things like writing stories, cooking pies, building cities, gathering and transmitting information all around the world, and even exploring the origin of the universe. It's a very broad set of activities. We can represent human-level generality on the y-axis. Roughly, this is the set of all tasks that a human or group of humans can do. We'll make the y-axis logarithmic as well, so an equal interval is a factor of ten, either multiplied or divided depending on whether you're moving up or down. Human intelligence can be notionally represented by the area of the rectangle defined by this point. I just want to point out that there's no reason to believe machines can't exceed human performance in some areas. Humans have a number of limitations that are built into the way we've achieved our intelligence through evolution, things that may have been very useful at one point, or may be useful broadly, but may not be useful in pushing the limits of intelligence: limited attention, instinctive drives, the way every part of us fatigues, and a host of cognitive biases, all of which put some distance between us and perfectly rational, perfectly efficient, or optimal behavior. Machines, by comparison, have a more nurturing environment in which to take root. They don't have to evolve; we're trying everything we can to encourage them. Now you can imagine, on the performance-generality axes, another agent that can do a much larger set of tasks than humans, although it does them all more poorly than a human could. It might look like this: the area under that rectangle, the overall intelligence, would still be comparable to that of a human, so we would call that human-level intelligence as well. You can also imagine an agent who can do a subset of the tasks that humans do, but does them so very much better that the area under its rectangle is also about the same as a human's. Again, human-level intelligence. Now, if we take the set of all of these agents that have about the same area under their intelligence rectangle, we get a curve representing human-level intelligence. Any agent who falls along that curve would be comparable to a human, any agent down and to the left of it has subhuman intelligence, and any agent up and to the right has superhuman intelligence. Now let's look at a few agents that you might be familiar with and see where they fall in this scheme. World-champion-level chess-playing computers have been around for more than twenty years now. IBM's Deep Blue beat Garry Kasparov in 1997. This was a task that people assumed computers would have a very tough time with. It involved planning, strategy, thinking, mental models of your opponent. It seemed to draw on the very peak of human cognition, and it seemed unlikely that such a thing could be done by a fancy calculator, but it was, and now a chess program running on your phone can do about as well as Deep Blue did. The current state of the art is a program called Stockfish, which has an Elo rating, which is like a chess skill score, of 3447. Compare this to the top-rated human player of all time, Magnus Carlsen, who achieved 2882. The program and the human are not even comparable; they're not even close. It's worth noting that Stockfish is an open source project, with code freely available and a number of contributors scattered around the globe.
Now, in terms of generality, Stockfish understands the rules of chess, and in fact it understands them very well. It has a bunch of hints and tips and tricks built in by humans that are specific to chess. It uses a point system to evaluate pieces based on where they are and what the stage of the game is. It uses a full table of endgames: once there are just a few pieces left on the board, the number of possibilities for how the game can play out is small enough that they can be completely enumerated, so there's no search and no solving, just looking things up in a giant table and figuring out what to do next. There are hand-tuned strategies for each phase of the game, and then the computer uses tree search to evaluate future moves. Each choice of move is a branch on this tree, and it can ask, for each move, what the likely outcome is. For each of those outcomes, what are the possible moves my opponent can make? For each of those, what are my possible responses? By going down this branching tree and looking at the possibilities, the program can figure out what its best choice is. One of the things that makes Stockfish so clever and so good at its game is that it's very good at pruning this tree, ignoring moves that are unlikely to lead to any good end and not exploring them very far. But all told, this program is pretty useless at anything that is not chess. It is the opposite of general. So on our plot, we would put chess well above human-level performance but also very low on the generality axis. Now compare that to Go, also a board game. If you're not familiar with it, there is a 19-by-19 grid, and opponents take turns placing white and black stones. The rules are in some ways simpler than chess: at each turn, you pick a junction on which to place a stone. The strategy, however, some argue is even more complex and, more importantly, more subtle. What is undoubtedly true is that there are more possible board configurations: where chess has an 8-by-8 board, Go has a 19-by-19 board, and where each piece in chess has a small, prescribed set of moves, a stone in Go can be placed on any open junction. So when doing a tree search, the number of moves explodes much more quickly than in chess. Despite this, two years ago already, AlphaGo, a program built by researchers at DeepMind, beat Lee Sedol, a professional 9-dan player. Being 9-dan puts him among the all-stars of the Go world. A later version, AlphaGo Master, was clocked with an Elo rating of 4858; compare this to the highest-rated human player, Park Junghwan, who is rated at 3669. So not only have computers beaten the best Go players in the world, they already do so by a very healthy margin. Now, as we mentioned, the program knows the rules of Go and uses tree search to evaluate moves. Because there are so many possible board configurations, however, it can't memorize them all, so it uses convolutional neural networks to learn common configurations. Most importantly, it can see patterns that are not exactly repeated but are similar to things it has seen before. This is an important innovation, and we'll come back to convolutional neural networks in a few minutes. It also uses reinforcement learning on a library of human games to learn which moves are good. Reinforcement learning, in very straightforward terms, is looking at a configuration, an action, and the outcome, and learning the pattern.
For a given configuration, if I take action A, good things happen. If I take action B, bad things tend to happen. After I learn those patterns, the next time I see that configuration, I can take the action that leads to good things. Using reinforcement learning on a library of human games bootstrapped AlphaGo and let it learn from the history of human play and human insight to get started. But like Stockfish, it's useless at anything that's not Go. So while it has a few tricks that allow it to be more general, it is still very narrow in its applications. On our plot, we might put it here: again far exceeding human-level performance, but still very low on the generality axis. Now let's jump to a different category entirely: image classification. There is a wonderful data set called ImageNet, which has many thousands of images classified by hand by humans into a thousand predefined categories. These categories include household items, cars and doors and chairs; they include animals, giraffes and rhinoceroses; they also include, for instance, many different breeds of dog. So it's not a trivial thing to take an image and accurately categorize it, and in fact a typical human score on this is about 5% error: about one out of every twenty images, a human will classify incorrectly. It's a tough task. There is a large-scale visual recognition challenge in which teams enter their algorithms for classifying these images. In 2011, the very best entry got 26% error, so about one in every four images was wrongly classified. Still, three out of every four were correct, which was a pretty impressive performance. Each year this error rate decreased by close to half, which is an amazing rate of progress, and in 2015 the error rate first dropped below the human level. So we had computer programs classifying images better than a human does. In 2017, one of the more recent competitions, more than half of the teams got fewer than 5% of the images wrong. So now machines are routinely beating humans at this task. Pretty impressive. In terms of generality, this task is definitely harder, more general, more challenging than a board game. There are more variations, more possibilities. To get at it, these systems use convolutional neural networks, a deep neural network architecture specifically designed for finding patterns in two-dimensional arrays of data, like pixels, or squares on a chessboard or a Go board. They're very good at finding patterns that can be visually represented. This is good, but it has been shown to break easily outside of the set of images that you train it on. To give an example: if you look at the images in the left-hand column, we see a soap dispenser, a praying mantis, a puppy. These are all images that were correctly categorized by a convolutional neural network. With the addition of a little bit of distortion, shown in the middle column (you're seeing that correctly, it's just a little bit of noise; the gray areas are no change at all), you get the images on the right. Visually, to us, they look identical or very similar. You might be able to see a little bit of warping and distortion there, but for whatever reason, the convolutional neural network confidently predicted all of these to be ostriches. This is not to say that these networks are not powerful and good, but they see something different than we are seeing. They are not learning to see in the same way that we see.
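To make that fragility concrete, here is a minimal sketch of how such an adversarial image can be produced using the simple "fast gradient sign" trick: nudge every pixel a tiny amount in the direction that increases the classifier's loss. This is not the code behind the examples on the slide; the choice of PyTorch, the pretrained ResNet, the random stand-in image, the class index, and the step size are all just illustrative assumptions.

```python
# A minimal sketch of a fast-gradient-sign adversarial example.
# The model, image, class index, and epsilon are placeholders, not
# anything from the talk.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()          # any pretrained classifier
image = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in for a real photo
target = torch.tensor([291])                             # arbitrary "true" class index

loss = F.cross_entropy(model(image), target)
loss.backward()                                          # gradient of loss w.r.t. pixels

epsilon = 0.007                                          # tiny, visually negligible step
adversarial = (image + epsilon * image.grad.sign()).clamp(0.0, 1.0)

print(model(adversarial).argmax(dim=1))                  # often no longer the true class
```

The point is only that the perturbation is computed from the network's own gradients, which is why it can be so small and still so effective.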
Since then, the fragile nature of convolutional neural networks has been demonstrated in other ways. In some images, changing a single pixel, the right pixel to the right value, can change how the image is classified. Others have found that you don't even have to stay in the digital domain: you can take carefully designed stickers, affix them to something, and have that object be confidently classified as a banana, whatever it actually is. In my favorite demonstration, a physical plastic turtle was rotated, and from every direction the convolutional neural network confidently predicted that it was a turtle. Then, after it was repainted with a different pattern, not symbolic or representative of anything, but carefully chosen, that same convolutional neural network categorized it as a rifle. These examples show that, at least as currently done, our image classification generality is not quite where we would like it to be. Definitely higher than human performance, but classifying ImageNet is a much narrower task than it might appear on the surface, so we'll put it pretty low on the generality axis. Here's a really fun example: video game performance. Again, the folks at DeepMind put together a deep Q-learning, or deep reinforcement learning, architecture to play video games. We'll talk more about what that is in a second, but what they did is take 49 classic Atari games and let the algorithm just look at the pixels and make random moves. The algorithm didn't know whether a given control was supposed to be move left, move right, jump, or shoot. It just took moves, used reinforcement learning to learn from the outcome, and learned the pattern of: when I see this and I do this, either something good happens, or something bad happens, or something neutral happens. After doing that for long enough, it learned the patterns that let it choose the right thing to do. In 29 of these 49 games, it played at or above human-expert level, and that was super impressive. This is not just looking at a picture and saying this is a cat. This is looking at a picture in the moment and saying, for this particular instant, the right thing to do is jump; and then by jumping, that changes the image, and then having to respond to the new configuration, again and again and again, and doing it better than a human. Now, the other part of this is that there were 20 games at which it did worse than a human. The algorithm used convolutional neural networks to learn the pixel patterns, for which they are perfectly suited: the pixels are big and coarse, there's no noise, they don't change, and the patterns are clear, so what the algorithm sees is very close to what we as humans see. And it used reinforcement learning to learn what actions to take in each situation. Even so, there were 20 of these games on which it wasn't able to match human performance. The pattern among those games is that they tended to require longer-term planning. One of them was Ms. Pac-Man, and if you've ever played that, you know you're trying to eat all the dots in a maze while avoiding ghosts who are chasing you. It involves planning routes out several turns ahead and anticipating where the ghosts are going to be, a lot of things that you can't get from a single snapshot without thinking several steps ahead. In its current state, this algorithm didn't do that.
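To make the reinforcement-learning idea a little more concrete, here is a minimal tabular Q-learning update. It is not DeepMind's code; their Atari agent replaced the table with a convolutional network, but the spirit of the update is the same, and the learning rate and discount factor here are just placeholder values.

```python
# A minimal tabular Q-learning update, just to make the reinforcement-learning
# idea concrete. (The Atari agent replaced the table with a convolutional
# network, but the update rule is the same in spirit.)
from collections import defaultdict

Q = defaultdict(float)      # estimated value of each (state, action) pair, initially 0
alpha = 0.1                 # learning rate
gamma = 0.99                # discount factor: how much future reward counts

def update(state, action, reward, next_state, actions):
    # One-step lookahead: nudge today's estimate toward
    # "reward now, plus the discounted value of the best next move."
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])

# Because each update only looks one step ahead, a reward that arrives many
# steps after the action that earned it takes a long time to propagate back
# through the table, which is one way to see why games that demand planning
# ahead are hard for this approach.
```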
And in fact, the game it did the poorest on, a game called Montezuma's Revenge, required much more extensive planning: going to one location and grabbing an object so you could go to another location and open a door. The computer just was not able to make those connections. So we'll add video games to our plot here. Again, more general than image classification, there's more going on, the task is broader, and performance is about human level. Now, you may notice a pattern here. These points fit roughly into a line, or a curve, and we'll see this pattern continue. Look at machine translation, taking text and changing it from one language to another. If you've ever gone to an online translator and typed in a phrase, or copied a phrase from a language you weren't familiar with into one that you were, you've probably noticed that the translation is surprisingly good at getting some of the sense, something that even five years ago was complete science fiction. You've probably also noticed that the result is nothing that a native speaker would ever be likely to say. So it's okay, it's definitely in the right direction, but it's far from perfect. Now, what's really impressive to me is that the state-of-the-art language translation takes over a hundred languages, and instead of having specific models to translate from each language to each other language, all of these languages are translated into a single intermediate representation, which can then be translated back out into any one of those hundred-plus languages: an all-to-all language translator. The sheer scope of that is really impressive. In order to do this, it uses long short-term memory, LSTM, which is a neural network architecture, and it actually uses several deep neural networks in concert: one to carefully ignore parts of the input, one to choose what to remember, one to choose what to forget, one to choose what to pass on. There's quite a bit of computation involved, and the architecture stacks several layers of these. So if we use efficiency as one of our measures of performance, the amount of effort and computing power thrown at this could be considered a bit of a hit, in addition to the inaccuracies. It is worth noting that this uses an attention mechanism. I called out attention earlier as a possible limitation of human performance, but it also proves to be a really useful tool when dealing with a massive amount of information, too much to look at in depth. By pre-filtering and focusing down on what's most likely to be helpful, an algorithm can be much more efficient in how it handles it. So for machine translation: amazing performance, still short of human. For its wildly ambitious scope, it gets a little step up on the generality axis and takes a little bit of a hit on the performance axis. Translation is still a very small part of all the things that humans do, but I would definitely say this is more general than playing video games. Now, looking at recommenders. If you think back to your last experience with an online retailer, probably about one in ten of the recommendations you got was really, really relevant. Some of the others were close but near misses, and some were obviously way out in left field. This is still pretty good. This is a tough task.
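As a sketch of the kind of machinery that is often behind these recommendations, here is a minimal item-to-item co-occurrence counter: recommend whatever people who bought this item also tended to buy. The purchase histories and item names are made up, and real systems are far more sophisticated, but it illustrates the flavor, and its blind spots line up with the limitations discussed next.

```python
# A minimal sketch of item-to-item co-occurrence counting, the flavor of
# logic behind many "people who bought this also bought..." recommenders.
# The purchase histories here are entirely made up.
from collections import Counter
from itertools import combinations

histories = [
    {"toilet seat", "screwdriver", "mayonnaise"},
    {"toilet seat", "screwdriver"},
    {"mayonnaise", "bread", "screwdriver"},
]

# Count how often each pair of items appears in the same customer's history.
co_counts = Counter()
for items in histories:
    for a, b in combinations(sorted(items), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, n=2):
    """Suggest the items most often bought by people who also bought `item`."""
    scores = Counter({b: c for (a, b), c in co_counts.items() if a == item})
    return [b for b, _ in scores.most_common(n)]

print(recommend("toilet seat"))
# Nothing here knows when anything was bought, in what order, or whether one
# toilet seat is already plenty.
```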
Imagine, back when there were video stores, going to one with a friend and trying to guess what that friend, even a friend you knew very well, would want to watch on a given night. You would be hard pressed to do better than one in three or one in four. So one in ten is not terrible, just as a ballpark. It's common among these algorithms to assume that order doesn't matter. The algorithm just looks at everything you've ever bought, today, yesterday, last year, and it doesn't think about how these things are related, or how many you might have, or how many you might need, or how something you bought previously might be related to what you might need tomorrow. It just looks at what people have bought in the past and what they've bought together. It also doesn't adapt to the fact that your selections might change with time. Even if you bought one jar of mayonnaise a year ago, another one six months ago, and another one a few months ago, it might not track the fact that your preference has changed. One of my favorite examples of this came from Jac Rayner, a Twitter user, who said: "Dear Amazon, I bought a toilet seat because I needed one. Necessity, not desire. I do not collect them. I am not a toilet seat addict. No matter how temptingly you email me, I'm not going to think, oh, go on then, just one more toilet seat, I'll treat myself." So recommenders do okay relative to humans, and I would argue that the knowledge of the world required to do really well is pretty deep. So we'll boost them up on the generality scale, and they take a hit on performance. Now we get to robots: finally, something physical bumping around in the world. Self-driving cars. Their performance is impressive. Taken overall, per mile driven, self-driving cars' accident rates are lower than humans'. This is pretty amazing when you consider all of the things a car has to deal with: construction, pedestrians, bicycle riders, changing weather conditions, changing road conditions. They're not perfect, but they're surprisingly good. Now, in terms of generality, there are a few things that make self-driving cars less general than they might at first appear, and in fact one of the biggest tricks to making them successful is reducing the difficulty of the task, and so reducing the necessary generality of the solution. One of the things that happens, especially during training, is that humans still have to take over in some challenging situations. When the human gets uneasy, or the car signals that it doesn't know what to do, it falls back to the human. And while driving is a pretty complicated task, it's still very simple compared to, say, walking on rough terrain while eating a bagel and walking a dog who's pulling on the leash. There's a lot more to consider there, and it's a lot tougher than it is for a car that is statically stable on four wheels, on a road that's flat and mostly straight and mostly marked, and where the rules are well prescribed. To further simplify the task and narrow the scope of what needs to be learned, self-driving cars' driving style tends to be cautious. They definitely do not tend to speed, follow closely, turn aggressively, or do other things that many human drivers do. This is absolutely good practice and should be lauded and held up as a model for all of us to follow. But what it means is that the raw driving skill required by a self-driving car is, in general, less than that required by a human.
It should also be noted that solutions are custom engineered for driving. The selection of sensors, the algorithms used to process them, the way everything is put together, is not updated on the fly. Data is gathered and evaluated by humans, and then, very carefully and deliberately, the heuristics, the rules behind how it's interpreted and processed, are updated, tested, and released again. This makes sense for deploying anything with consequences as high as a car's. But from a machine learning standpoint, it means that the solution is not as general as it seems. It's very specific to a given car with a given set of sensors, and sometimes even to a given environment. Some early self-driving car failures had to do with being deployed in climates they weren't familiar with, for instance. So until their set of training data encompasses all of the conditions in which they will be deployed, they will be even narrower than human drivers. All these things considered, on the task of driving in general, I chose to rate self-driving cars at lower performance than humans. Still, it's physical interaction, and it's interaction with other people in other cars, so there's quite a lot going on, and it's definitely more complex than machine translation or even recommendations. Now, humanoid robots, the apex of cool applications. If you have not yet done it, get on the internet, search for robots doing backflips, and check it out. When you see something like this, it's easy to make the jump to believing that robotics has been solved. When a robot can do physical feats of acrobatics that I can't do, then it's done; I'm ready to call it, and it puts a smile on my face that I can't wipe off. Now, in terms of generality, do another search for robots falling down and you'll be treated to a montage of humorous shorts of robots trying to do very simple things, like open a door, or lift an empty box, or even just stay standing up. They really struggle with this because the systems are so complex, because the hardware and the sensors have so much going on, and because most of these are deployed as research projects, most of these activities are fairly hard-coded and pretty fragile. They make a lot of assumptions about the nature of the hardware, about what's going on, and about the nature of the environment, and if any of these assumptions are violated, the robot fails. So, plotting them here: their generality, in the sense that the types of things they have attempted, taken together as a whole, is now getting to be a non-negligible fraction of the things that humans can do. Maybe it's 0.1, maybe it's 0.01, somewhere in between, but it's an impressive set of things, and some of them are quite hard. Their performance, though, is still sometimes laughably low compared to human level. We can compare humanoid robots as agents to humans and see that their intelligence, by this measure, is much less. More interestingly, our trend is now quite clear. There's a fat line here that runs roughly parallel to, but offset from, the human-level intelligence line. As solutions tend to get higher performance, they also tend to get less general, and vice versa. But it's rare that we get big steps toward the human intelligence line. Now, this is what I think is cool; this is the whole point of this talk. There is one example of such a step that I would like to show before I jump to the conclusion, which is, again from DeepMind, a program called AlphaZero. AlphaZero is like AlphaGo, except everything it knows about Go has been taken out.
It doesn't know the rules of any game now; it just sees visual patterns, tries actions, and learns what's successful and what's not. The way it was used: you can think of a brand new AlphaZero instance as being an infant in terms of gameplay. Two AlphaZero infants were created, and they started to play each other. One was allowed to learn; the other was not. So the one that learned gradually got just a little bit better, stumbling into some good moves by accident, until it became an okay beginner at the game. Then it cloned itself, one of the two learned while the other did not, and they played and played until the one became an intermediate-level player. It repeated this process of cloning and playing itself, with one learning and the other not, and used its intermediate selves as scaffolding to get better and better. It turns out that when using this approach with Go, within four hours it was as good as the best human player, and within eight hours it had beaten the previous best computer, its uncle, AlphaGo. Because it did not build in any rules of the game, it was also able to learn chess and beat the current best chess-playing program, Stockfish, and to learn another board game, shogi, and beat the current best shogi-playing program as well, all of which beat humans by a wide margin. So this is cool, because it achieves both better performance and more generality; it's not specific to any one board game. Presumably, if there were other board games with two-dimensional grids and sets of rules that were not wildly different, it could learn to play those as well. Generality and performance. So what we have now is a point that is both farther to the right, higher performance, and farther up, higher generality, than the original it came from. This is a real increase in the area under that rectangle, an increase in intelligence. This is the direction that we want to go. So it's worth taking a moment to think about what it is that allowed us to step in this direction. Well, AlphaZero made many fewer assumptions about what was going on, and it was also able to practice as many times as it needed to through self-play. In general, assumptions are what prevent generality. They enable performance. If I build in knowledge of the rules of chess, I'm able to take advantage of them much more quickly, but that also prevents me from doing anything that's not chess. If I turn that around, making fewer assumptions might mean it takes me longer to learn to do something, but it means I might be able to learn to do more things. So, some common assumptions. Sensor information is noise-free: we have ideal sensors. That makes sense if we're playing chess. When we sense that a piece is on a given square, we expect that it will be. If we're dealing with, say, a self-driving car, maybe there's a smudge of mud on the camera, maybe the calibration of the lidar is off a little bit. We can't assume ideal sensors when we're interacting with the physical world; there are too many things we can't control for. Another common assumption is determinism: when I take an action, I know that it will have the same outcome every time. Again, this makes great sense when I have a board game. It makes sense if I'm classifying images; if I say an image is an image of a cat, I know that it will be labeled as a cat image, right or wrong. However, if I'm a humanoid robot and I reach for a doorknob, the motor might not perform the way I expect.
My feet might slip on the ground. I might have unanticipated challenges to my balance. The action may not turn out exactly as I expect, and I need to be able to adapt to this. Another really common assumption is unimodality: all of the sensors are the same type. This is an assumption in convolutional neural networks, for instance. They're great at bringing in a two-dimensional array of information that is all of the same type, all pixels or all board squares. A general solution needs to not make this assumption. Another assumption is stationarity. This is a very common one: the world doesn't change. The things that I learned yesterday are still true today; the things that I learned five minutes ago are still true right now. We have to make some assumptions about continuity, otherwise what I learned yesterday doesn't do me any good at all, but we also need to allow for the fact that the world changes a little bit. Maybe the lubrication in my ankle joint is a little low, so it will respond differently than it did yesterday. Maybe there are clouds covering the sun, so the lighting conditions I learned to operate in yesterday have changed as well, and I'll need to be able to adapt to that. Another common assumption is independence: the world is not changed by what I do to it. Physical interaction violates this entirely. If I am a robot operating in a household and I bump into a chair and scoot it six inches sideways, then whatever map I've made of that house will need to change a little bit; I have changed it myself. If I pick up a mug and move it from this table to that table, I have changed the position of that mug. The things that I do change the world, and I need to keep track of that, and any algorithm I use needs to be able to account for it. Another common assumption is ergodicity: everything I need to know in order to act, I can sense right at this moment. This is also known as a Markov assumption, and it too is commonly broken in physical interaction. For instance, if I can sense position, that's great, but that doesn't tell me anything about velocity, and sometimes I need to know velocity to know how to respond to something. Another very common assumption is that the effects of my actions become apparent very soon. This does not hold true, for instance, in chess, where the opening move will affect whether or not I win many, many time steps later. There are tricks for handling this in chess, for instance assigning point values to intermediate positions of pieces on the board, but in physical interaction it's much more difficult to know, given a set of actions I could take right now, which is most likely to result in something desirable five minutes from now or a day from now. All of these assumptions are very common in the algorithms currently being used that we call AI. These algorithms are not sufficient for achieving human-level intelligence; these assumptions will prevent them from doing that. And one thing that all of these assumptions have in common is that they do not hold when you're working with humanoid robots, or in fact any robot that's physically interacting with the world.
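As one small, concrete illustration of that ergodicity point: an agent that only sees the current position reading can't distinguish an object rushing toward it from one sitting still. A common workaround is to keep a short history of readings and fold an estimated velocity into the state, much as the Atari agent stacked a few recent frames of pixels. The numbers and time step below are made up.

```python
# A single position reading says nothing about velocity, so an agent that
# only looks at the current sensor value can't tell "object at 1.0 m,
# approaching fast" from "object at 1.0 m, sitting still." One workaround
# is to keep a short history and fold it into the state. Values are made up.

dt = 0.1  # seconds between sensor readings

def augmented_state(position_history):
    """Turn a history of raw positions into a (position, velocity) state."""
    current = position_history[-1]
    previous = position_history[-2]
    velocity = (current - previous) / dt  # finite-difference estimate
    return current, velocity

print(augmented_state([1.4, 1.2, 1.0]))   # -> (1.0, -4.0): closing in fast
print(augmented_state([1.0, 1.0, 1.0]))   # -> (1.0, 0.0): sitting still
```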
So my proposal is that focusing on physical interaction is a great way to force us to confront these assumptions, to find out which ones we can bend, to find out which we can avoid altogether, and to drive us to create algorithms that are less fragile and able to accommodate a much more general set of tasks. That will then take us one step closer to human level intelligence.