Cool. Thanks for having me, and especially thanks for having me first, so I can touch on a lot of the different topics and technologies that the other speakers are going to talk about. It's a wonderful day outside to be talking about how to take over the world.

So it's a pretty ambitious, crazy, deep, and maybe even ridiculous title, "From Cups to Consciousness", but let me justify it. On the right is consciousness. It's mysterious and elusive; it has eluded philosophers for thousands of years, neuroscientists and psychologists all find it very confusing, it's very debatable, and no one really understands it. At the same time, it's been the holy grail of AI research since the 50s: true thinking machines which are conscious of themselves. So it's a bit complicated. On the left, cups. They're quite simple; everyone understands cups. And more specifically, by cups I mean cup picking: the literal act of a robot bringing its gripper to the cup, closing the gripper, and lifting. It seems simple to us, but computers and robots have struggled with this task for the last seven decades.

Before consciousness, we can introduce a term called artificial general intelligence. This is maybe a bit simpler, more formal, more quantifiable. I'll get to it later on, but generally it's a task-based view of AI: a machine which could do thousands of tasks that humans can do would be considered an artificial general intelligence. One of these tasks could be, for example, setting and clearing the table robustly, or cup picking. So I've laid out the start and the end. We know the journey, we know the start, we know the end; we just have to fill in the gaps now.

So what I'm going to be talking about today is M-Tank, the non-profit I founded. I've given individual talks on computer vision, multimodal methods, and reinforcement learning, but here I'm trying to condense it all into one talk so I can give you a whirlwind tour of a lot of different AI technologies. So forgive me for being brief in parts. Then we're going to work backwards from consciousness to AGI to cups. This leads us to robotics. I'll talk about our goals and, eventually, a field called SLAM.

So, M-Tank: we're a remote team of AI researchers, some in Germany, some in Ireland, Luxembourg, and Italy, and really we just create unique resources for everyone to learn about AI. The first project, AI Distillery, I won't have much time to talk about, but essentially we're using AI to map AI progress: using AI to understand how fast AI is progressing. And the second project is the topic of this talk. These two projects map onto our two visions. The first: model and distill knowledge within AI. That's hopefully what I'm doing right now, explaining AI topics to every type of audience. And second, we want to make some progress towards creating truly intelligent machines. I think cup picking is a necessary step towards that goal. You can find us on our website, Medium, or Twitter.

Okay, so let's begin. Deep learning, I'm sure you've all heard of it. It's an old algorithm, actually seven decades old, and it's loosely biologically inspired. And it's the dominant algorithm in computer vision, natural language processing, and reinforcement learning. On the top right, you see LeNet from 1998, and on the bottom, AlexNet from 2012. So why is this such a popular algorithm? In 2010 and 2011, every algorithm that competed in the ImageNet challenge used handcrafted features.
And in 2012, all of this changed when AlexNet came along and massively increased the accuracy. Since then, from 2013 onwards, every team uses deep learning in this ImageNet challenge, which requires machines to categorize 1 million images into 1,000 classes. Is it a cup? Is it a dog, a German shepherd, et cetera. And there are really four reasons why these algorithms are doing so well, and why you've seen so much AI hype since 2012: algorithms, like convolutional neural networks; big data, I mentioned the 1 million images in ImageNet; infrastructure, libraries like TensorFlow and PyTorch; and compute, GPUs. It can take weeks or months to train these neural networks, and the more GPUs you have, the faster they train.

So our first report, about 2016, was "A Year in Computer Vision". We covered 15 different subfields of computer vision, which you can see on the right: fields like object detection, segmentation, human pose estimation, et cetera. Just to give you a quick overview of the bread-and-butter tasks in computer vision. Classification is pretty simple: given an input image, we want to assign the correct class label to the whole image, for example "cat". Classification plus localization: we also draw a box around the cat to find out where it is in the image. Object detection takes this even further: now we're trying to classify and localize many objects in the same image: cat, dog, and duck. And then there's instance segmentation, which goes further still: we're now trying to colour in every pixel of every object; essentially, we're classifying every pixel. And on the right we see a face detection example, which is object detection with one class, and it's pretty impressive given how many tiny faces there are in this image. We concluded the report, and I won't have time to cover the other parts, with a look at the future of computer vision and how it's going to affect many different fields: automotive, consumer, robotics, and medical, for example.

Okay, so that's vision. Next, language. What's happened in the last few years is that we've been able to use neural networks to embed words into a high-dimensional space. We can then visualize this space in two dimensions and find that "toilet" and "bathroom" are close together in the space: similar words appear close in the space. That's been super useful, because we can then plug these embeddings into recurrent neural networks, which deal with sequences; they're able to take sequences of these embeddings. One task is language modelling: given a sentence like "I hate this", the model predicts the next word, which might give "I hate this movie". That's what a language model does.

Another advance in natural language processing since 2013 has been sequence-to-sequence models. These take a sequence on the left, for example "are you free tomorrow", feed it into an encoder RNN, a specific type called an LSTM here, which embeds the entire sentence into a single vector. Finally, a decoder RNN outputs a reply: "yes, what's up?". And if you had the data, the same sequence-to-sequence model could be used for machine translation: it could output "are you free tomorrow" in French on the right-hand side. We're all familiar with Alexa, Siri, and chatbots; a lot of them use this technology.
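To make that word-embedding idea concrete, here's a minimal sketch. The vectors below are made up for illustration; a real model like word2vec or GloVe would learn hundreds of dimensions from large text corpora.

```python
import numpy as np

# Toy 4-dimensional "embeddings" (invented for illustration only).
embeddings = {
    "toilet":   np.array([0.9, 0.1, 0.8, 0.0]),
    "bathroom": np.array([0.8, 0.2, 0.9, 0.1]),
    "cup":      np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine_similarity(a, b):
    """Closeness of two vectors: near 1.0 = very similar, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["toilet"], embeddings["bathroom"]))  # high
print(cosine_similarity(embeddings["toilet"], embeddings["cup"]))       # lower
```

With real learned embeddings, this same similarity measure is exactly what puts "toilet" and "bathroom" close together in the space.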
Okay, we've covered vision and we've covered language. What about vision and language combined? So we released a report called "Multimodal Methods", covering visual speech recognition and image captioning, among other things. I'll go through those quickly now.

So, lip reading. Visual speech recognition means that just from the lips, an image or video of the lips, we're able to predict what the person is saying. No audio input. And this is pretty impressive; recently it's done better than humans. What surprised us most is that humans are actually quite bad at lip reading in the first place. Then deep learning comes along and changes this field forever, since 2016. Here's an example from a BBC dataset, where a model is able to use just the lips to predict what the person is saying, using a combination of convolutional and recurrent neural networks, CNNs and RNNs. In this example it pretty much gets the message across, and since 2016 it's improved a lot more.

Could I have a volunteer from the audience to describe this image? Anyone, just shout it out. "A little girl sitting on a bench holding an umbrella." Pretty close. And that's the goal of image captioning, and that's what computers can do now. Before outputting the word "girl", we can see that the model is actually attending to the part of the image where the girl is. This gives us interpretability, and gives us confidence that it's looking at the right part of the image. Other examples: in the middle, "a zebra standing next to a zebra in a dirt field", or in the bottom right, "a man riding a bike down a road next to a body of water". Pretty impressive. And generally what we do here is take the same sequence-to-sequence model I showed you and replace the encoder RNN with an encoder CNN. Because of this, I like to compare deep learning to Lego: we can take out a part, replace it with another part, and get really impressive results. Essentially, we're translating the image into English.

Okay, so vision is very helpful, and language too. But intelligence isn't just perception; it's control as well. A typical example of this is animals. Here, a dog is trying to get a stick past two stumps, and it's failing dramatically. And on the right is a simulated example of the same thing, an "Animal Olympics" where we're trying to train simulated animals to do the same task. So really what I'm saying is that truly intelligent agents need to be in the real world, they need to have a body, and they need to make decisions; more specifically, many decisions sequentially. And the framework for this is reinforcement learning.

We released a course on this. You can find it on our website, over 100 slides, along with a video of me talking through it. So if you want to go into reinforcement learning in more depth, check it out. Part one covered many different approaches, model-based, model-free, et cetera, and part two combined deep learning with reinforcement learning. And here's an example of Grid Universe, a kind of grid world that we created, which you can find on GitHub, with levers in the bottom left that open doors, as well as melons, lemons, and apples, and mazes in the top right.

And just a few examples of what reinforcement learning can do. In 1992, we beat world-champion-level players at backgammon. And sadly, Lee Sedol here on the right is being told, in 2016, that he's lost the game against AlphaGo. Since then, AlphaGo Zero and then AlphaZero have come out. More examples: 2D video games. And just to explain what reinforcement learning does: given a reward, the agent is trying to maximize the score, to maximize the long-term reward over the episode in the environment. It just has to learn the right actions. We don't tell it what to do; we just give it reward when it does the right thing, and it learns to behave correctly.
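To make that loop concrete, here's a hedged, minimal sketch of tabular Q-learning on a toy corridor. The environment is invented for illustration, not our Grid Universe code: the agent only ever receives a reward for reaching the cup, yet it learns to walk towards it.

```python
import random

# A tiny 1-D corridor: states 0..4, with a cup (the reward) at state 4.
# Actions: 0 = step left, 1 = step right.
N_STATES, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N_STATES)]      # Q[state][action] value table
alpha, gamma, epsilon = 0.1, 0.9, 0.1          # learning rate, discount, exploration

def greedy(qs):
    """Best action, breaking ties randomly (important early on, when all Q = 0)."""
    best = max(qs)
    return random.choice([a for a, q in enumerate(qs) if q == best])

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0  # reward ONLY for reaching the cup
    return next_state, reward, next_state == GOAL

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit what we know, sometimes explore.
        action = random.randrange(2) if random.random() < epsilon else greedy(Q[state])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge towards reward + discounted best future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print([greedy(q) for q in Q[:GOAL]])  # expected: [1, 1, 1, 1], i.e. "walk right"
```

Nothing in that code says "go right"; the behaviour falls out of the reward. Scaling this idea up, with neural networks in place of the table, is what deep reinforcement learning does.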
Then there's simulated 3D robotics. Here's a stick figure from DeepMind that's able to run forward. We don't tell it what its body is or how to move forward; we just give it reward when it makes progress, and so it even learns to jump over chasms, which is pretty impressive. And then there's a really fascinating field called language grounding, which combines vision, language, and action together. Here the goal is to "pick the green object next to the red object". The agent has to ground itself in what "green" or "red" or "next to" means. On the right we see "go to the tall pillar", "go to the green short object", "go to the green object". I find it great because it combines so many different fields, and the agent is actually understanding, in some sense, what these words mean.

Okay, so: vision, language, action. What's next? Let's work backwards from consciousness to AGI and then to cups, and build a plan.

This slide is going to get deep, I'm just warning you. "If the brain creates a kind of perceptual radio program and uses that to orchestrate the behavior of the organism, what is listening? What is listening to the voice in your head? Rather than the universe itself, as some panpsychists believe, or some entity outside the physical universe, as dualists claim, I'd like to suggest that conscious experience is a model of the contents of our attention. It is virtual, part of the model the organism simulates of itself, produced by an attentional conductor." That's what Joscha Bach wrote in one of his recent papers, laying out his theory of what consciousness is. But it's a bit deep; you could think about it for a few decades and still be confused.

So let's take a step back. Artificial general intelligence: this is far more formal. Or even just "intelligence": the ability to acquire and apply knowledge and skills. Shane Legg and Marcus Hutter, in 2007, wrote a paper trying to formalize what machine intelligence, animal intelligence, and human intelligence are. I'll summarize it here: it's the task-based view of AGI I talked about previously. Create a large number of tasks, for example cup picking, and general agents are those that are good at all of these tasks. It's very quantifiable, it's very formal, and it's not debatable whether the machine can pick up the cup or not. It doesn't need to have consciousness to pick up the cup. And maybe, for the purposes of this talk, doing 80% of human tasks is getting close to AGI. And the first task is obviously picking up cups.

We have a blog series on this. In part one, we gave the kind of justification I'm giving right now. In part two, we covered many different simulations and how to get to the real world with real robots. And part three, which we're releasing this week or next, is about mapping your home with SLAM. This leads us quite quickly to robotics. Here's Atlas on the right doing a backflip. But robots could eventually do many more tasks, for example opening doors, or cooking, or eventually anything humans can do. I personally can't do a backflip, so Atlas has one up on me.
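As an aside, before moving on: the Legg and Hutter formalization from a moment ago can be boiled down to one equation. This is a paraphrase of their 2007 "Universal Intelligence" paper, so check the original for the precise construction:

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

Here \(\Upsilon(\pi)\) is the universal intelligence of an agent \(\pi\), \(E\) is the set of computable reward-giving environments, \(K(\mu)\) is the Kolmogorov complexity of environment \(\mu\) (so simpler environments count for more), and \(V^{\pi}_{\mu}\) is the expected total reward the agent earns in \(\mu\). Intelligence, on this view, is literally a weighted score across many tasks, which is why cup picking can sit on the same scale as everything else.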
And generally, what's happened over the previous decade: there were a lot of classic methods for doing speech recognition and computer vision. It took hundreds of thousands of lines of code and thousands of engineers building handcrafted components to be able to recognize speech. Since then, all of it has generally been replaced by one big deep neural network. The question remains whether the same could happen for robotics. Maybe perception, world model, planning, and control should stay separate, rather than being one big neural network. That's up for debate.

Next, I'll talk about why simulation. Essentially, I can't afford an Atlas robot; it's worth $500,000 or more. But I can simulate thousands of Atlas robots, and faster than real time. The main thing we want to do then is transfer from the simulation to the real world; that's a field called sim-to-real. So here are a few examples of these simulations I'm talking about. It's a lot to take in on this slide, but I wanted to blow your mind a bit. Top left is the Gibson environment, where you can have a humanoid or a car running around. Top right is the Habitat API, for navigation tasks, getting from A to B. Bottom right is RobotriX: a person puts an apple or an orange into a bowl with an Oculus headset, and then the robot has to do the same task, et cetera.

So, out of all those simulations, the first one we chose was AI2-THOR, mostly because it had cups. And with language grounding, here on the bottom right, we're trying to pick as many cups as possible off the ground in the bathroom using reinforcement learning. You can check out our repository on GitHub if you're interested.

And our first real robot was Vector from Anki, which was very impressive because it fit everything into one tiny, tiny robot: a quad-core CPU, voice recognition, vision, et cetera. And now that we had our first robot, we wanted to simulate it. Using the PyBullet physics simulator, I created a game called Cup Carnage (there's a short PyBullet sketch a little further down, to show how little code a simulation takes). Ten Vectors face another ten, and the cups fall mid-battle to cause carnage. There's no real point to this, but what I'm showing here is the power of simulation: I can simulate any situation that comes to my mind. And as you can see, it gets quite violent.

Another real robot, which we bought from a failed Kickstarter campaign, is the Rigibot 2. On the left, I'm controlling it with an Android app to pick up a cup manually, and you can see the cup has been picked up. In the middle is a Flask server app where I can see through the eyes of the Rigibot and control it as if I were the robot. Other robots we've been working on: Fernando, in Germany, has attached a Braccio arm onto a monitor arm, so it's an arm on an arm, on top of a trolley with a selfie stick so he can wheel it around to pick up objects. And on the right, I attached my depth camera with duct tape to a drawer, which I can also wheel around. Pretty simple.

So why am I telling you all this? Our goals: we want to create a useful robot. Anki, the maker of Vector, actually let all of its roughly 200 employees go a few months ago. A lot of robotics companies are failing because it's hard to find the economic value, and I'd argue maybe it's because these robots aren't useful. Low cost: the Shadow Hand, which is just a hand, costs $100,000; Pepper costs $14,000. That's a bit ridiculous, so we're aiming to make a robot for below $1,000. And I mentioned general: we don't want it to just vacuum the floor, like a Roomba; we want it to vacuum the floor and take out the garbage. And robust: by that I mean it accomplishes the task 99% of the time. You wouldn't use it otherwise.
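Here's that PyBullet sketch I promised: a minimal, hedged example of spinning up a physics simulation. The model loaded ("duck_vhacd.urdf", one of the demo assets bundled with pybullet_data) is just a stand-in for falling cups; the point is how few lines a simulated world takes.

```python
import pybullet as p
import pybullet_data

# Headless physics; swap p.DIRECT for p.GUI to watch it.
p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())  # bundled URDF models
p.setGravity(0, 0, -9.8)

plane = p.loadURDF("plane.urdf")
# Drop a demo object from one metre up -- a stand-in for our falling cups.
duck = p.loadURDF("duck_vhacd.urdf", basePosition=[0, 0, 1])

for _ in range(240):  # one simulated second at the default 240 Hz timestep
    p.stepSimulation()

pos, orn = p.getBasePositionAndOrientation(duck)
print("Landed at:", pos)
p.disconnect()
```

From there, a game like Cup Carnage is "just" more loadURDF calls, motor commands, and collision checks; the simulator gives you the physics for free.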
And with that, we're announcing our startup. We're trying to make a household consumer robot, and really we're trying to reimagine robotics. Imagine if your fridge went over to the coffee machine to make you coffee. And it's going to be affordable, useful, and robust. So we aspire to make one thing for everything you need at home. A lot of people wish they had one extra hour in the day; the day is too short. We're going to try to give you that extra hour by automating the tedious tasks you don't want to do.

And what have we accomplished so far? Mapping is done; I'll talk about that briefly next. Manipulation: I've been building these arms, and it's in progress; that could take us a few months. Navigation is relatively simple compared to manipulation. Then one task: fetching objects. Fetch me a beer; fetch me my keys; where are my keys? And eventually, to really get to the point of this talk, generality: five tasks, and beyond. We know the journey we have to take; let's just follow along and see how far we get.

So, mapping is a perception problem. I won't have too much time to cover this, but essentially we're trying to work out where the agent is, and map the environment, as it moves through that environment. And that's super useful for finding a cup, or anything else, for many tasks. Generally, what's done is we track points, the same landmarks across frames, and from these you can work out where the camera went and where those points are. Parts three and four of the blog will be on SLAM. Here's an example of RTAB-Map in my house, going around tables. It's building a 3D model of my house, in the middle, in real time. Top left is the image coming in; bottom left is odometry, working out the difference between two frames; and bottom right is the top-down view of the map of my house being built in real time, kind of like an architectural view. And this is me wheeling drawer-bot around. So this works really, really well. And generally, SLAM systems have a front end and a back end. The images come in, and the front end estimates the relative pose between them to work out how far the camera has moved; this goes into the back end, which builds a globally consistent map. This is super useful again for tasks. I won't have time to cover it properly, since time is short, but check out parts three and four of the blog, which are being released soon, if you want more details.

Okay. So navigation and manipulation, that's next. Generally, you can have classic approaches, like path planning with A* search; data-driven approaches, like I mentioned, with reinforcement learning; or hybrid approaches. You could recognize a cup with a CNN and then use path planning to get there (there's a toy A* example below). And more specifically, you can use a CNN to find the object and find the grasping points. Then you need to move to the goal: you make a motion plan and actually follow that plan. And then you have to close your gripper and lift up the cup. Seems simple, but robotics has struggled for decades on this task.
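That toy A* example, hedged and minimal: a 4-connected occupancy grid where 0 is free space and 1 is an obstacle. The map and the Manhattan heuristic are standard textbook choices, not our actual planner.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected occupancy grid (0 = free, 1 = obstacle).
    Returns a list of (row, col) cells from start to goal, or None."""
    def h(cell):  # Manhattan distance: admissible on a 4-connected grid
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]  # (f-score, cost so far, cell, path)
    best_cost = {start: 0}
    while frontier:
        _, cost, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            r, c = nxt
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt, path + [nxt]))
    return None

room = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
print(astar(room, (0, 0), (3, 3)))  # e.g. robot base -> the cell with the cup
```

A hybrid system would get `goal` from a CNN detection and hand the resulting path to a motion controller; the planning itself is decades-old, reliable machinery.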
Okay, so that's been quite a journey. We began at cups and ended at consciousness, and I've been trying to fill in the gaps. And now I can explain why we chose the juxtaposition of these two terms. Consciousness keeps us very ambitious; it's an aim-really-high kind of goal: even if we fail, we'll get halfway. Whereas cups keep us very humble: the field as a whole, and our team, are struggling to do the simplest task in robotics, to pick up a cup. Personally, I'd find it pleasant if a robot made me tea, but maybe that's just me.

Anyway, to conclude, the one thing I'd like you to take from this talk is that I'm trying to introduce you, to invite you, to the conversation that's happening worldwide. The more informed we are, the better we can predict how these technologies will unfold in the future. For example, a lot of people instantly jump to a Terminator scenario. Maybe; maybe not. Or it could just be creepy robots, like Sophia. Or robots will just keep failing for the next 100 years. Or maybe it's all just hype. Or maybe it's human-machine symbiosis, a techno-utopia, a happy ending for all, because why else do we make robots if not to help us do the things we don't want to do, or can't do? I don't know how the future will unfold; maybe no one does. But let's aim for this future by building it. Thank you.

That was an incredible journey through all of the different aspects that need to be done. Quite humbling, I suppose. But I'm going to ask you a couple of questions. First question: you've set up a not-for-profit, and you have this very ambitious goal of a sub-thousand-dollar robot which can generally help around the house. How much do you think it will cost to get there? And how are you going to find the money?

Okay, so we're aiming for the robot to be less than $1,000, cheaper than the latest iPhone. How to get there? Since it's so cheap, it won't cost us too much. I left Accenture recently, and I have a two-year runway to just make this happen. So follow along with the blogs to see how far we get.

Great. Second question, slightly more challenging. Although it's humbling, we are making progress towards these artificial general intelligence machines and computers, so it raises the question of the singularity, which we discussed at length in a couple of previous Predict events. What are your thoughts on the singularity? Is it inevitable? And the singularity is when computers become more intelligent than humans and start to do things that we cannot predict, maybe. So do you think it's inevitable? And if so, when will that happen?

It probably is inevitable; the real question is just when. As I mentioned, we're going to be programming these machines for a long, long time, and we'll be able to specify exactly what happens. But at some point in the future, maybe they'll recursively improve themselves. Ray Kurzweil predicts 2045 for the singularity, and 2029 for a human-level machine, and he's been pretty accurate in his past predictions. So I'll go with that, but no one really knows.

So is it 2045? 2045 for the singularity. Okay, there you have it. Thank you, Ben. Excellent talk. Let's thank Ben.