OK, I think we should make a start. Good evening, ladies and gentlemen. My name is Peter Bruce. I'm the Physical Secretary of the Royal Society and one of the vice presidents. It's a great pleasure for me to welcome you all to Carlton House Terrace, the home of the Royal Society, and to this Bakerian award lecture. Now, let me just say a little bit about the Bakerian medal and lecture. It's one of our premier lectures; indeed, it is the premier lecture and medal in the physical sciences. It was established through a bequest of £100 from Henry Baker, a fellow of the Royal Society, and I shall read you the remit, if you like, for the Bakerian, as written at that time: "For a narration or discourse on such part of natural history or experimental philosophy, at such time and in such manner as the president and the council of the society for the time being shall please to order and appoint." You'll gather that it's not a recently introduced lecture and medal: it originated in 1775, so it's been around for quite some time. And as well as the medal, it's accompanied by a gift of £10,000.

Now, it's a real pleasure for me to introduce the medal winner and our lecturer this evening, Andrew Zisserman. Andrew is one of the principal architects of modern computer vision. His work in the 1980s on surface reconstruction and discontinuities is widely cited. He is best known for his leading work in the 1990s establishing the computational theory of multiple view reconstruction and developing practical algorithms that are widely used today. This culminated in the publication of a book in 2000, at the turn of the century in other words, with Richard Hartley, now regarded as a standard text. His laboratory in Oxford is internationally renowned, and its work is currently shedding new light on the problems of object detection and recognition. So without any further ado, it's him you want to hear from, I'm sure, not me. Let me welcome Andrew to the stage to present his lecture; you see the title there, Computer Vision: Learning to See the World. Andrew.

Okay, good. Thank you for the introduction, Peter, and thank you for the prize. It's a great honour for me, and it's a great honour for the computer vision field as well, so I'm very grateful. I'm going to talk on computer vision, learning to see the world, and I'm going to start off by saying what computer vision is. The aim of computer vision is to extract visual information from images and videos so that the computer can understand images and videos much like a human would understand them. That's what we aim to do. So in this image here, we'd like to be able to answer questions like: what is in the image, the objects, in this case the person that's in the image; where are the things in the image, meaning the spatial layout of the scene, the pose of the person; and then, what is happening. Nothing much is happening at the moment, but if I play the video, it's more interesting. That's a clip from Singin' in the Rain. The objective is to be able to carry out tasks like this, to answer these questions. The field has made quite some progress. We can do quite a few things now, especially since the advent of deep learning. I'll show you some examples. For the first one, I'm going to show you a video sequence in the top left and then various views of the 3D layout of the people and what they're doing as they collide with each other. That's the "where" task. The next one I'm going to show you is the "what" task.
Here you're seeing object detection and tracking. These boxes are detecting the objects; following them through the video, that's tracking. There's also recognition going on. Maybe you can read the labels, but it's recognising various animals and objects. It recognises the tiger, the deer, also some vehicles you're seeing at the moment. All of these are being recognised directly in the image and then tracked. That was, sort of, what is in the image, what is in the video. The next one I want to show you is action. We can recognise human actions now, and we can also recognise actions of animals. I'm going to show you animal examples; what you're going to see is chimpanzees. This is how computer vision can support other fields. Zoologists have hundreds of hours of videos that they want to analyse and annotate. With computer vision you can annotate the actions, the behaviour of the chimpanzees, automatically all through the video. Here you're going to see the behaviours of nut cracking and eating.

Each of these tasks I've shown you, recognition tasks of actions or of what's in the image, has been done by a deep learning model. A deep learning model for a visual task is trained in three steps. The first step is that you construct a very, very large data set of images or videos that you label for the task you want, like recognising what's in an image. Then you choose or design a deep model, this is where the deep part comes in, a neural network. The third step is to train the model's parameters on this data set. You train it by predicting the labels on the data set. I'll show you an example. For the recognition I was showing you earlier, recognising the animals and the vehicles, we were using a network which had been trained to classify a thousand different object categories. This is a picture of the network. It takes in an image, in this case an elephant, and the output is a choice of one of the thousand categories it's been trained to recognise. This network has 30 million parameters. To train it, you need a very large data set, and it was trained on a data set of a million images: a thousand images for each of a thousand categories. What that means is that somebody had to find a thousand images of monkeys like this, a thousand images of dogs, a thousand images of elephants, an enormous amount of work, and that's what the network was trained on.
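To make that traditional recipe concrete, here is a minimal sketch of the three supervised steps just described: a labelled data set, a deep model, and training by predicting the labels. The tiny stand-in model and the random tensors are illustrative assumptions, not the actual 30-million-parameter network from the lecture.

```python
import torch
import torch.nn as nn

num_classes = 1000                     # one of a thousand categories
model = nn.Sequential(                 # stand-in for a deep convolutional network
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, num_classes),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step on a (hypothetical) batch of labelled images.
images = torch.randn(8, 3, 224, 224)   # batch of 8 RGB images
labels = torch.randint(0, num_classes, (8,))  # the manual labels
logits = model(images)                 # scores over the 1000 categories
loss = loss_fn(logits, labels)         # supervised: predict the labels
loss.backward()
optimizer.step()
```

The expensive part of this recipe is not the code: it is the million manually labelled images behind `labels`, which is exactly the step the rest of the lecture removes.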
Now I'm coming to the core of this lecture. This is not how a baby learns to see. A baby is not shown a thousand images and told "this is a dog", then a thousand images, "this is a cat". That's not how a baby learns to see. What I'm going to do in this lecture is explore how computers can learn to see more in the way that an infant learns to see, and that is to learn directly from the data. That's what I'm going to show. What this means is that we have these three steps for learning a visual task, and we're going to throw away the first step. We're not going to have a large data set that somebody has to construct. Instead, we're going to obtain the supervision to train the network directly from the data. Now, this is not a new idea. Turing in 1950 wrote a paper, and he said: instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain. So we're just following what Turing suggested in 1950. It also ties in with what's been found by psychologists who study cognitive development in infants.

What they found is the importance of data in order to develop intelligence: data from the physical world is needed. This paper, which is very good to read, gives six lessons for developing intelligence in children, and here they are for us. Lesson number one is: be multimodal. So I'm going to be multimodal. What I'm going to do next is show how machines can learn directly from the data, without having to have a labelled data set. I'm going to particularly pick out the theme of correspondence between modalities, and that's what we're going to learn from. I'm going to show three different examples. This method of learning from the data is called self-supervision.

For the first one, the modalities will be audio and visual, and I'm going to learn from the correspondence of those. Example number one. Here's an example, first of all, of what a synchronised signal is. This is what we have in the world: we have these synchronised signals coming in, the audio and visual are synchronised. We can, of course, make them unsynchronised by shifting the audio, and then it sounds like this. So we get this synchronised signal for free, and we can also make it unsynchronised. Just this difference is going to be the supervision we're going to use to train a network. I'm going to illustrate this with talking heads, for two reasons. One, if we think of the Turing baby, what is the baby going to see first? It's going to see its parents speaking. That's one reason. The second is that we as humans are very, very sensitive to lack of synchronisation in talking heads. I'll show you an example. "I was in the camps yesterday talking to people. There are 1.3 million earthquake survivors still living in those crowded camps." I hope you can see that. That was, how shall I say, very annoying.

Now to how we're going to train the network. We're going to train it to tell whether the lip motion in a video sequence is synchronised with the audio or not. That will be its task. We're going to give it a video clip and the audio and ask: are these synchronised? That's going to be the training signal. The network itself is going to have a part which takes in a video clip, this part is going to be called a visual encoder, and produces a vector. We're also going to have an audio encoder that takes in an audio clip and produces a vector. These vectors are lists of numbers, maybe 512 numbers, but you can think of each just as a point in 3D. The visual encoder predicts a point; the audio encoder predicts a point. The training is going to be that if the audio and the visual are synchronised, we want these points to be close together, and if the audio and visual are not synchronised, we want them to be far apart. It's very, very simple. Where do we get the data from? We get it from just the natural signals coming in. We take some frames and the corresponding audio: that will be synchronised, and we'll call that a positive sample. We can get any number of positive samples from videos of talking heads. If we take the frames and apply a temporal displacement to the audio, the audio and visual will no longer be synchronised: that will be called a negative sample. We can generate these positive and negative samples effortlessly from any video of a talking head coming in. We can get millions of these. That will be our training data.
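As a hedged illustration of that two-encoder set-up (not the lecture's actual architecture), here is a minimal sketch; the stand-in encoders, feature sizes and margin are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # "a list of numbers, maybe 512 numbers"
# Stand-in encoders: the real ones would be convolutional networks.
visual_encoder = nn.Sequential(nn.Flatten(), nn.Linear(5 * 3 * 112 * 112, D))
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(13 * 20, D))

def contrastive_loss(v, a, is_sync, margin=1.0):
    """Pull synchronised pairs together, push shifted pairs apart."""
    d = F.pairwise_distance(v, a)
    pos = is_sync * d.pow(2)                         # synced: distance small
    neg = (1 - is_sync) * F.relu(margin - d).pow(2)  # shifted: at least margin apart
    return (pos + neg).mean()

# Positive sample: 5 mouth-region frames with their own audio features.
frames = torch.randn(1, 5, 3, 112, 112)
audio = torch.randn(1, 13, 20)
# Negative sample: the same frames, audio from a temporally shifted window.
shifted_audio = torch.randn(1, 13, 20)

v = visual_encoder(frames)
loss = (contrastive_loss(v, audio_encoder(audio), torch.ones(1)) +
        contrastive_loss(v, audio_encoder(shifted_audio), torch.zeros(1)))
loss.backward()  # gradients flow into both encoders
```

Note that no human labelling appears anywhere: the positive/negative labels come from the temporal shift itself.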
We take this network, which has a visual encoder producing a vector and an audio encoder producing a vector, and train it on all these samples, which we know to be synchronised or not synchronised because we created them. We train it so that it can tell: if the pair is synchronised, the points it produces are close, and if they're not synchronised, they're not close. Imagine we've done that and trained this on millions of examples. What we'll end up with is points which are close together when the signal is synchronised. What I'm showing here is called an embedding space. This is where the vectors we've produced live; I'm showing it in 2D. When we have a synchronised signal, the vectors produced by the video encoder and the audio encoder are close, like this. If they're not synchronised, then they won't be close. That's what it has learnt; that's all we've told it to do, and we've trained it just on that.

Now I'm going to show you, once we've done this, what we can use it for. The first thing we can use it for is to synchronise audio and visual signals when they're not synchronised. The way this works is that we know that when they're not synchronised, the audio and visual embeddings are going to be distant. We can then shift the audio, and if we shift it and the embeddings become close, then they'll be synchronised. We start with something that's unsynchronised, shift the audio until these vectors become close, and then it will be synchronised. What that means is we can take the annoying out-of-sync example and synchronise it, to this: "... of heavy rain and probably four or five hours of heavy rain ahead. I was in the camps yesterday talking to people. There are 1.3 million earthquake survivors ..."

More importantly, we can also use this network that we've trained to find where in the image or the video the person speaking is. The way this is going to be shown is that we're going to have a video and an audio track, and we can produce what's called a heat map, which is hottest where the person speaking is. We're going to use this for localisation. This is a bit more technical, but the way it works is that inside the network there's a spatial grid of vectors. This spatial grid of vectors corresponds to the spatial grid of the pixels. We can take the audio encoding vector and pick out the vector in the spatial grid which is closest. The one which is closest will be the one which is most synchronised, and that will be where the speaker is. I'll show you some examples. On the left you'll see the heat map, and on the right you'll see a box around where the speaker is. "It's the perfect place to come if you want to see old roses looking their absolute best and the very latest ..." "... is about to take place, and here's another clue." The next thing we can do is track somebody using their voice. We've got multiple people speaking on and off, and maybe some people are moving their lips but they're not speaking, they're yawning or laughing. We can pick out the active speaker by this signal, because we can find the pixels, if you like, which are synchronised with the voice we're hearing. "... private concerns on US relations with those countries and also FOIA freedom of information requests." Finally, we're not tied to humans. We have ways to train networks which can pick out synchronised signals, and this can work for cartoons, where the mouth moves with the voice. The blue will be the active speaker and the red will be the inactive speaker. "Dad, you said you were going to play catch with me tonight." "I have to work, but give the monitor a kiss."

That's the end of the first example. What you've seen is that we can take something which arises from the world, from the physics of the world, which is synchronisation, manipulate it slightly, train this network which has tens or hundreds of millions of parameters, and then use this network to track the person who is speaking, for example.
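Before the second example, here is a hedged sketch of the localisation idea you just saw: compare each cell of the network's internal spatial grid of vectors with the audio embedding, and the best-matching cell is where the speaker is. The grid size and the random vectors are placeholder assumptions, not the actual network internals.

```python
import torch
import torch.nn.functional as F

H, W, D = 14, 14, 512
grid = torch.randn(H, W, D)     # spatial grid of visual vectors (assumed shape)
audio_vec = torch.randn(D)      # embedding of the audio clip

# Similarity of every grid cell with the audio embedding gives a heat map.
heatmap = F.cosine_similarity(grid, audio_vec.view(1, 1, D), dim=-1)  # (H, W)
y, x = divmod(heatmap.argmax().item(), W)
print(f"speaker located near grid cell ({y}, {x})")
```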
We're going to go on to the second example, which is audio-visual correspondence beyond just talking heads. Now we're going to consider more general objects, more general scenarios, where we have various objects that make sounds or actions that make sounds. In terms of this Turing infant, imagine that, having watched its parents talking, it's now sitting up, looking around at the world, and looking at and listening to the objects around it. This is development. The idea here is that if you see an image like this, an image of a drum, you know what it's going to sound like. And if you hear a sound like this, you know the answer: obviously this is a guitar. You have this semantic correspondence between what's in the image and the sound. This arises, again, just from the physical world. You look at the scene; if something's sounding, you can see it and you can hear it. It arises from the physics of the world. We use this semantic correspondence between the vision and the audio to train the network. This is a weaker requirement than synchronisation, and we can use it with a single image; we don't actually need temporal information for this, we just need the audio signal and an image. The way to do it, I'm going to formulate it like a picking game. We're going to task the network with picking which of these images the sound corresponds to. Imagine that the sound is actually a guitar, and it has to pick out which of these it corresponds to. It should pick this one. The way we're going to do this is, again, distances. We're going to find which of these embeddings has the smallest distance and pick that one.

We'll have a similar network. As before, we take a frame; it goes through a visual encoder and produces a vector, a list of numbers. We take an audio clip; it goes through an audio encoder and produces a vector. What we want is that if the audio and visual correspond, the distance between these vectors is small, these points are close; and if they don't correspond, the distance between the points should be large. That's it. Where do we get the data from? We get the data from any videos we have. Here are two videos. We don't need to know what's in them; they differ in this case. What we do know is that there's a correspondence between the sound and the frames. Now we can take samples from these for training. We take positive samples where we take a frame and the audio around it. We can take any number of these. How do we get negative samples? We simply take the audio from one video and the frame from another video. In general, they won't correspond. That's it. We can do this because videos which arise from the world have this correspondence property naturally, and we can sample millions of positive and negative samples like this and train the network. That's it.
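A minimal sketch of that sampling scheme, assuming we have a collection of videos, each held as a (frames, audio) pair; the structure and names here are illustrative only.

```python
import random

def sample_pair(videos, positive):
    """Positive: a frame and the audio around it, from the same video.
    Negative: a frame from one video, audio from a different video."""
    i = random.randrange(len(videos))
    frames_i, audio_i = videos[i]
    if positive:
        return frames_i, audio_i, 1
    j = random.choice([k for k in range(len(videos)) if k != i])
    _, audio_j = videos[j]
    return frames_i, audio_j, 0      # in general these won't correspond

# Placeholder "videos": each entry is (frames, audio); real data would be
# decoded video. Millions of such triples come for free from raw videos.
videos = [("frames_of_guitar", "guitar_audio"), ("frames_of_drum", "drum_audio")]
print(sample_pair(videos, positive=True))
print(sample_pair(videos, positive=False))
```

These triples then feed the same two-encoder contrastive training as in the synchronisation example.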
Imagine we've done this, and we look at the embedding space where these vectors live. What we'll have learnt is that when the audio and visual correspond, the embeddings will be close together. If we have another instrument, say a drum, then its sound will be distant from the embedding of the guitar, but it will be close to the embedding of the image of the drum. So we have an embedding space that we've learnt, like this. What can we do with it? One thing we can do is what's called cross-modal retrieval. We can start with a sound and find images which correspond to this sound. The way to do this is to populate the joint embedding space with frames from videos. That's what I'm showing you here: all these points are frames from videos. Now we can look at the neighbourhood of where the sound has been embedded and pick frames which are close by, and they must be corresponding. I'll show an example. Here's a sound I'm going to play you. What is producing that sound? We can dive into this embedding space, look at nearby frames and find videos. Here are the videos that could have made that sound. Cross-modal: we start with audio and we find images.

Now I'm going to show another use of this embedding space. We've seen that we've trained it so that when there's a correspondence between the audio and visual, the embedding vectors are close. Imagine here we have the sound of a guitar; then any other image of a guitar should be embedded close by to it. That's what we've trained it to do. By transitivity, what's happened with this network is that it has learned to embed all the objects of the same class near together. That's what it has to learn to do; this is in fact how it solves the problem. How else could it solve the problem of determining the correspondence between the audio and the visual? And yes, it was doing this. We've actually learned a visual network which embeds objects of the same class close together. Now we can use this for visual retrieval. We can start off with a frame of a video, put it into the embedding space, populate this with other videos, and now look at its neighbours. These will be other videos. We can start with a video and search for similar videos, or search for similar frames. Here's an example. We start with a frame of a guitar, search in a few hundred thousand images, and these are the ones that are nearby. We started from acoustic guitars and found acoustic guitars. Another query: we start with a drum, search inside the embedding space, and we can find images of drums. The point is that all this has been learned simply from taking samples from videos where the audio and visual correspond. That's all we had to do. We've trained this visual network, and now we can use this visual network for recognition.

We can also use it for localisation. As we saw in the synchronisation case, we go inside the network, which has this spatial grid of vectors, to find out where the object is that's making the sound. We take the audio embedding, we look at the spatial grid of vectors, find the closest vector, and that will be where the object is that's making the sound. I'm going to show you an example. You're going to see a video and, frame by frame, the heat map in the centre overlaid on the frame, and then on the right you'll see the heat map itself. As I said, this is all done frame by frame. It's localising the objects making the sounds. All of that was learnt.
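Here is a hedged sketch of the cross-modal retrieval just described: embed the frames once, then look up the nearest neighbours of a query sound's embedding. The random vectors stand in for real encoder outputs.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the trained encoders' outputs.
frame_embs = F.normalize(torch.randn(100_000, 512), dim=1)  # embedded video frames
sound_emb = F.normalize(torch.randn(512), dim=0)            # embedded query sound

# Nearest neighbours by cosine similarity (dot product of unit vectors).
scores = frame_embs @ sound_emb
top5 = scores.topk(5).indices
print("frames most likely to correspond to this sound:", top5.tolist())
```

Visual retrieval works the same way, with a frame embedding as the query instead of a sound embedding.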
Third example: we're going to change the modality now. So far we have had audio and visual; now we're going to change to language, or text, and visual. In terms of our infant, by about 10 or 12 months infants can start to understand words and speak words, and eventually they'll learn to read, like this. That's what we're going to do now. I've done it in this order, starting with audio-visual and then moving on to language, because an infant learns to speak after it has learnt to see. Let's go back to our cognitive psychology. I gave six lessons, and lesson number six is learning language. So we're still following the six lessons. How do we do it? We've seen that we can train networks like this. I showed you, in the audio case, that we have a visual encoder and an audio encoder, then a contrastive loss, and we minimise the distance between the outputs when there's a correspondence. Very simply, instead of the audio modality, we can just change the audio encoder to a text encoder. Now we have text which corresponds to the image; the text here is "a man is playing an electric guitar". We have a text encoder, and we can use exactly the same idea, this correspondence idea: if this text, this sentence, corresponds to this image, describes this image, then these output vectors should be close together; and if it doesn't describe this image but some other image, then the output vectors should be far apart. And that's it; we've got our network.

How do we train it? To train this, we need paired data between images and text, text which describes images. Where do we get that from? Fortunately, on the internet there's something called alt text. If you hover your mouse over an image, you often see a sentence come up; it's provided so that you don't have to download the image, or, for the visually impaired, so that it can be read out. This alt text is available in massive quantities; there are millions or billions of examples of alt text available. I've put some examples on this slide. On the far left, the alt text is "trees in a winter snow storm", describing the image, and the one on the far right is "a facade of an old shop". So this is available easily, and we can train the network by getting millions of examples of paired image and text and, as before, just taking positive pairs where they correspond, for which the distance should be small, and negative pairs, a random image and a random text, which don't correspond, for which the distance should be large. And that's it; we train the network.

So again, imagine we've trained the network and we look at the embedding space where these vectors live. What we'll have is that if the text corresponds to the image, as it does here, the embeddings will be close, and if we have another text describing a different instrument, that doesn't describe this image on the left, so it will be far away, but it will be close to the image it actually describes. So we've got this embedding space again. Now how do we use it? We're going to use it for searching and retrieval of images and videos using language. This is really useful. Once again, we have a joint embedding space. We can populate it with images; these dots now represent images that have been encoded, the vectors from those. Say we want to find a particular image, and we want to find it using language. We describe what we want: here's a sentence. We embed that in this space, and then we again look for its neighbours. These will correspond to images of what we're looking for, because of the way we've trained it.
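This image-text contrastive training is, in spirit, what models like CLIP do (the lecture doesn't name a specific model, so take this as an assumption); a minimal sketch, with the batch size and temperature as further assumptions, looks like this.

```python
import torch
import torch.nn.functional as F

B, D = 32, 512  # a batch of image/alt-text pairs; sizes are assumptions
img_emb = F.normalize(torch.randn(B, D, requires_grad=True), dim=1)
txt_emb = F.normalize(torch.randn(B, D, requires_grad=True), dim=1)

# Similarity of every image with every text in the batch; the corresponding
# (image, alt-text) pairs sit on the diagonal.
logits = img_emb @ txt_emb.t() / 0.07          # temperature: an assumption
targets = torch.arange(B)                      # index of each pair's partner
loss = (F.cross_entropy(logits, targets) +     # image -> its own text
        F.cross_entropy(logits.t(), targets)) / 2  # text -> its own image
loss.backward()
```

A language query is then served exactly like the audio queries earlier: embed the sentence, and look up its nearest image embeddings.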
So now I'm going to show you a demonstration of this. The really remarkable thing about language, in terms of communicating with a machine, is that you can keep on adding words; you can make queries ever more complex, keep on adding requirements. I'll show you how that works in the demo. This will be a demo searching 35 million images from Wikimedia Commons. You'll see the text being typed in, and the retrieval will come immediately. We'll start with something quite simple. The first one is a red car, and there we are. Now we make it slightly more nuanced: it's now a sports car, and there it is. And now, more interesting, several requirements: person riding a bike. There we are. Change bike to horse. Fine. Now make it even more demanding: riding a horse, but jumping. And there it is. And so on. We can also search for animals, and we can search for animals doing particular things. Here's penguins raising their wings. What you're seeing here is that with this embedding it really feels like you're communicating with the machine, because you get this instant response as you keep on making the search query more and more precise.

So those are the three examples I wanted to show you of learning from data; that's, sort of, the tutorial part of the talk. Now I'm going to finish by giving you two snapshots of more recent research that build again on this type of self-supervised learning of correspondence between modalities. We've done all this work; now let's use it for some applications. I'm going to show two applications. One is going to be recognising British Sign Language, and then I'm going to do audio description of videos.

OK. Number one, British Sign Language. This is a visual language that the deaf community uses in Britain, and here's an example. I don't know how many of you can read British Sign Language, so I'll tell you what she was doing: she was interpreting the sentence "every spring our planet is transformed". I'm going to play it again so you can look for the sign for "planet". OK. She actually does seven signs in that short sequence, and it's very challenging to spot them all. We would like, of course, to have a machine that could understand British Sign Language, for many reasons. One is so that deaf people can communicate with a machine. At the moment we can speak to machines and get them to do what we want; if a deaf person wants to do that, they have to type. It would be much better if they could use their own language to communicate. And, of course, it would be very good if they could communicate with non-signers; the machine could help do that, could translate. So that's why we want to do this. How do we do it? Where do we get our data from, our paired data? And the answer is that we get it by watching television, because on television you will have seen signers overlaid on television programmes, like this. You have subtitles which correspond to what's being said, and the signer is also interpreting what's being said, so you have a paired-data correspondence between the subtitle and the sign sequence. This correspondence is what we can learn from, as we've been doing all the way through this talk. Now, the BBC have very generously made available 2,000 signed programmes, together with subtitles, to support academic research on recognising British Sign Language. I'm going to show you some work we've been doing on this large dataset that they released.
And what I'm actually going to show you is how we can recognise signs using mouthings. So, what are mouthings? When signers are signing, sometimes they mouth the words that they're signing. Not always, but sometimes they do, and that's what we'll pick up on. I'll show you some examples. On the left, you're going to see the sign for "office", and he also mouths "office". On the right, he's going to do the sign for "tree", and he's going to mouth "tree". So why is this useful? It's useful because we can pick up words that are being mouthed on the lips; we know how to do that. How this is going to work is: imagine we have a subtitle, like this clip here, "Are you happy with this application?". We take each word in this subtitle, "happy", "application", and see if it's mouthed. In this example, she does mouth "happy". So we look at the lips and we find where "happy" is being mouthed. Once we've done that, and we know the temporal segment where she mouthed "happy", then we know the sign, because the sign is made at the same time as the mouthing. So we have a way of automatically annotating the data with the signs. And the way we do the spotting on the lips uses the synchronisation network I showed you in the first example; that's actually how we do it.

Okay, so that was one example, and now imagine we do this at industrial scale. We scale up; we do it on the 2,000 BBC programmes. We take the words that occur in the subtitles, we look at all the subtitles in which they occur, we see whether the person is mouthing that word, and then we take that segment, and the sign corresponds to the word. I'll show you some examples. First of all, for "family". Very fast. "Important". You see, we're getting all these different examples. "Before". Actually, if you look at this one, "before" is signed in two different ways in these examples; you can pick out that some signers are doing it one way and some signers are doing it another way. "Perfect". Now, we're getting these signs from mouthings, but of course, once we have the sign, we can learn to recognise it just from the hand movements, the hand gestures. Then we'll be able to recognise it whether they mouth or not. And for each word we can generate hundreds or thousands of examples; here you're seeing these examples of "perfect", and more "perfect". So really it's quite a powerful method, in fact, because we can generate signs for thousands of different words, with hundreds of examples, say, for each one, and now we have a way of learning all these signs and recognising them by computer. This problem's not solved, but this is a way of generating the data.
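As a hedged sketch of that annotation pipeline (the function names here are hypothetical; the real spotter is the synchronisation-style network mentioned above):

```python
def spot_word_on_lips(video, word):
    """Hypothetical keyword spotter on the mouth region: returns the
    (start, end) time in seconds where `word` is mouthed, or None.
    In the work described, a synchronisation-style network does this."""
    ...  # stand-in; always returns None here

def annotate_signs(video, subtitle_words):
    """For every subtitle word that is mouthed, label that temporal
    segment with the corresponding sign."""
    annotations = []
    for word in subtitle_words:
        segment = spot_word_on_lips(video, word)
        if segment is not None:                  # the word was mouthed here,
            annotations.append((word, segment))  # so the sign co-occurs
    return annotations

# e.g. annotate_signs(clip, ["are", "you", "happy", "with", "this",
#                            "application"]) might yield [("happy", (1.2, 1.8))]
```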
Now, the second application I want to show you a snapshot of is audio description of video. Audio description is a soundtrack that's provided for the visually impaired. It describes the visual elements of the television programme or the movie, so that they can understand what's going on. I'm going to show you a short example of audio description, the type of thing that's available; this is for the film Out of Sight. You see how the audio description is complementary to the soundtrack: the things that you couldn't tell from what's being said or from the music, that's what it's providing, and then someone who's blind can understand what's going on in the film. Okay, so we'd like to be able to generate these automatically. We'd like to have a machine that takes in the video and then produces the audio description, probably as text, and then we have text-to-speech that will read it out so the visually impaired can follow it. So how would we do that? We obviously need to supply the video to a model that we're going to train. But we have to do more than that, because audio descriptions include the names, you heard it, the names of the characters. So we also have to provide the names of the characters; we have to provide a character bank of the people who are in the film. We need this auxiliary information. Given those two inputs, we want to train a model to produce the audio description. So where are we going to get the training data? We need paired data between films and audio descriptions. Fortunately, volunteers have provided audio descriptions for thousands and thousands of films, so this paired data is readily available, and, as you've seen, we can learn from corresponding data like this. We have films, we have the audio descriptions, and we can learn a model which generates the audio descriptions.

Okay, and I'm going to show you two examples of audio descriptions that we've generated. Again, this is still work in progress; it's not finished. They're both from Harry Potter, and the first one is a painful example. "Concentrate, Potter. Focus." Okay, so that's the clip, and this was the audio description that we predicted for it: "Snape points at Harry", so we've got the characters right, "Harry closes his eyes in horror". There's more pain than horror, I'd say, but this is what was produced. Second example, a more pleasant example. "So how are we going to get there?" "We fly, of course." The audio description that was produced was: "Hermione, Ron and Luna's eyes are fixed on Harry, who is standing in the doorway", that's correct, that's what happened, and then "Harry rides on the horse's back as the horse rears up in the air". So the model thinks this is a horse, which of course it clearly is not if you know your Harry Potter. There's more to do here.

Okay, so that's the end of my snapshots, and I'm finishing now. This is what you've seen: you've seen that it's possible to learn visual encoders directly from data in various ways. There's no need for manual supervision, which is the traditional way of doing this. And I've gone through a learning curriculum for a virtual infant: audio-visual synchronisation, audio-visual correspondence, and then language-visual correspondence. That's what you've seen. I'll just mention that the computer vision field works on this problem a lot, and even though I've shown cross-modal learning, you can also learn visual encoders purely from the visual stream; deaf people learn to see as well, of course. I'd like to end by thanking people. Of course, this work is not done just by me, by any means; it's done by my students and my postdocs, and I'm always inspired by talking to colleagues at Oxford, in the UK, and at DeepMind internationally. It wouldn't be possible without all these people, and a lot of them are here in the audience, so that's great. Thank you.

Thank you very much, Andrew, for a super lecture, very stimulating. We have time for some questions, so who would like to start us off? There are a couple of people with microphones, so if you put your hand up, someone will come to you with a microphone. Let's start over here on my right, your left, with the first question.
We also have people watching, of course, as this has been live streamed, so people can ask questions on Slido, and I think one of my colleagues is hovering around with an iPad and will wave at me if we have some questions on Slido. Please go ahead.

Hi, thanks for the lecture. What I wanted to ask is: how far are you away from having real-time, live access to a system like this? So, there were lots of systems I showed here. The retrieval demo I was showing, that's real time: you can type, and it will immediately retrieve images or videos from a large data set. Right, I think what I was trying to ask is more about the video. At the moment it's trained on videos online, I take it, and what I was trying to ask is: if you had a device that could see someone talking in real time, for instance signing, could that then be translated? Sign language is not solved, just to be clear; we can't do sign language yet. At the moment, most of these methods run on big machines, GPUs and so on, but there's lots of work on taking these big models and distilling them down to smaller models in various creative ways. So even though at the moment they run on GPUs, some of these models can already run in your browser. You can do real-time pose detection of humans in your browser, and ASR, speech recognition, can be done in your browser. So these models start large, but then, once they're ready, they can be made smaller and more portable. Does that answer your question? Thank you. I would say all of the models are too large at the moment to do that, of course. That's the way it goes.

The next question, just a few rows back, I think. Congratulations on the prize, and thanks for the lecture. It feels like for different tasks you may have to do very specific data processing, which you've just explained. Do you see a way of doing self-supervision with a fairly general type of data processing, which you can later apply to very different tasks, maybe like what ChatGPT does for text? Thank you for the question. I've concentrated on visual tasks here, and we already know the answer: the answer is yes. I've shown some different types of self-supervision here, and there are many others, as I said. Once networks have been trained in some way using self-supervised methods, they can be used for multiple tasks by applying what are called different heads. They can be used for recognition, object detection, tracking. Once you have a good feature, a good network, it can do multiple tasks, and this is really the way that large-scale networks are trained nowadays. But there's still an issue that you have to say what the tasks are. It's still a research question to train a network once and then do all the tasks you want: predict depth, predict other things about images or videos. There's still work to be done here, but we certainly have lots of evidence that training a good visual backbone enables lots of tasks after that.

Okay, so I think we have a question there; if you go one row back first, I think, then pass it back again. Hi, I just wanted to ask: is there an ideal number of dimensions the embedding space should have? Yes, thank you. It's usually a power of two, and I think it's a good question, but I don't have a good answer. There's also storage: if the vectors are very large, they require more storage. But no, I don't have a good answer; it's an empirical question, really. You try different sizes and you determine what works best empirically.
Thank you for asking. If anybody else here has a better answer, please... Any volunteers? Okay, go ahead, please.

Thank you very much for the talk, Andrew. Knowing what you know now about this research project, which parts were the hardest? Was it data collection, data prep, designing the encoder neural net architecture? What advice would you give to somebody who's working on a similar research project, knowing what you know? At each stage of these projects you stumble across something; that's always the way of a project. You have an idea, or somebody has an idea, and you start doing it, and then unexpected things happen, so it's difficult to answer, really. Sometimes the networks are hard to train; sometimes the data that you think is good is not good. When you're putting together so many things like this, unexpected things happen. One of my rules is that data sets are always noisy; it always happens, there's always something wrong, and you always have to look at your data and see what's going on. So I can't give you a definite answer; it's just that every stage always has problems.

Okay, so we have a question at the front here. Thank you for a great lecture. With the first method you showed, with temporal alignment and so on, that could be a cue for how babies might learn to see, but the later work, with things like text-to-object, still seems to require a huge amount of data. Is there any argument that that might be similar to the way that babies learn to see and correspond with language? It's a good question. I think the huge amount of data in the text case is the Achilles' heel. That's a problem at the moment: you need so much data to learn from, and I think it's a research question how you can avoid needing so much data. By the way, I wasn't saying this is how babies necessarily learn; the point is that in principle they could learn this way, because we can see that just from the data you can learn these skills. I'm not saying that they necessarily do this at all; it's just that the information is there. And the order is right: from cognitive development we know that it's after they've learnt to see and hear that they acquire language. That's the only point I was making there. But how to avoid having to use such vast quantities of data in the text case, I don't know that I have a good answer to that. In the audio and visual case it's readily available; there's no cost to that.

Question here. Hi, thank you. I am not in the field, so my question might be stupid, but I'm just wondering, for the correspondence or synchronisation, what if there's some false correspondence? Let's say I'm waving my hand, but the sound actually corresponds to something which is not in the image at all. Will your model be able to pick that out? It's the large-scale data that helps avoid problems like that. We can call it noise, things which don't correspond to what's making the sound, but when you see enough examples, you can pick out the things that matter and the things that don't. That's the answer to that question, really. If you think of the talking head, there are lots of things going on: the person who is talking, their eyes might flutter, their hair might blow in the wind, but in order to solve the task of learning synchronisation, the network has to see that what really matters is the lips, because it's the lips that are synchronised with the voice, with the speech. So it learns to ignore all these nuisance factors, this noise, and pick out what really matters; otherwise it can't solve the task. So, Andrew, how many examples is enough?
It's a good question. How many is needed depends on the circumstances; it cannot be the same in all circumstances. It's another question that we always fight with. We keep going empirically until things work well, and then you see if you can train more efficiently, meaning with less data. I'm waving my hand and saying a million samples, because that's what we use, and because it's so easy to get this data. Is there a statistical test you can apply generically to these things? To give you a more proper answer: there are scaling laws, where you can say, for a given number of parameters, how much data you need given a certain training budget; this is published work for training models. But again, it starts off empirically; it's not like physics or geometry, where you can give counting arguments. This field doesn't have things like that; it's much more empirical, and then you generalise from that.

I think there are more hands, so yes, two over this side, this one first. Hi, I have two quick questions. What do you think computer vision will look like in, say, 20 years? And the second is: is superhuman vision possible? Do you mind if I don't answer the first question? Because these sorts of predictions, I know people always want us to make them, but they're always wrong, so whatever I say is going to be wrong, really, because things move so quickly. Superhuman: yes. What would be superhuman, though? Because I think for sure we can do things that are superhuman. Being able to search through 35 million images in a fraction of a second, surely that's superhuman, and we're going to be able, for sure, to search enormous satellite images spanning the whole world and find something. We can already do superhuman things with computer vision, and it will go on: in terms of temporal resolution, spotting things which exist over long time scales that a human wouldn't notice, or over short time scales that a human wouldn't be able to see. All of these things will happen. Yes, once we can do a skill on a computer, we can make it superhuman.

A question here. Thank you very much for the talk. You showed us the multimodal examples, and then at the end you said that this can also be applied to a single modality. But with a single modality, obviously, you then have to figure out the augmentations, because it's less obvious how to create positive and negative examples. Have you found that multimodal embedding, comparing the embeddings from multiple modalities, learns better because you don't have to engineer those augmentations? Or do they perform on par, or can the single modality actually perform better? That's a good question, and you're right that with images you have to do augmentations. If you're going to do something from a single image, the sort of thing you might do is crop the image, and you want the embeddings to match even though you've cropped. So if an image has got a horse in it and you crop out parts of it, it will still be a horse, so you want the embeddings computed from these various crops to all match. That will then train the network to understand that the content of the image shouldn't be affected by these crops. But to answer your question: empirically, learning from multiple modalities generally works better, certainly in video, than doing all these augmentations; it's naturally doing augmentations, so it works better. But the other methods work very well as well; you just have to work harder to make them work. Some people in this room have done both multimodal and unimodal learning; both will work well in the end, it's just that you have to
do more work to make the unimodal ones work. We have a question here. Thank you for the lecture, very interesting. Just because I'm curious, on the topic of finding a speaker in a crowd: what if everyone is turned around? How do you then detect the speaker? Yes, then it can't, of course; it has to see the lips. But there are other cues: if a person is speaking, when I'm speaking I move my hands, there's body language. There are other ways you can do this as well. I haven't shown that here, but you can imagine that if everybody was always speaking from behind, and the network had to solve the problem, it would do something like that.

Okay, I think the gentleman here will have the honour of the last question. Yeah, I was just wondering: there were several examples where you had, for instance, a visual network working together with maybe an audio network, or it could be a descriptive network, and I'm wondering whether the sort of embedding space that you end up with for the visual network can apply across multiple different problems. You sort of alluded to that earlier, saying that sometimes, even if you trained it on one problem, it can actually be useful on other problems. I'm wondering whether these embedding spaces end up encapsulating an overall description of the image which can be used in multiple different tasks, and whether those embeddings are a good gestalt of the whole thing that's being presented. The answer is yes. You train them for one task, but in order to solve that task, they have to do something more than you've trained them for. An example is all the drums being embedded close together and all the guitars being embedded close together; once it's done that, then anything like a guitar or like a drum will embed to a certain point. So it has learnt the characteristics, and that applies to thousands of different categories, and then you can use it for tracking guitars, or for other properties you might want for guitars. So the answer is yes, basically.

Okay, well, that's super. Now, before you applaud to thank Andrew again, I'm going to combine that task with presenting him with his scroll and medal. So, Andrew, if you want to come out here, because that can make it easier, I will hand you this, if you want to hold that, and you can shake my hand at the same time. Thank you all very much for coming, and it was an excellent lecture. Thank you.