OK, thank you, it's a pleasure to be here; I am a little jet-lagged, so if I say something stupid, forgive me. I am going to talk about a series of works I did with many people, mostly with two of my students, Armand Joulin and Piotr Bojanowski, and a few postdocs, Korean postdocs: Minsu Cho, Bumsub Ham, and Suha Kwak. I have always been a vision person. Today the fashion is to do a lot of benchmarking and show a lot of tables, so I am not going to do much of that, just a little bit. I am not going to discuss visual features in detail, and I am not going to pretend that the way we implement things, that our algorithms, are the best, but I think some of the ideas are interesting. I am going to talk about high-level vision, as usual. In the previous lectures we talked about CNNs, things like that; I am not going to talk about deep learning right away, maybe only in the last few slides. I am going to talk about weakly supervised recognition. Here is a typical problem in computer vision: you want to recognize objects. You are given many training examples, images of beavers, chairs, trees, and so on. Someone has put these pictures in bags and attached the label beaver, the label chair, the label tree, and then you try to train a machine. At the end, you give it a new image, like the one on the right, and the machine should say "beaver". As you know, this is usually formulated as a machine learning problem: you work in some feature space, and you train your machine by computing a prediction function f that minimizes some measure of discrepancy between the labels given at training time and the predictions on the data, and then you regularize to control the class of functions you consider.
The illustration here is for linear methods, where you work in some feature space computed on the data, but of course there are also nonlinear methods such as neural networks. For all these methods, what you assume is that you are given the labels at training time, and of course that is expensive. So there is a trend, I don't know your backgrounds, but there is a trend in the vision community to label everything. This is an example from MS COCO, and it's an example I took a year ago, so MS COCO is probably bigger than that by now. There are people who label images like this one and delineate all the objects of interest in the image. If you gave me a picture like that and asked me to outline the objects of interest, it would take me maybe 20 minutes; I have never actually done it, so I don't know exactly, but it would take a long time. And they did that for 400,000 pictures. That is good, because it gives you benchmarks, it gives you hard problems, it gives you a rigorous way to evaluate things. But it is also bad, because it is far too expensive. There are also biases, because very often these benchmarks are not created for real tasks; very often they are artificial tasks, a bunch of people who go into a room and decide which pictures they are going to put in their bags. And there is another problem that is quite interesting for vision, which is the level of granularity: where do you stop? These people label what you would imagine, the dogs and the people; they label the grass, but they don't label each individual blade of grass. So this idea of labeling the world is not the way to go.
Especially since I am going to talk about video: you would have to do this for millions and millions of pictures, of frames. So what we are talking about is trying to use less manual information, and I am going to give you a bunch of examples, very simple ones, moving from not much information to no information at all. So, to start, let's look at the co-segmentation problem. In the co-segmentation problem, nobody delineates the objects in the training data for you. The task is to segment the images, that is, to separate each one into regions, more or less foreground regions and background regions, or sometimes multiple regions that may correspond to interesting objects; but to train in co-segmentation, nobody delineates anything for you. The only supervision that you have, and it's a weak form of supervision, is that you are told that these pictures belong to the same bag: they contain something in common, but you may not be told what. And you want to use that information to segment the images. These are some of the outputs of the algorithm, which Armand Joulin did with Francis Bach and me a couple of years ago, and as you can see, it gives reasonable results. In most of the examples I'm going to show you, the method works; of course, there are plenty of examples where it gives, not random, but bad results. But here it works reasonably well. If you look at the Stonehenge pictures, for example, I find it remarkable that it works so well, and I also find the elephant pictures a bit strange, but maybe I shouldn't spend time on them. All right, so it works. The idea is very simple: it is just to generalize supervised classification. This is the supervised classification scheme I was showing you before, where you know the labels, and there is something called discriminative clustering, introduced in the mid-2000s, that exploited this idea.
The idea is very simple; there are two ingredients. One is that, to do this clustering into categories that you don't know, you are not only going to optimize over the classifier, you are also going to optimize over the labels. This time you don't know the labels; you are going to try to compute them. And since this is in general complicated, instead of taking a complex loss function you take the square loss, because then, when you optimize over the classifier, you get the classifier in closed form: it's a ridge regression problem. So you can eliminate the classifier, and all you are left with is an optimization over the labels, which is a quadratic optimization problem, and after relaxation you can solve it efficiently. In the co-segmentation work, what we did is take that idea and generalize it a tiny bit. We changed the loss from square loss to a softmax, because we wanted to do multi-class things and it seemed a good idea. We added spectral clustering terms to group together pieces of images that belong together. And these discriminative clustering methods have a problem: they tend to put all the pixels into the same class, so we added an entropy term that spreads out the wealth and prevents that. Otherwise, it is the same type of technique. Of course, then you need to see how you can optimize it, so you relax it; the problem is non-convex because of the entropy term, so we do block coordinate descent. You need to initialize with a reasonable approximation, so you do a Taylor expansion near zero, get a quadratic form that you can solve in closed form, and then you do rounding. With that, you get the results I showed you. This is an example of the same method applied to three pictures from the same shot of the same movie.
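To make the closed-form trick above concrete, here is a minimal sketch (my own toy illustration, not the actual co-segmentation code): with the square loss, the ridge-regression classifier can be eliminated analytically, leaving a quadratic cost in the labels alone.

```python
import numpy as np

def label_cost_matrix(X, lam=1.0):
    """Discriminative clustering with the square loss.

    For fixed labels y, min_w ||y - X w||^2 + lam ||w||^2 is a ridge
    regression with a closed-form solution, and by the Woodbury identity
    the optimal value equals y^T M y with M = I - X (X^T X + lam I)^-1 X^T.
    The classifier drops out; only a quadratic in y remains.
    """
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    return np.eye(n) - H

rng = np.random.default_rng(0)
# two well-separated clusters of 5 points each
X = np.vstack([rng.normal(-2.0, 0.1, (5, 2)),
               rng.normal(2.0, 0.1, (5, 2))])
M = label_cost_matrix(X)
y_clusters = np.array([-1.0] * 5 + [1.0] * 5)  # labels matching the clusters
y_random = np.array([-1.0, 1.0] * 5)           # labels ignoring the clusters
cost_clusters = y_clusters @ M @ y_clusters
cost_random = y_random @ M @ y_random
```

The labeling aligned with the clusters is easy for a linear classifier to reproduce, so its cost is much lower; the actual method then optimizes over relaxed labels under balancing constraints to avoid trivial constant labelings.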
As you can see, this is kind of an ideal setting, because the imaging conditions are roughly the same: the same characters appear in the movie with the same appearance, and the background changes a bit. This thing succeeds in finding the main characters pretty well. It doesn't segment them exactly, there's a shoulder patch that's missing, but who cares: all we care about, to extract information from an image, is the rough shape of the object. That brings us to video. In the previous case, I was relying on some amount of manual annotation: there are people who come and say, well, I put these pictures in a bag because they contain something in common. Here I don't want manual annotation; I want to exploit metadata that may come with various types of images or videos. In the case of video, there is a remark that was made a while ago, I think first by Laptev and Pérez, that you should be able to exploit the script data that comes with movies. Any movie or TV series is shot after a script, and that script contains information that is correlated with the visual information. This is an example from Casablanca. The script says "the head waiter takes them to a table", and you can see the waiter, I don't remember the name of the actor, taking them to their table; then "she passes Sam", and you can see Sam keeping his eyes on the keyboard. So there is a correlation between the text and the visual content: can we use that text to inform us about the content of the video and figure out what's interesting in it? Scripts are available for many, many movies and many, many series. They always exist, but the production companies don't always give them to you; still, many of them are on the web, because people enjoy typing up scripts as they watch movies and TV series, and they post them. But the script is just a piece of text; it's not synchronized with the video.
On the other hand, any film or TV series comes with subtitles, for people who are hard of hearing, and the subtitles contain the dialogues; but the script also contains the dialogues. The subtitles are timestamped, so you can align the script with the subtitles very easily using dynamic programming, and then you have timestamped your piece of script. For that piece of the script, you know that the dialogue occurs at such and such a time, and so what is described around it should happen in the corresponding slice of time. Then you can try to use that to figure out what is happening at that particular time. Here is an example: you would like to identify the characters in a movie, and also what they do. The script tells you "Rick walks up behind Ilsa"; you can imagine that you do some face detection and get those yellow boxes, and then you want to label them with the names of the people, and you also want to say what each person is doing. It's easy to formulate that using the same discriminative clustering formulation as before.
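The script-to-subtitle alignment can be sketched with standard sequence alignment; here is a toy version using Python's difflib as a stand-in for the dynamic programming mentioned above. The word-level timestamps are a simplification I am assuming for illustration: real subtitle files carry timestamps per block, not per word.

```python
import difflib

def align_script_to_subtitles(script_words, subs):
    """Align script tokens with timestamped subtitle tokens.

    subs: list of (word, timestamp) pairs (hypothetical layout).
    Returns a dict mapping matched script-word indices to timestamps.
    """
    sub_words = [w for w, _ in subs]
    sm = difflib.SequenceMatcher(a=script_words, b=sub_words)
    stamped = {}
    for block in sm.get_matching_blocks():
        # each matching block pairs script_words[a+k] with sub_words[b+k]
        for k in range(block.size):
            stamped[block.a + k] = subs[block.b + k][1]
    return stamped

script = "rick walks up behind ilsa they sit down".split()
subs = [("they", 12.0), ("sit", 12.4), ("down", 12.8)]
stamps = align_script_to_subtitles(script, subs)
```

Only the dialogue words get timestamps; the surrounding scene description ("rick walks up behind ilsa") is then assigned to the slice of time bracketed by the timestamped dialogue around it.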
Again, this time we are going to have two label variables, one for character names and one for action names. There are linear constraints that say that if a person is mentioned in that piece of the script, or if an action is mentioned in that piece of the script, then that person or that action should appear at least once; and then there are linking constraints between the persons and the actions, saying that both of them should appear at the same time. Those are nonlinear, so again there are non-convexities, and you need to be able to optimize that: again we relax, we do block coordinate descent, we initialize one of the variables with a uniform distribution over time, and then we run. It's no big deal to do all that. So here is the kind of thing you get; it takes a while to start. Again, green is good, red is bad. The face is found by some off-the-shelf face detector; the body is the larger rectangle, which is just an offset and scaled version of the face box. It's not perfect, but it's a relatively difficult task, and sometimes, of course, it doesn't pick up the people. All right. Often, people who deal with video treat it as a kind of three-dimensional block of data. That's not quite right, because video has a time axis, and time doesn't go backwards in most cases, so we should use that temporal information. That leads us to the next thing. Here the idea is that we take a piece of script, we do some keyword mining, and we decide on a vocabulary of actions: sit down, open door, whatever is written there. Then, for a given piece of script, we mine which of these keywords, or synonyms of these keywords, occur, and we make the assumption that the temporal order in the video and in the script is the same: if in the script they open the door before they sit down, then it should be the same in the video. How can you formulate that? We are again going to write it as something very close to discriminative clustering. We have our dictionary of actions, including an empty action, because there will be plenty of frames where nothing in the dictionary is happening. The metadata is given by a mapping A that tells you that the k-th action in the script corresponds to some entry in the dictionary, and then you have an assignment, a small m here, between time frames and your actions, your metadata. So what you want to do is again optimize your classifier, but this time, instead of being just a label, the target is the entry of the table that corresponds to your assignment, and the fact that the temporal order is maintained corresponds to the fact that the assignment path is monotonic. It turns out that one possible representation of the problem is to combine your metadata and your assignment into an indicator variable, and then you have complicated constraints on these indicator variables, associated with the fact that you have a monotonic path. If you do that, you get back to the usual setting, and you recognize the square loss, et cetera. But the difficulty now is that you are trying to optimize over a discrete domain defined by very complicated, implicit constraints, and doing that optimization is not obvious. So when we relax the problem, instead of relaxing it on the discrete domain, we relax it on its convex hull, and fortunately there is a nice way to optimize over that convex hull using the so-called Frank-Wolfe algorithm, an old algorithm that has become popular in the last few years in the machine learning community. The idea is that you have a convex function defined over a convex domain, and that domain may be very complicated, you may not have an explicit description of it; yet the nice property of Frank-Wolfe is that you can find the global optimum of your problem as long as you have an oracle that lets you minimize the tangent plane, the linearization, over that domain. It's an iterative algorithm: you optimize over the tangent plane, then you move, not midway, but somewhere on the way between the optimum you have found and the current point; there is a formula for the scale factor, and this thing iterates. The tangent-plane step can be done using dynamic programming, so it can be done efficiently. And this is the kind of thing you can do: we took a bunch of movies, we decided on a vocabulary, and we let the program fast-forward; when it thinks it has found an action, it slows down and displays it. Of course it fails sometimes, but the examples I show you work quite well, and those are not easy. We wanted to evaluate these things, so we asked somebody to annotate by hand where they thought the actions were happening in the movie. The wider bands at the bottom are what the person annotated; the narrow bands are what the program found. Here is the next example, which I like: yellow is "drinking", the wide yellow band is the person's annotation, the thin yellow band is the program finding the action, and I think both make sense. For the person, this whole scene is drinking: "are you going to drink, are you going to drink", blah blah blah, it's a social thing; for the computer, it's the actual sip; and both are very reasonable. I think that's interesting, because in most of computer vision research on visual recognition, the semantics of the labels is absent: you have labels, you call them drink or sit or potato or banana, but they could be one, two, three, four, or A, B, C; they are pure symbols. You never use, well, I'm exaggerating, but by and large you don't use the semantics of the labels. So it's quite reasonable that somebody would call 'drinking' something like this.
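Going back to the Frank-Wolfe step described a moment ago, here is a generic sketch (a toy illustration of the algorithm, not the paper's implementation): the feasible set is touched only through a linear-minimization oracle, exactly the role played by dynamic programming in the talk.

```python
import numpy as np

def frank_wolfe(grad, oracle, x0, n_iter=200):
    """Generic Frank-Wolfe: only a linear-minimization oracle is needed.

    At each step, minimize the linearization <grad(x), s> over the
    feasible set (the oracle), then move part of the way toward that
    solution with the classic diminishing step size 2 / (t + 2).
    """
    x = x0
    for t in range(n_iter):
        s = oracle(grad(x))      # best vertex for the tangent plane
        gamma = 2.0 / (t + 2.0)  # standard step size
        x = (1 - gamma) * x + gamma * s
    return x

# toy problem: project c onto the probability simplex, whose linear
# oracle is trivial (the best vertex is a basis vector)
c = np.array([0.1, 0.7, 0.2])
grad = lambda x: 2 * (x - c)
oracle = lambda g: np.eye(len(g))[np.argmin(g)]
x = frank_wolfe(grad, oracle, x0=np.ones(3) / 3)
```

Every iterate is a convex combination of vertices, so it stays feasible by construction; that is what makes the method attractive when the domain has no explicit description but a cheap oracle.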
And somebody else would call 'drinking' something else entirely. OK, I'm going faster, that's good. So this idea led us to consider the problem of totally unsupervised interpretation of images and video: this time, no labels at all. It seems easy, but it's really not that easy; there was no manual annotation for that movie, but we were still using the script. Here, what I'm going to do is this: instead of people putting things in different bags, I'm putting everything in the same bag, and the program has to figure out for itself what's in there. That's the work of Minsu Cho, a former postdoc of mine, and there are several key ideas. The first one is that nobody really uses the semantics: labels are just a way to put things in bags, a way to group things that people think belong together. Some people may say these are horse pictures, some people may say these are horse-riding pictures, some people may say these are banana pictures; I don't care. The main thing is that people think these things belong together. If that is the case, maybe you can replace that supervision process by an automated process that just says: well, these pictures look like each other, so they belong together. That's idea number one. Idea number two is that, if you think that way, and you take a collection of pictures, say the internet, and of course we don't work at internet scale, but say the internet, then this collection has some underlying implicit structure, a kind of graph structure: two pictures are implicitly linked when they contain things that look like each other. It would be nice to discover that structure. And, third idea, once you know that structure, you can use it as a supervisory signal. That's the whole idea, and personally I think it's a very good idea. Now I'm going to show you how we do it, and I'm not going to claim this is the best way to do it; it's a way to do it, in the same discriminative-clustering spirit as before, which is very simple and may not be the best way, but I think the idea of using this implicit structure is good. Ah, and there is another idea, the idea of parts, but I will come back to that later. The way we are going to do it is to represent pieces of images by regions. We pick regions in an image, and some of these regions may enclose interesting objects in the images; in this example, the red one could enclose the bike. Some other regions may correspond to salient parts, defined in some sense, of the objects we are finding, and of course some other regions may correspond to random stuff. So we are going to try to match images by finding regions that may correspond to objects or their parts. Now, in the vision community, object proposals have become popular: you have some bottom-up process that puts a rectangle where it thinks there might be something interesting. Personally I'm not so keen on that, because I don't see why it should work without any prior information, but to be pragmatic, it works better with those region proposals than with random regions. We take between 2,000 and 5,000 region proposals in each image. And when we match these regions, we are going to bring in some very simple geometric information. Imagine we take two images: we take all the region proposals in the first image, all the region proposals in the second image, and we look at possible matches; a possible match m is a pair of regions, one in the left image, one in the right image. What we want to estimate is the probability that this match is correct given all the data that we observe, so capital D here is all the data that we observe. What we do is use Bayes' rule and marginalize over possible configurations of the two images, that is, possible offsets between a region in the left image and a region in the right image; by offset I mean translation and scale. If you wanted, you could add rotation, whatever, but here the configuration c is translation and scale. So we marginalize over it, and we decompose the probability of the match given a configuration times the prior probability of the configuration given all the data. Then we say, well, appearance and geometry are decoupled, they are independent of each other, so we can factor the first term into two: one term that depends on appearance and one that depends on geometry. The appearance term can be any similarity measure between patches, after normalization of course: Euclidean distance, correlation between SIFT descriptors, whatever you want. The geometry term can be, for example, something based on the size and position of the region once the configuration has been applied; those are simple, and you can come up with whatever you want. The complicated term is the probability of the configuration given the data, and that we don't know. So, when we don't know something, we make up an approximation for it: we use something like a probabilistic Hough transform, which is maybe not a very nice term, as a proxy for the probability of the configuration given the data. We take all possible matches and make them vote for that particular configuration: among all the matches, if a match is compatible with the configuration, it gives a vote for that configuration, and a better vote if its appearance similarity is good. It's an approximation, but it's reasonable, and it can be computed. So that's for one match. Now, for a pair of images, if you want a confidence for a given region, you take the maximum of the scores of all the possible matches that include that region. Here is an example with two images and all the candidate regions: if you color-code each region in each image by its confidence, blue being not confident and red being very confident, you can see that the red regions concentrate on the salient objects. We don't know that they are horses; but the two horses look like each other and the two backgrounds don't look like each other, and that's why the confidence concentrates on the horses. It would concentrate on trees instead of horses if the trees were the repeated structure. At the top, we have just displayed the top 20 matches in the two images, and as you see, they also concentrate on the objects, which is encouraging. If you don't use the geometry term, you don't get any structure. What is this one? I'm not sure, I'm sorry, I'm jet-lagged; I guess it was just a different example, and the probabilistic Hough matching again concentrates on the object. This was just for two images; for multiple images, you take, for each region, the average or the sum of its scores over the other images, and the more images you have, the more concentrated things become: at the bottom you see the result between two images, and with multiple images you get a lot more of these red regions. Again, this assumes that the images we are trying to match are linked, that they contain something in common. Now, something interesting: we would like to capture this idea of parts, the idea that you have an object, there is a box that more or less tightly contains it, and then there are bits within it that correspond to object parts. We don't know how to do that, and I think one of the things that people in vision don't know how to do today is to reason about parts in an intelligent manner.
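The vote-based scoring just described can be sketched in a toy 1-D setting (my simplification: the real configurations are translation plus scale in 2-D, and the appearance similarities come from patch descriptors, which are assumed given here).

```python
import numpy as np

def hough_match_confidence(regions_a, regions_b, appearance):
    """Toy probabilistic Hough matching over 1-D offsets.

    regions_*: region centers. appearance[i, j]: similarity of region i
    in image A to region j in image B. Every candidate match votes for
    its offset, weighted by appearance; a match's score is then its
    appearance times the total vote mass of its offset, and a region's
    confidence is the best score over matches involving it.
    """
    offsets = regions_b[None, :] - regions_a[:, None]
    bins = np.round(offsets).astype(int)
    vote = {}
    for i in range(len(regions_a)):
        for j in range(len(regions_b)):
            vote[bins[i, j]] = vote.get(bins[i, j], 0.0) + appearance[i, j]
    score = np.array([[appearance[i, j] * vote[bins[i, j]]
                       for j in range(len(regions_b))]
                      for i in range(len(regions_a))])
    conf_a = score.max(axis=1)  # max over matches that include the region
    return score, conf_a

# two regions shifted by the same offset (+5), plus one outlier pair
ra = np.array([0.0, 10.0, 42.0])
rb = np.array([5.0, 15.0, 99.0])
app = np.eye(3)  # each region matches only its counterpart in appearance
score, conf = hough_match_confidence(ra, rb, app)
```

The two geometrically consistent matches reinforce each other through the shared offset bin, so their confidence exceeds the outlier's, which is exactly the behavior the talk illustrates with the horse pictures.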
That's a shame, because I think it's important; I will show you a couple of examples later. Here, at least, what we would like to say is this. Imagine you just want the tight box around an object. If a region is too big, it will contain more background than the tight box; if it is too small, it will contain less foreground than the tight box. We should be able to use both facts, but so far we only know how to use the first one. So we define a standout score as the difference between the score of a region and the maximum score over all the regions that contain it, and that tends to select regions that tightly bound objects. I think this is an important point: trying to understand what's inside the box is important. And now, the algorithm. This is it: you take a bunch of images, and you initialize by looking for the nearest neighbors of each image according to some criterion. At the beginning you don't have any region-matching information, you don't have anything, so you have to use some global descriptor. We use a descriptor introduced by, not Efros, by Torralba and his former colleague Oliva: the global descriptor called GIST, which is reasonable for finding images that more or less look like each other. So we look for nearest neighbors; of course, we can make mistakes in that process. Then we try to match those neighbors, and once we have done those matches, we have some hypotheses for the objects in our target image, and we can use those to refine the search for nearest neighbors, and then we do it all over again. It's very simple; I'm not saying this is the way it should be done, it's just a way of implementing the idea, but I think it's a good idea. So, does it work? In general, it works pretty well: you get better matches after a few iterations. This is after one iteration, where you get these big tree regions up there; after a few iterations it concentrates on the object. Of course, this is anecdotal evidence, just an example. The retrieval, the search for neighbors, also improves with the iterations: at the first iteration you get all sorts of random things, and after five iterations you get essentially cars, which is what you want. You wait for a few iterations and you are done. And, because I didn't put it in a slide and I'm going to forget if I don't say it now: of course, this is ugly. We are not even optimizing a function; we have an algorithm, it does stuff, it doesn't lower any error measure, it just does stuff. That is ugly, and it would be interesting to try to formalize it as a proper optimization problem. Now, I said I would not show too many benchmarks, but here is one. There is a classical benchmark in vision called PASCAL VOC 2007; we took all the images, put them in one bag, shook it, and gave it to the program, and then we report CorLoc, which is correct localization, if my memory is correct. This is unsupervised, and it's always difficult to evaluate unsupervised methods, precisely because they are unsupervised. So, to evaluate it, we took PASCAL VOC 2007, where people have put bounding boxes around the objects of interest, because it's a database designed for object detection, and we say we are happy when the candidate object that we find coincides with, that is, overlaps enough, one of the boxes that have been marked; CorLoc measures that with an intersection-over-union criterion. What's interesting is the top table. This is per class, so we are not mixing the classes together, but there is no supervision whatsoever, except for the fact that the images belong to the same class.
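The CorLoc evaluation just mentioned boils down to intersection-over-union against the annotated boxes; here is a minimal sketch (box format and the 0.5 threshold follow the usual PASCAL convention).

```python
def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def corloc(predictions, ground_truths, thresh=0.5):
    """CorLoc: fraction of images whose top localization overlaps some
    ground-truth box with IoU above the threshold."""
    hits = sum(any(iou(p, g) > thresh for g in gts)
               for p, gts in zip(predictions, ground_truths))
    return hits / len(predictions)

preds = [(0, 0, 10, 10), (50, 50, 60, 60)]   # one box per image
gts = [[(1, 1, 10, 10)], [(0, 0, 5, 5)]]     # annotated boxes per image
score = corloc(preds, gts)
```

Here the first prediction overlaps its ground truth with IoU 0.81 and counts as correct, while the second misses entirely, giving a CorLoc of 0.5.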
Other methods essentially all have some supervision, with negative examples, even with some pre-trained CNN features, and with very, very little supervision we do quite well. Then we can also mix all the classes together, and that's the second table: it still does very well, and none of the other methods works in that setting. So I think that's interesting. Now, these are some examples of successes, and these are some examples of failures, and they are anecdotal as well; I'm not saying the failures always look like that. But a failure mode that is observed quite often, and which I think is natural, is that we find the head of the cow, or the head of the dog, or the wheel of the bike. What's wrong with that? It's unsupervised, and if you take mixed pictures that contain faces, well, faces are salient things, and the face of a dog looks like the face of a man, which looks like the face of a goat, more or less: two eyes, nose, mouth, ears. So it's a salient thing to find. Again, this has to do with the part question, and I think it's a failure mode of many detection algorithms: what is magic about the box around the whole body? We don't have a good way of reasoning about parts in general; I think nobody knows how to do that, but it's a really important problem. So, since we could do it in still images, we tried to do it in videos as well. That's mostly the work of Suha Kwak, with Minsu Cho, and I should also mention Ivan Laptev and Cordelia Schmid, who are involved in some of this work. The idea is to do the same thing as before, but introducing some temporal consistency. I'm not going to bore you with the details, but we can do that, and then you can do these things: we took clips from the Jason Bourne movies, clips that contain a lot of car chases, and we let the computer figure out what was interesting; and of course there are a lot of car chases, so it collects cars.
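Coming back for a moment to the box-versus-parts issue: the standout score defined earlier (a region's score minus the best score of any region enclosing it) is the one tool the talk offers for preferring tight boxes. A toy sketch, with hypothetical scores (the actual scoring in the work differs in its details):

```python
def standout_score(boxes, scores):
    """Standout score of each box: its score minus the best score among
    the boxes that strictly contain it. A tight, high-scoring box keeps
    its score; a loose box dominated by what it encloses is penalized.
    Boxes are (x0, y0, x1, y1)."""
    def encloses(outer, inner):
        return (outer != inner and
                outer[0] <= inner[0] and outer[1] <= inner[1] and
                outer[2] >= inner[2] and outer[3] >= inner[3])

    out = []
    for i, b in enumerate(boxes):
        enclosing = [scores[j] for j, o in enumerate(boxes) if encloses(o, b)]
        out.append(scores[i] - (max(enclosing) if enclosing else 0.0))
    return out

boxes = [(0, 0, 100, 100),  # whole image
         (20, 20, 60, 60),  # tight box around the object
         (10, 10, 80, 80)]  # loose box around the same object
scores = [0.2, 0.9, 0.4]    # hypothetical region scores
s = standout_score(boxes, scores)
```

The tight box comes out on top: its own score is high and the boxes containing it score lower, while the loose box is penalized by the image-level box above it.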
you know it's moving, the camera is moving the the cars are they are translating but they are turning sometimes they don't move and it works really well and my favorite example is this one we gave the poem The Movie Babe so it's The Adventures of a Talking Pig I've never seen it but it's supposed to be pretty good and the and the poem got pretty good at finding Babe and in general animals and their faces but very good at Babe you know we didn't try to the New York Times to tell them we did that but it's totally unsupervised give it the movie so I'm pretty happy but just written and you find sheep's face and then I think there are monkeys later and if I had pigs' butts and maybe some dogs but you know ok and more babe but that doesn't bother me at all you know what's wrong with finding the monkey's face as Babe's face and of course and of course I don't think you'd want to if you want to deploy an object recognition system you don't want it to be fully unsupervised there will always be some metadata but I think it's interesting and important to look at this unsupervising and what you can do without any prior knowledge so these methods can be extended I'm not going to the technical details but the way we treat geometry here is very important to have geometry makes a huge difference but it's too global because we get everything to vote so we also introduce a local voting procedure and I'm not going to describe and that improves things quite a bit and we also use that to match images to obtain what's called sometimes scene flow you have optical flow optical flow you have two images of the same scene you get the same scene but the camera has moved a bit and you want to see how the other pixels have moved in the image optical flow scene flow is when you have two images of the same category same type of objects but they are not the same same object and you would like to find the deformation from one image to the other and you can use these methods basically because you 
can do this region matching, and then you can interpolate the deformation between the regions. Here is the example: on the left you have two pairs of images, and what we are trying to do is deform the first image in the pair into the second one. We overlay the deformed versions on the image, and we compare our method to a bunch of classical methods for that problem. As you see, it does pretty well, and interestingly enough, I think it's essentially the only method that gives you a deformed image that looks like a real image. You can probably see that on the Dalmatian, and even on the hand. For example, for Snoopy it actually overlaps the thing very well; it's just the arms that don't match well, because they are not in the same configuration. You can even match a drawing of a pizza to the pizza itself pretty well. So I think that was interesting.

I'm almost done. It's funny, because we started unsupervised, but now we want to go back to very supervised. We'd like to use the same kind of ideas, but this time you know a priori that the images are in the same category, you want to match them, and we'd like to learn the matching process. It's not too complicated: this was our semi-probabilistic approach, and now we are going to replace all the probabilities, or proxies for probabilities, by similarity functions. We can rewrite all these terms in the same way, and the W here will be the parameters that we are going to try to learn. So we are going to learn similarity functions while putting some geometric constraints over them; it turns out we could have done that there as well. When you do it, some kernel appears in the middle, and basically you get a similarity function built from your match similarity function, some geometric kernel, and some neighboring matches. And then, that's ugly. What I don't like about CNN papers
is that they are horrible, you know: people put in all these layers, five conv layers, three max poolings, two fully connected, whatever, 64 of I don't know what. For me that's a black box, and this picture is a bit too complicated for me, because I don't understand it; apparently the dotted thing helps, but it's the solid thing I understand. Basically, all this is: you have a black box that computes f, and it's a black box that's differentiable, so it's called a net. What this does is compute the function I was showing you in the previous slide, and then you optimize, and then you get something that can match regions.

There are not many methods that do that. It's an old idea to learn to match; it goes back at least to a paper by Yann LeCun from about ten years ago trying to learn stereo matching, and there's a paper by people from Stanford and NEC that is essentially the same idea, but it doesn't have the geometric knowledge. So, as expected, we do better. What's interesting, I think, is this: I told you that we have this localized way of using geometry, and we have not implemented it in this approach yet. But we have seen that if you use the features from our previous approach, instead of using SIFT or whatever, then we get a lot better results with the localized thing. That means that when we implement the localized thing in this network, we should do a lot better, so I think that's kind of interesting.

I will stop there. This was just a short tour through our experience; I think it's a key component of what we should do in vision today. Thank you. Questions?

Yes, I like deep learning, but I am not obsessed by it, and it doesn't work for the unsupervised setting, which is good for us. That's pretty much it: I like deep learning, I think it's good. We have converted ourselves to artificial intelligence for several reasons, but I think it's very good that some of it
has moved to industry. If industry... no, no, no: if industry believed that this was a done deal, they wouldn't sponsor us, they wouldn't hire our students, they wouldn't try to work with us, they wouldn't have research labs, etc. So I think most of the innovation still comes from academia.

Again, on unsupervised learning: I don't know, I'm not the most devoted follower of CNNs in the world, and there are people who are much more knowledgeable than me, but I don't know of any unsupervised work that works in that area. What I have told people a lot is that if you look at the old CNN papers, they are not just doing CNNs. If you look at the original LeNet paper, they have something called graph transformer networks, which are structures that are a lot more complicated, where you have differentiable blocks that talk to each other, and that's how they managed to read the checks: not just by using the net, but by using that stuff. So I think the idea of having differentiable learning systems that you can train end to end is good, of course, but I think you need to look at more interesting block structures, where you understand a bit how you program that thing. Just doing this is not so interesting. So I'm not pessimistic at all. I think we are in good shape. I think it's good that we are having an impact on industry at the same time as there are plenty of interesting problems left. Maybe I'm naive. Thank you.
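The learned-matching idea discussed in the talk, a similarity function with learnable parameters W combined with a geometric kernel over neighboring matches, can be illustrated with a toy sketch. This is only a minimal illustration under assumed names and a deliberately simplified objective, not the speaker's actual formulation: here the appearance term is a bilinear form f_a^T W f_b, and the geometric term simply rewards candidate matches whose displacement agrees with the mean displacement of the current matches.

```python
import numpy as np

def match_regions(feats_a, pos_a, feats_b, pos_b, W, lam=0.5, sigma=0.2):
    """Toy region matcher (illustrative only).

    feats_a, feats_b: (n, d) region descriptors for the two images.
    pos_a, pos_b:     (n, 2) region centers.
    W:                (d, d) learnable parameters of the similarity.
    """
    # Appearance term: learned bilinear similarity f_a^T W f_b for all pairs.
    app = feats_a @ W @ feats_b.T                       # (n_a, n_b)

    # Initial matches from appearance alone.
    match = app.argmax(axis=1)
    disp = pos_b[match] - pos_a                         # per-match displacement

    # Geometric term: prefer candidates whose displacement agrees with the
    # mean displacement of the current matches -- a crude stand-in for the
    # "geometric kernel over neighboring matches" mentioned in the talk.
    mean_disp = disp.mean(axis=0)
    cand_disp = pos_b[None, :, :] - pos_a[:, None, :]   # (n_a, n_b, 2)
    geo = np.exp(-((cand_disp - mean_disp) ** 2).sum(-1) / (2 * sigma ** 2))

    # Combine the two terms and re-match.
    return (app + lam * geo).argmax(axis=1)

# Usage: four distinctive regions shifted by a global translation; the
# matcher should recover the identity correspondence.
feats = np.eye(4)
pos_a = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
pos_b = pos_a + np.array([1.0, 0.0])
matches = match_regions(feats, pos_a, feats, pos_b, np.eye(4))
```

In a learned version, W (and the weight lam on the geometric term) would be fitted by gradient descent through this differentiable score, which is the point of the black-box-that-computes-f picture.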