So we have Joan Bruna, who studied at the UPC, then went to Paris, and has worked in several places — we were talking about this just now. We are very grateful to him for coming here. About the group, I just want to mention that today we reached 1,000 members. Aside from this, the next session will be in June, and it's about big data for crisis situations, natural disasters and that kind of thing. Carlos Castillo is the director of OECA, a research institute here. Okay, do you hear me? Okay, so hi, yeah, I'm Joan. I'm very happy to be here. Thanks, Alish, for inviting me. Okay, so how many of you here are somehow related to a university? Just for me to get a quick sense. Okay, and the others, I assume you are in industry or just here because you like machine learning? Okay, good. I guess most of you have heard about deep learning, maybe not in the right places — maybe you've heard about deep learning in the news, on television. So that's why I'm here, trying to set things a bit straight for you: not everything is good, but there are things that are good. So what I will try to do in the time that I have is tell you a little bit about what I think is good in the field, which things are not so good, and which things maybe can be improved. And if we have time, I will end with some more researchy slides telling you about the new things that I'm working on with some of my students. Okay, so I would like to make this super informal, so you will be asking me questions and I will be asking you questions. So don't be afraid — I mean, those of you in the first row are probably going to get more questions than the others, so if you want to move, you can do it now; hopefully you will stay. Okay, so for those of you who have never heard about deep learning or never heard about convnets, I'll have some slides for you.
So just stop me whenever you have a question, something that is not clear, okay? Okay, so what is good about deep learning? What I think is good in deep learning is that there's one sub-area, convolutional neural networks, and these are things that really work. So anything that is not a convnet, and they try to sell it to you as some deep learning algorithm — "it works wonderfully on my dataset" — ask if it's a convolutional neural network. If it's not — well, RNNs and other things are also very cool, but I kind of think convnets are cooler. Okay, and I'll try to explain a little bit why I believe that's the case. Okay, so what is a convnet? Actually a convnet starts with all of us — it starts with the brain, okay? The visual cortex is some stuff here that we have at the back of our skull. We use it for knowing where we are and for recognizing stuff. And there were very intelligent people, Hubel and Wiesel, who were wondering what was the right model for that part of the cortex. And for this work they eventually got the Nobel Prize, by finding the right model. So what was this model? It's this idea that we have little cells in the retina that capture photons that come from the world. And these photons are aggregated into small receptive fields, as they called them, which send this information to another layer of neurons. And these neurons receive the information sent by the first layer of neurons, do some complicated stuff, and then there's another layer of neurons that takes the responses of these neurons and does something a bit similar, and so on and so on. So this is a model that makes sense, right? Why not? But the really interesting thing is that it was really inspired by biology, okay?
So convnets and deep learning really start with people literally opening the skulls of animals — I don't remember which animals they were — coming up with theories and trying to validate those theories. So far, no computer science, right? This is just doctors in a lab. So when does computer science enter the game? It enters with this crazy Japanese guy, Fukushima, who was really a pioneer. He realized: well, if a brain can do this with that architecture, perhaps I could use the same architecture, implement it in a computer, and use it to solve artificial vision tasks. That was his idea, and it was actually a paper that was somehow ahead of its time. So that was the architecture — except there was a problem: he didn't know how to train it. In this architecture, what do we have? We have, again, an input — these are pixels. There are two operations on these pixels that are repeated over a number of layers specified by this diagram. But of course, these operations require some parameters. How do you take these pixels and produce these ones? You need to construct some form of weighted average. So how do you choose these weights? That's what machine learning is about, right? Looking at the data and coming up with the best set of parameters that solves the task. He didn't really know how to do that. He just had some heuristics, and it didn't really work so well, so this thing was not really the answer. And then we had Yann LeCun, who still believed that this was a good model, a cool model, and he just combined this model with the right algorithm. So he decided to train this architecture using — I think I'm going to switch to this one because it's better. Is it better? Is it good now? So Yann took this architecture and he combined it with a very naive algorithm.
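The repeated "weighted average over pixels" operation he's describing is just a discrete convolution followed by a nonlinearity. Here is a minimal sketch of one such layer — an illustration of the idea, not Fukushima's actual Neocognitron; the 5×5 image and averaging kernel are made-up toy values:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: each output pixel is a weighted average of a
    small receptive field of input pixels, with the same weights (kernel)
    shared across the whole image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 "image"
kernel = np.full((3, 3), 1.0 / 9.0)                 # 3x3 averaging filter
feature_map = np.maximum(conv2d(image, kernel), 0)  # pointwise nonlinearity
print(feature_map.shape)  # (3, 3)
```

The learning question he raises is exactly: how do you choose the numbers inside `kernel` from data, rather than fixing them by hand?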
Some of you here have probably done math in high school — I guess all of you. So he basically combined the chain rule with this model. Once you have some objective — you have an input, the model computes some output — you tell the computer: I would like this output to be closer to some target. For example, if my computer decides that I'm, I don't know, blonde, I will tell the computer that it should modify its output to move toward something brown-colored, or something like that. And then you just have to propagate that information back into the lower layers. That's something that he managed to do, in this very small network. And that really created something that was hugely popular — not only popular in the machine learning community, in the academic world, but also really deployed in industry. It's something that I mentioned yesterday, that I find is sometimes a bit understated: I told you that you have this model that comes from neuroscientists mid-century, and then you put the pieces together. Someone could say, what's the point, right? It's super easy; that's really nothing new. But you have something that reads — I don't know how many million checks a day using this, right? So if it was so easy, why hadn't it been done before? Let's keep this in mind. It's one of the strong points of deep learning: people don't really appreciate these models because they are doing simple things, right? But this thing that works is very important. Okay, so that's where we are. We have these models that are good at automatically recognizing digits written on checks or papers or whatever. And as a matter of fact, these things were working very well not only for reading checks and numbers, but for solving many tasks across computer vision. Face recognition is one example.
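The chain-rule training he's describing is backpropagation. Here is a sketch on a tiny one-hidden-layer network — a generic illustration (random toy data, squared-error loss), not LeCun's original code — with the analytic gradient checked against a numerical one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)              # input
t = np.array([1.0])                     # target output
W1 = rng.standard_normal((4, 3)) * 0.5  # first-layer weights
W2 = rng.standard_normal((1, 4)) * 0.5  # second-layer weights

# forward pass
h = np.maximum(W1 @ x, 0)               # hidden layer with ReLU
y = W2 @ h                              # output
loss = 0.5 * np.sum((y - t) ** 2)

# backward pass: propagate the error to the lower layers via the chain rule
dy = y - t                              # dL/dy
dW2 = np.outer(dy, h)                   # dL/dW2
dh = W2.T @ dy                          # dL/dh
dh[W1 @ x <= 0] = 0                     # through the ReLU
dW1 = np.outer(dh, x)                   # dL/dW1

# check one entry against a finite-difference gradient
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
hp = np.maximum(W1p @ x, 0)
loss_p = 0.5 * np.sum((W2 @ hp - t) ** 2)
numeric = (loss_p - loss) / eps
print(abs(dW1[0, 0] - numeric) < 1e-4)  # True: the chain rule matches
```

That's really all there is to it mathematically: apply the chain rule layer by layer, from the loss back down to the first weights.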
And somehow, despite being so good, if you had gone to a conference in, I don't know, 2009 or 2010 in computer vision, like CVPR, or even NIPS, people were really laughing at these things. They had Yann in the corner of the room; he would give his talk, people would applaud very politely, but then — okay, next one. And that was the state of deep learning until 2012. Despite working very well — I mean, here you have an algorithm that manages to recognize faces in a picture; okay, that's very cool. There were other things these models were applied to, and they were actually very good at them. They were very good at, for example, automatic scene labeling. So here you have a camera — that you could use, I don't know, on a GoPro, say — and you're walking along the street, and you would like to label every pixel that you see as belonging to: that's the road, that's the building, that's the sky, all these things, right? — is it good now? Okay, that's much more similar to what I usually do. Okay, so that's a problem where every pixel has to be labeled. Imagine: this is much, much harder than just coming up with one label for the whole image. How do you do that? You do it with a convnet — a particular convnet, but just keep this in mind: this is a model that you can use for that task as well. By the way, this is a task that, for us humans, is actually pretty hard. You have to really spend a long time doing this. If I give you a picture, yeah, maybe you can do it, but it doesn't take you 20 milliseconds; it's really quite laborious, and these models are pretty good at it. Okay, so this is a slide that I have here for the skeptics. I don't know how many skeptics are here — people who are real purists in math — but this is basically the slide for you.
So deep learning is just a bunch of algorithms that are pretty naive. Here I have the formula — in a sense, this is one slide that summarizes all there is to know if you want to just go and implement your own deep learning model. You just need to know that you repeat an operation that is essentially a linear operation followed by your favorite pointwise nonlinearity — you can choose thresholding, you can choose anything else — you stack a bunch of them, and you end up with a loss, essentially a model for output probabilities over discrete spaces, and then you train it with stochastic gradient descent. That's it. And in a sense, that's why it was too simple — that's why people didn't like it; it was too simple. There are many other reasons why people didn't like it. Of course, for these things to start working well you need a relatively large training set. And that, I agree, is a problem, and a valid point — but it doesn't mean it's useless, right? There are many settings where we do have plenty of data points.
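The "one slide" recipe can be written out explicitly. A sketch under standard conventions — the softmax cross-entropy loss shown here is one common choice for the discrete-output model he mentions, not necessarily the exact formula on his slide:

```latex
% One layer: a linear map followed by a pointwise nonlinearity, stacked L times
x_{l+1} = \sigma(W_l x_l + b_l), \qquad l = 0, \dots, L-1

% Loss over K discrete classes, with correct class y and logits z = x_L
\mathcal{L}(W) = -\log \frac{e^{z_y}}{\sum_{k=1}^{K} e^{z_k}}

% Stochastic gradient descent on a random minibatch B, step size \eta
W \leftarrow W - \eta \, \nabla_W \, \frac{1}{|B|} \sum_{i \in B} \mathcal{L}_i(W)
```

Three lines: a repeated linear-plus-nonlinear operation, a loss over discrete outputs, and a gradient update.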
There were also a lot of people defending their territory, okay? You spend, I don't know, one year and two grad students coming up with good features for your problem, and now someone else comes and tells you, well, I can do the same automatically — it's not so easy, right? There's some ego involved, there's some politics involved; that's not very good sometimes. And there's another reason, for the more mathematical people: it's ugly. When you have to analyze this problem, we don't have any theorem; we cannot prove anything about the optimization of this problem. So some people, you tell them that they cannot prove anything, and they run, okay? They don't like it. There are many other reasons; I'm not going to spend a lot of time here. But then what happened? It was 2012. There was a team of crazy people in Toronto — a very cold city sometimes; it might correlate, right? If you are in a place that is super cold and you have a lot of time to work in your room, maybe you can create a breakthrough, I don't know. So what happened is that they took this model and they continued to believe, right? And it's pretty crazy — I know these guys, and when we talk about it with colleagues, it's so crazy. You have an algorithm that the field is super skeptical about; no PhD advisor would tell you to work on this model. There were a couple of PhD advisors who were crazy, like Hinton and maybe Yann — so there were only two PhD advisors who believed in this model — and you have PhD students in those groups, that group in Toronto, who spent months and months and months trying it. But think about this: until it works, it doesn't work. So you try this, it's crap — why would you continue? You'd jump to something safer. So you really need to be a bit crazy. They continued; they
tuned the hell out of it. And it was not just tuning; they really had to solve a huge architectural problem. It was a time when GPUs were not yet the commodity they are today; they had to be very creative to make those GPUs work. But they managed to do it, and they completely crushed this problem. For those of you who don't know what this is: it's like the Olympics of computer vision. You have a competition where everyone competes with an algorithm, and you are ranked automatically — by a jury, if you want, but a completely objective one: you send your algorithm, it's tested on a dataset, and you are ranked by how well you do. They did much better than the rest using a convnet, and then of course everyone realized that that was the way to go. From one entry you jumped to 20 the next year, and here maybe you had 200 or 2,000, whatever. So now it's all red — the field has completely changed. And this is the most challenging task in computer vision; it's not the only one, but it was really the one that decided everything. Any questions so far? You know all this stuff? What?
Red is the convnet — red is using the method that no one liked before. And here's another picture showing the same thing. Humans are somewhere around 4% on this task, so we are superhuman on that task — sorry, no, I take it back: on that task. Given this particular family of images, these algorithms learn to do better than that. It doesn't mean that if I give this same algorithm images with labels it has never seen before it won't be much worse; but on the things it has been trained on, it does better than humans. And that was really, really not the case five years ago. Same thing for other tasks — not just classification but also localization. What is localization? It's finding where the object is in the scene: not just telling you what it is, but also where it is. Also completely crazy — the jump that we have here is really, really important. Just for some perspective: a jump from 20% error to 5% is something that can take 40 years in a normal field, in normal circumstances. Here it's one year. And not only classification — again, this is just background and context — you can use the same thing for speech, not just to recognize images but also to recognize speech. Speech recognition is a field that is pretty old — it's almost 30 years old now — and people have been working on it a lot, a lot, a lot. You can imagine the business impact, the economic interest, of being able to understand what we say automatically; you can imagine the scale of this problem. So you had all these monopolies, like AT&T, Bell Labs, heavily investing in this problem — IBM has historically had one of the strongest groups in speech recognition — but somehow they reached a plateau. They were improving, then they reached a plateau; things were not improving. Why?
We don't know. But then they started to use these models, these convolutional neural networks, applied to speech recognition, and boom — we get the same sort of dramatic improvement. So we recognize faces, and not only that. I told you before, in the ImageNet problem you have a thousand categories — a thousand different objects that are chosen somewhat randomly: you have dogs, you have flowers, all this stuff. Here you have the face recognition problem — that's something deployed at Facebook. How many labels do you have? Billions. So this algorithm is able to pick out a face among a billion — the American billion, okay, not a Spanish billion. And how does it do it? Same story — of course there's a lot of engineering work behind it to make this happen, but it's basically the same model. I don't know if you are happy knowing that Facebook can recognize faces. I think I don't care — I mean, I'm a bit worried about privacy, but I don't know, that's more a philosophical question. I think it's not a big problem — and Zuckerberg is a nice guy, so he's not going to do anything wrong to you. Okay, what else?
It's not only for images and speech; you can also apply it to any data that has some sort of spatial structure. Not just images, which have pixels organized in a grid; speech is organized as a time series — you have information coming at every time step — but there are many other examples where you have structure that can be absorbed, eaten up, by a convnet. This is one example; you have many others. So, other tasks: object localization — I think I told you about it — just deciding where things are. Pose estimation — that's done by a friend of mine, that's why I put it here; a very cool paper if you don't know it, also very funny. Segmentation — think about it, it's completely crazy: this is done automatically. You enter the image and the algorithm understands that this is a single object. I find it completely crazy, very impressive. And if you had shown this five years ago, they would have said you're cheating. This is something that changed, and it happened very, very fast; and when I show you the algorithm that does it, it's only convnets. Yes — well, no, not label-free; that would be even crazier, if the thing learned by itself. You need labels. Other things: something that we, and others, have worked on — captioning: you show an image, and the algorithm tells you what's going on. These things are really important, they can be very useful for many people, and of course Google, Facebook, Twitter are using them to make our interaction with these systems much easier. Someone was telling us yesterday that he actually relies on this technology to find his car, I think, or to find whatever he has in there. And of course you can use this in neuroscience. I was talking with people in neuroscience, and they tell me that — if you go back to the first slide, I told you that you had these two guys coming up with "maybe this is a good model to understand vision" — well, now we have something that we can really use
as a tool to test our theories. If this is a good model, maybe we can turn understanding into a hypothesis that we can verify: we can say, if we take this image and we produce an image that has this representation, then this should happen — and now we can go to a subject, try it, and verify the theory. Neuroscientists are really excited about these things; it's really important for them. I have a little video here — I don't know if it's going to work, but let's try. I don't know if you saw it before. So that's another place where you can see this technology in action. Of course it's not me who did this; it's actually a colleague from Berkeley, Sergey Levine, and he used money from Google to build this "arm farm", as they called it. It's like a factory of robot arms, and they are all trying to learn how to grasp things. You have a tray of objects, and the goal of the game is to grasp them. Super easy — I don't know at what age babies do that; maybe 6 months, a bit later? So it takes a human about 6 months to learn how to grasp. I don't know how long it took to train this thing — it stopped — but it was pretty impressive. They got this result a couple of months ago, and the key for this thing to work is again the same tool, the same element. So these algorithms can really make a lot of things work better, and there's no question: this is a good part of deep learning. This did not happen before, and who knows if one day we'll be able to do it with an alternative technology, but here we are — we have these things that are really helping a lot. Other things that I like about deep learning: I like the fact that it's a very democratic thing, and let me try to explain what I mean by that. Probably some of you are tinkering with these models. How do you go from these mathematical formulas to something that actually works? You have to program
these things, you have to implement these things, and it turns out that we have very good tools for that. Probably most of you here are now big fans of TensorFlow — there's a book; I think it was presented here in this meetup, I guess. TensorFlow is good, but you have alternatives: this one comes from Berkeley — that's why I put it here — this one, Torch, comes from Facebook, and this one comes from Montreal. These are all different environments, software mostly built on top of C++, Lua and Python, and they are now super popular. They're getting a lot of visibility, a lot of exposure, not just within the machine learning community but in the software community, so everyone can do it. You can now go to TensorFlow and have your neural network up and running in like 20 minutes. Even me — I can do it, and I'm not very good at coding. Okay, what else? You have the tools, you have the software, but how do you learn this stuff? Deep learning is actually starting to be very, very accessible — you have a lot of resources available. You have lectures on YouTube — you can find a lot of them, and the machine learning ones in particular are the best. You have a very nice online class from Google Brain — this is Vincent, from Google Brain, a very nice guy; you should try to follow his class. And you have not just lectures and videos: if you want to go more in depth, I really recommend Andrej Karpathy's class — I don't know if people here have followed it, but it's a very nice class. And then there's also my stuff, if you want to get a bit more mathy; there's some material online that you can check on my website if you are interested. And you not only have lectures — the research is really in the open, so we are
in a community that really likes arXiv — and I will come back to this a little later; maybe we like it a bit too much — but we really like arXiv, so you can really stay up to date in this field. Not only that: we have software, we have algorithms, we have theories available, and we also have free data. That's something that — if you are working in biology or in medicine, well, you'd better be in a very rich, strong lab if you want to do experiments; data is expensive. Here it's completely the opposite. We have the famous ImageNet — you can play with it for free, you don't have to pay anything. If you are into reinforcement learning, I really invite you to check out OpenAI Gym — that's a platform that is really trying to create something very democratic for competing in reinforcement learning. You can use the COCO dataset for localization and captioning, and you can use another thing that I like a lot, called the bAbI tasks, from Facebook. For example, you can try to see if your algorithm is good at following a dialogue — whether it understands how to answer these questions. So for example: Mary moved to the bathroom. John went to the hallway. Where is Mary?
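For a sense of what the task asks, here is a trivial hand-written baseline for this particular kind of question — real bAbI systems learn the behavior from data; this just tracks the last location mentioned for each person, and the story tuples are a made-up encoding:

```python
# Toy baseline: remember the most recent location mentioned for each person.
story = [
    ("Mary", "moved to the", "bathroom"),
    ("John", "went to the", "hallway"),
]

location = {}
for person, _, place in story:
    location[person] = place  # keep only the most recent location

print(location["Mary"])  # bathroom
```

The interesting part, of course, is getting a model to discover this rule from examples instead of being told.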
Well, for us it's completely trivial — but go and try to implement an algorithm that is able to answer that question, and compare yourself with competing algorithms. And we now have people who have worked a lot to make this comparison very easy and painless. Okay, so I told you about cool things, about applications, and that it's easy for everyone to use. But me — I also like math a little bit; I like to understand things, I like to go a little bit into the theory. And it's really, really the best time for me to be in this field, and the reason is that we understand almost nothing. We know very little about these things. I told you that they are wonderful, but the fact is that we don't really understand why. Why do these things work so well? We might have some ideas — yesterday in my talk you might have gotten some ideas about why this could be the reason these things work — but we are still very far away; there are many, many things we don't know. The optimization aspect of these networks is really at the center of the stage. Everyone uses an algorithm that was invented in the 50s: stochastic gradient descent is an algorithm from the 50s. It's an algorithm that works for everything — it's as if you had a tool, like a screwdriver, that was one-size-fits-all: a screwdriver that works for any problem. We don't have any better tool right now, and we are still working on a class of problems that is very, very concrete, very small. Can we do better?
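The "screwdriver" he keeps coming back to, stochastic gradient descent, fits in a few lines. A minimal sketch on a toy least-squares problem — the data, step size, and seed are all illustrative choices, not anything from the talk:

```python
import numpy as np

# Minimize (1/n) * sum_i (w . x_i - y_i)^2 over w, one random example at a time.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                            # noiseless targets for the toy problem

w = np.zeros(d)
lr = 0.01                                 # step size
for step in range(2000):
    i = rng.integers(n)                   # pick one random example
    grad = 2 * (X[i] @ w - y[i]) * X[i]   # gradient on that example only
    w -= lr * grad                        # the 1950s screwdriver

print(np.round(w, 2))
```

The same loop, with the loss and model swapped out, is what trains a billion-parameter convnet — which is exactly why it's striking that we have nothing clearly better.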
We don't know; we don't have a better tool right now. Something that might be relevant for the folks at the Barcelona Supercomputing Center: how do you distribute these algorithms? That's another very complicated open question in theory. Something that is relevant for people more into the neuroscience world: you have these algorithms that produce awesome outputs, but if you want to do science, and you want to relate this to something real, you need to interpret — you need to give a meaning to these neurons, to these features — and very little is known about this. And then another thing that is very important — I put it later in the box — is that we need error bars. We are here doing, essentially, statistics: we have data, the data is a sample of the world, and we are making predictions outside what the algorithm has been tested on. So there's always an error, and we need to report these errors. For those of you who are maybe going to apply this stuff, and maybe someday want to publish: I tend to reject papers that don't even mention error bars, so I hope you will get a reviewer who is nicer than me. Here's a list — for those of you who know what these mean — of a bunch of things that I consider to be part of the picture. And actually, it's an interesting slide, because what I'm trying to say here is that it's a very hard problem. It's not: you take a TensorFlow class for a couple of hours, then you take Andrew Ng's machine learning class, then you read a couple of papers, and you are good to go, you can crack the problem. You really need to be pretty good at all these things — it's really deep, really serious stuff. Of course this could be completely wrong; if tomorrow someone comes and solves the problem, you can throw this slide away. But that's what I think today. Any questions?
Okay, a question that I was expecting to get asked today — but okay, here it is for you. I guess you know this fable from La Fontaine, the tortoise and the hare — what is "hare" in Spanish? Liebre. And the tortoise. So that's the fable; I guess you have heard the story: someone who is, a priori, going faster than the other, but in the end it's not so clear. That's something you can use to compare theory and practice. Typically you have the practitioners and then you have the theoreticians, and they work at different speeds. They are together in the field, they help each other, but some people — the people who are just implementing, the engineers — go much faster. Why do we care about theory? Theory is the tortoise: "whatever this tortoise is going to do, I don't care at all" — and actually machine learning really is an experimental science. We need experiments; we can do all the statistical learning theory we want, but if the stuff doesn't work, we don't care. And it's normal, in a sense, that the hare is faster at some point — it's normal that the experiments go ahead of the theory. I only know one example, one case, where that was reversed — maybe you guys can tell me another one — but for me the only thing I can think of is Einstein. He's the only guy who said: there will be these gravitational waves, because look at my theory — and a hundred years later, there they were. SVMs — where would you put them?
They came out of Bell Labs, and people were tinkerers there — yeah, that's true. With SVMs there was a point where maybe the theory was ahead, and that was really pretty impressive — but, well, Einstein is very impressive as well. Another question is: why do we care, other than that it pays professors' salaries and maybe teaches your kids? Well, I think theory is important, first, because it really creates a legacy — passing knowledge to the people 100 years from now. It's not just this ImageNet graph I showed you before, with a complicated algorithm with 2 billion parameters. Sorry — I don't understand the question. It's just a joke, guys. Oh no — they won't do everything. Did they? Well, it seems like — did they suggest they were going to do everything? They are very good at solving some vision tasks, and you will see later that there are many things they don't know how to do. And the fact is that there are many things that we — I mean, I don't know if he was the one who said it, but there was this Republican from the Bush government, maybe you know him, who said: there are things that we know, there are things that we know that we don't know — yes — and there are many things that we don't know that we don't know. Right, Rumsfeld. So there are many things that — I mean, it's not even that we can say we don't know them; we are at a stage where we are completely lost, okay? So what I said is: there are many things that we want to understand before passing them on to the next generation. In a sense, it gives us closure, right? There are a lot of people who are not happy until they really understand what's going on, and that's where theory can help. The other thing that I really believe is that, despite what you might think, understanding a theory is actually easier than understanding a collection of experiments. If I have a collection of experiments, how can I make sense of it? There are many disciplines
right now in deep learning where you start looking at the papers and, well, it's a mess, right? How can you even understand what is going on if every new experiment might give you a completely new data point? I think it's pretty natural that theory is like a kind of meta-learning: you try to put all the points, all the experiments, together into some sort of unified picture that connects everything. That, for me, is the role of theory, and it's not a problem if it comes later. Another interesting thing is efficiency: how can you be sure that you are on the right track until you have a solid theory for it? There's a very nice example that Yann sometimes talks about — I don't know if you know Clément Ader. He was a French inventor from the beginning of the century, and he was obsessed with learning how to fly, and he came up with this machine here. And you can feel that there's something pretty wrong, right? This contraption looks like a bat. In a sense the reasoning was: a bat knows how to fly, therefore if I imitate the bat, I will learn how to fly. So he didn't really understand the principle of flying, right? He didn't understand that the wings give you this force — I think it's called "portance" in French, lift; I don't know the word in Catalan — he didn't really understand the principle that makes planes fly. And the other thing that is very important, I think, is negative results. Theory can sometimes be used to say: you cannot do better than this, there's no point in continuing to investigate — the theory is nailing down what we can and cannot do. And here in deep learning we really need negative results. This is something that, you know, you see in evolution as well — there are things where we understand what's going on,
right there's no point in continue to investigate and I'm not an expert in genomics or whatever but you can you and you see what I mean right and that's an example that I really like and I apologize for this little detour it's a like a couple of slides that I have a little bit has have two theorems sorry about that okay but it's something that I really like okay who knows about optimization here who is using only three people no more people okay so you know great in descent right great in descent good so great in descent is an algorithm to optimize a function right and we know how to say things when this function is convex convex is the function that has a like has a well okay so I think that it's a convex function so it has only one point where it's minimal okay and this point is well identified so you can understand how fast you reach that solution what is important here is this little rate okay this one over T okay it means that I can go fast I can approach a solution at the rate that is one over T and there are some assumptions question can you do better can you do better only using basically gradients and what I mean by using gradients is using the same complexity okay you don't want to make the algorithm super expensive using the same complexity can you go faster to the solution forget about Hessian yeah Hessian is too expensive to compute it's quadratic yeah and then you can approximate the Hessian to make it linear but then you are in this territory okay so everything that is so that's the difference that I say with first and second all the methods okay first of the methods are methods that only look at the gradient of the function okay so I think that you have a query and the query only gives you gradients where you are not allowed to compute Hessian sorry sorry because you need to invert the Hessian right if you want to apply Newton's method so how many operations do you need to invert the Hessian that has a million by a million well do you want to wait I 
mean we can make a we can make a game right I you run the Hessian I run the SGT and let's see who goes faster well but this question is on a bounded amount of time right I only have a couple of hours do I want to use the Hessian or not okay so here I I'm not going to use the Hessian okay I'm going to use only gradients can I do better someone you know you know what what do you have to do yes momentum okay momentum is what we have when we move right we have some you know I mean I'm running so I have momentum so in a sense that the gradient tells me that you should turn but I'm moving so I have to take a little bit longer to turn right that's momentum so what do you do with momentum so crazy another crazy guy Nesterov he understood everything he tells well instead of doing gradient in this time why don't you do that you don't really understand where these things come from you do that and boom you replace the t by t squared okay and actually it looks it looks complicated it is complicated people who work on optimization still don't understand how come Nesterov had this idea okay it's it looks it's really magic and it works very well okay everyone uses momentum okay now I ask the friend question can you do better no no you don't know it but no but I mean even if you are in a non-compact optimization function the only thing you can hope for is to go as fast as you can to whatever local minimum you have closest to you so if I give you the choice you can go very slowly to a local minimum or you can go very fast to a local minimum what do you prefer we don't know how to do that we we don't have any algorithm that can magically jump from a bad local minimum to a good local minimum I mean if you know how to do that you will probably get very famous and very rich okay it's very it's a very hard it's a very complicated thing to do okay and I mean maybe you can do this in some context and that's a very open interesting question yeah yeah so you are following the gradient okay 
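A minimal numerical sketch of the contrast just described, plain gradient descent versus Nesterov's accelerated method on a simple convex quadratic. The matrix, step size, and iteration count here are my own illustrative choices, not from the talk:

```python
import numpy as np

# Sketch of the rate story: plain gradient descent, O(1/t), versus
# Nesterov's accelerated gradient, O(1/t^2), both first-order methods,
# on the convex quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 100.0])      # ill-conditioned convex quadratic
L = 100.0                      # Lipschitz constant of the gradient

def grad(x):
    return A @ x

def f(x):
    return float(0.5 * x @ A @ x)

x0 = np.array([1.0, 1.0])

def gd(steps):
    # Plain gradient descent with the standard 1/L step size.
    x = x0.copy()
    for _ in range(steps):
        x = x - (1.0 / L) * grad(x)
    return x

def nesterov(steps):
    # Nesterov's accelerated gradient: the momentum term combines the
    # current and previous iterates with a carefully chosen weight.
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(steps):
        x_next = y - (1.0 / L) * grad(y)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

print(f(gd(100)), f(nesterov(100)))  # accelerated run ends much closer to 0
```

After the same budget of gradient evaluations, the accelerated variant is much closer to the minimum, which is the whole point of the 1/t versus 1/t² comparison.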
and the gradient is going to push you down. So if, to go even deeper, you first have to go up and then go down, that's a pretty big risk. There are people who like to see this question as a metaphor for human life, and there was this very nice metaphor from John Schulman last week in the reinforcement learning lectures: you can do two things, you can start smoking crack, or you can go to school. So what do you prefer? If you look at the immediate reward, what is going to make you happier in the next ten minutes, you want the crack, right? I'm borrowing his example, okay. Going to school is painful, and you only start to see the nice effects of going to school many, many years later. So it's a very hard thing to do. Humans have somehow learned how to reach better local minima than the one that comes from taking crack, but algorithms don't know how to do that. Well, some algorithms do, but not at this complexity, not in so many dimensions. I hope that answers the question a little.

So, can we do better than momentum with a first-order method? No, we can't. There's another theorem that tells you that, essentially, there will always be functions out there where the rate of any first-order algorithm is at best 1/t². So Nesterov essentially closed the problem: using momentum, there is no hope of doing better. And here's the question I put to you, and I'd be super happy if someone answers, otherwise no worries: does anyone have an intuition for the underlying reason? It's a Russian guy, another Russian guy, of course: Chebyshev polynomials. And that's something super beautiful. We studied Chebyshev polynomials in some random lesson, probably at university, not high school, and now it turns out they nail this question. Imagine an engineer at a deep learning company who says: I have this cool idea, I'm going to combine three terms from the past, not just use momentum but also the iterate at t minus one and the iterate at t minus two, and I think it's going to work, I tried it on MNIST and it seems to work. Well, no: Chebyshev polynomials, in a sense, completely settled this question, and without theory, without intelligent people coming up with these results, we'd be wasting a lot of time, a lot of students' time. So please, let's never underestimate the role of theory.

Okay, so those were the good things, and I have maybe fifteen more minutes. So what are the bad things, the things I don't like so much about deep learning? There's a lot of hype. Hype is what we are suffering right now: people from outside the academic community writing about deep learning, commenting on deep learning; things related to general AI, how the Terminator is going to put every one of us in danger; and how this hype is affecting research agendas, this very delicate equilibrium between important questions that require many years to be solved and shorter questions that are going to give you papers at NIPS or wherever. And then there's another thing, and you could argue whether it's bad or not: the trend I showed you before, these crazy rates where every year the problem is improved by a factor of two, cannot go on forever. It's like a universal law: at some point we reach a floor where the more time you invest, the less you get in return. This is happening, and it's a problem for us in academia, but it's also a problem for you in industry. If you believe you are going to make a lot of money by applying the latest deep learning model to your whatever dataset, well, I think it's not going to work anymore. If you had done this three years ago, maybe; now it's no longer the case, and that's a consequence of the fact that things have gone very fast.

So what do I mean by the hype? Remember the first slide: what we do is take a bunch of layers, put them together, get data, compute gradients, and hope for the best. It's not so hard to do, not so hard to start playing with how many layers, how many filters, and so on. This is good in the sense that many people can get involved, even without a PhD in math. But I think it's also not so good, because it puts too many people on the wrong problems. It's not just about tuning and getting 1% better on ImageNet; I don't think that's what the smartest people should be working on, and sometimes it's hard to convince people to work on the right problems.

Another thing I don't really like is that in deep learning we are now being reviewed differently from the way research is traditionally reviewed. Those of you used to journal or conference papers know you are always a bit stressed: you send the paper, a reviewer reads your stuff, is very critical, tells you that you have to do this, this, and this; you do it, you send it again, and if you are a bit lucky you get accepted. It takes time and it takes effort, but it's necessary: it really separates what is good from what is not so good. Now, do we have this in deep learning? It's not so clear. We now have the media taking papers from arXiv. arXiv, for those of you who are not familiar with it, is a repository: you have a paper, you put it on arXiv, and it's visible to everyone the next day. I will come back to this later; it has very, very good aspects and it has helped us a lot, but it also has dangers. You put the paper on arXiv, it has a catchy name, you put a nice word in the title so it catches attention, maybe it comes from a nice lab, from DeepMind or Facebook or some other nice place, and then, boom, some journalist reads it: oh, that could make a very nice article. So you get titles like these, with a beautiful lady playing with a robot, and of course it's a game: once you're in this game, the companies have their PR people trying to take the most advantage of it, and those of us in academia are of course never going to have a PR person in our department. So that's something I think is bad and we should be aware of, and I'm not going to say more.

Going back to arXiv: arXiv is very good, as I said, and I wanted to relate it to a very well-known dichotomy in software engineering, the model of the cathedral and the model of the bazaar. I don't know if you've heard about it. There are essentially two ways to build nice things, and it has been applied to software. What is the cathedral? The cathedral is a model where, before releasing anything to the public, I make sure it's perfect, pristine, super beautiful: all the code runs, no bugs; I show no one anything until it's ready. That model was, in a sense, followed by some very well-known software projects like GCC, the compilers, Emacs. I don't like Emacs, I like vi, so I don't know if you are using Emacs here, but don't. And then there's the bazaar. The bazaar is a bit anarchic: everyone sets up a store; you need more space, let's put a road over there; no guidance. It's a scheme where the code is developed as you go, in the open, and of course it has advantages: you can react much more often, you don't have to wait ten years until the cathedral is done. I don't know how many years the Sagrada Família has been under construction, but you can see the difference. One model says: I will wait a long time until it's perfect and then show it to the public. The other says: I don't care about mistakes, I don't care about bugs, and maybe the bugs will be found more easily if everyone is aware of them, if everyone can start using the code, and then they get fixed. This is the model used, for example, by Wikipedia: you keep adding pages, people keep fixing the pages, and it works very well. So, in a sense, the traditional model of publishing is the cathedral: you send the paper, it gets reviewed for a year, now it's perfect, publish it, and when you read the paper it has gone through this process. arXiv is more like the bazaar: maybe you put up three papers, two of them are wrong, you pull them back, one survives, and you get this interaction where people communicate ideas much faster.

So my question is: you might have your arguments for believing one model is better than the other for software engineering, but is research the same as software engineering? Does whatever applies to one apply to the other? I don't have a definitive answer; I think we need some balance, and right now this community might be a little off balance: there's not enough rigor in the way papers are being produced. That's what I think, anyway. The other thing I'd like to add is that we belong to a machine learning community, and I don't know who here is an active researcher in it, but it is a community driven by conferences, not journals, and we now have three major conferences a year: NIPS, which by the way is in Barcelona this year, ICML, and ICLR. So you are pushed to produce three papers a year, and that's the question: do you really have three papers a year that are worth publishing? Some groups might. Michael Jordan, for example, my colleague, has twenty students, all of them super good, super brilliant; of course you're not going to limit him to three papers a year, he needs a larger quota. But maybe there's a model we can come up with that tunes down the number of papers a little. And of course the other important thing is: as a community, what can we do to make people work on long-term ideas, the ideas that are not going to get you an article in the Guardian or on the BBC, but that are a bit more fundamental? Everyone chooses to work on what they believe in, and I would be happy to see more people working on long-term ideas.

Something else I talked about is this law of diminishing returns. I told you that object classification is reaching its limit, and some other problems are going to reach that limit as well. What happens is that people are like a swarm of bees that sees a very sweet pot: everyone goes there until the pot is nearly empty, they see there's no more low-hanging fruit, and they move to another problem, even though the pot is maybe not completely empty, the problem not completely cracked. You get this phenomenon where the whole community is working on the same problem at the same time. For example, in vision you can still see papers like this one, which I really like; it's by a student of mine, actually, who was following my class this year, very nice guy, very nice paper. What he's training is a network that is able to figure out how to fill in an image: you take a picture, remove the center, and try to fill it in. Try to do this at home; we are completely unable to do it by hand, and this algorithm does it pretty well. Sorry?
It's pretty unbelievable. And again, it's something we got in a period when this problem was maybe not what the community cared about: "oh, vision is solved, let's move on to another thing." No, there are many things that are still pretty cool. And he has a very nice advisor, who is actually not following the hype so much, so thanks to him we now have this cool paper. Another thing, more of a personal note: sometimes you feel you don't have time to think about problems, because someone else is going to publish on arXiv before you have time to solve them, so you have to go fast. I don't want to complain about it; it's more of a debate question. If you work in a field that is very crowded, you have to accept the game: there are many people working on the same area at the same time as you, so you have to live with a certain amount of stress, you have to go fast, and going fast in research, I'm not sure, is the best. Okay.

What about the ugly? With this I will finish, five or ten minutes. A couple of things I wanted to point out; ugly in the sense that these things are fixable, maybe more easily than the others. The first one is the notion of reproducible research. That's a concept that is not particular to deep learning or machine learning; it applies to science generally. You have probably seen all these papers, and there are very famous examples: people claiming smoking doesn't affect cancer, or tests like the one in that very nice comic where eating jelly beans of a certain color causes cancer. Completely crazy, but this actually happens. If you run one experiment that is nonsense, it's probably not going to give you the answer you are looking for; but if you run a hundred experiments that are nonsense, what is the probability that one of them, just by chance, gives you the crazy answer? Do the math: at some point, eventually, there will be an article, there will be a connection that is completely spurious, just because we are not rigorous with the statistics. What I mean is: anything you do that involves data, the output of your algorithm, is a random variable. The predictions, all these numbers I showed you before, are random with respect to many things: how the training set was chosen (if someone gave it to you, they made the choice for you, but that doesn't make it less random), and the algorithm itself, if it is randomized, like SGD. So what you have to tell me are the statistics of this random variable. That's something we need, and it's easy to fix.

Another thing I believe is important is being respectful, or as exhaustive as we can, with the baselines. What is a baseline? The baseline corresponds to a default. If I tell you I can do this task at 5, that means nothing; I have to tell you the baseline. If yesterday's method did it at 4.9, or at 1, that's your reference; you need the baseline to evaluate whether your model is doing well or not. And baselines are hard to obtain: they take time, they take students, and if you are in deadline mode and you have a better algorithm, the fewer baselines you try, the better your algorithm is going to look. So there is no incentive to create good baselines, and this is what happens in the field. Sorry, Oriol, if you're going to hear me on YouTube: an example of this is image captioning. There was a lot of hype on this problem, very complicated models, and when you generate text it looks pretty good; but if you do nearest neighbors using convnet features, you get essentially the same thing, the same numbers, if you use the same metric these models were trained on. If you show the outputs to humans, they still prefer the fancy models, which is great, which is awesome, but you have to know the baseline. And there were six labs working on that problem at the same time: are you going to risk missing the deadline because you want to run the baselines, while the five others publish and you don't?

The last thing: grid search, and this one I put here more as a challenge. What do I mean by that? Most of these algorithms have parameters that we train using gradient descent, and parameters that we don't train using gradient descent: for example, the number of layers in a network, or the step size of gradient descent, what's called the learning rate, or the momentum term, which also has a parameter. These are not parameters we typically tune by gradient descent, because if we did, there would be a new set of parameters controlling the new gradient descent; there's always this chicken-and-egg problem. So what do we do with these parameters? We grid them. You don't know the value of a parameter, but you might have an idea of its range: say, it should be between minus one and one. So you create a grid and you test all the points. If you have three parameters, you take all the points in that cube and test all of them, and the nice thing is that you can distribute this: one job per machine, all running in parallel, and at the end you just choose the best. That's a very nice scheme, but here's the problem: how many of these parallel jobs are you going to run? The number of points grows exponentially: the more parameters you tune, and the finer the grid, the more GPUs you need. And I think this creates a very important disadvantage. If you are in a place where you can afford to launch 60,000 jobs, and I have heard this number, 60,000 GPUs used to optimize your network: who has 60,000 GPUs? Google, maybe Facebook, but I don't, and probably none of you has access to that number. On the one hand it's great, because they are there to solve problems that couldn't be solved otherwise. But we are competing against something that is very hard to compete against, and it's very hard to say: they are brute-forcing the problem, I'm going to be smarter, I can do with one GPU what they do with 60,000. Very hard. Finding the right hyperparameters is a hard task.

With that I think I'm going to stop, because I wanted to tell you about something else but it's getting a bit late. Maybe one slide to show you what we are working on. It's actually a reaction to one of the things I said: the problems in vision seem to be either completely solved and boring, or very crowded, very risky, very stressful. One of the nice things about this is that it forces you to work on something else, not to put all your eggs in one basket, to look for other areas where you can contribute or create new things. So one of the things I've been working on with one student is what's called the algorithmic learning problem. It's a completely crazy thing. Take sorting, for example. We know how to sort numbers: I give you n numbers, you know how to sort them, and more than that, you know how to do it optimally, you know the best algorithm for it. But still, we ask: can we learn these things from data? If I give you a bunch of examples of sequences of numbers together with their sorted versions, can we learn to sort? Is it possible, and with what complexity? For sorting it can work, and there are many other tasks with the same flavor; they are in this paper that I put here. I wanted to talk very briefly about this, but I think I'm getting a bit out of time, so if someone wants to know more about this problem, because it's very crazy, just come to see me offline. So I will stop here, and of course I will take some questions. Thank you.

[Question] For certain tasks, for example computational advertising, where might it pay off to use deep learning instead of a simpler algorithm? We saw earlier that for certain tasks nearest neighbors can already get you there, so where would you try deep learning on problems currently solved with classical algorithms?

That's a very good question. What I told you is that, for me, the success story right now is convolutional neural networks. So where can you apply convolutional neural networks? To data that lives on some grid. Images have pixels, and pixels are arranged on a grid, so that's good. If you have a time series, like speech, or measurements from a sensor, say seismic processing if you are an oil company and you want to know where to dig your well, then I would say: have a look at these deep learning models, because you might get something out of them. If you are working with data that is completely unstructured, for example advertising: we had a very wonderful talk at the MLSS by Nicolas Le Roux from Criteo, who works on the ad placement problem, and there the features are completely unstructured: you just have the username, the cookies, the browsing history. Are you going to apply a deep learning model there? I don't know; I would be a bit more skeptical. I don't know if that answers the question a little.

[Question] Any other tasks where you think it doesn't work?

Tasks that are very noisy. There are a lot of time series problems with a lot of noise in the input, and noisy inputs don't play very well with deep learning models today. We have people working on it, and I think it's a very interesting question, but there are many, many time series on which it's actually very, very hard to do better than linear regression, and linear regression is really just solving least squares. There are still many problems where simple models work super well.

[Question] You mentioned the necessity of a proper theory to justify these networks; they are wonderful in their output, and we have seen a lot of state of the art in the past few years. Which direction do you think would give us a proper justification, a proof, a proper validation?

Good question, and a very hard one. One of the things I would start by trying to do, and as a matter of fact that's what I'm working on, is to nail down the functional spaces in which you can characterize natural images or natural speech, so you can actually construct a mathematical story. You say: I define a space of functions with this property, this property, and this property; assuming my target is in that space, then a convolutional network is the model with these properties, and any model with these properties must have this form. Your hypothesis is then mathematically expressed as a set, a functional space, and you can ask: what belongs to that set? If you prove that convnets are optimal for a functional space that contains a single point, and that point you don't care about, that's a useless theory, and it's pretty easy to do. The whole point is to find functional spaces that are small enough that you can say interesting things, but large enough that they contain natural images. And that's really a long, long-term question, because we really don't know how many functions we might need there. Mathematicians have cared about this problem: for example, I don't know if you are familiar with the literature, but you have the spaces of bounded variation, which were very popular in the nineties; the idea was to include functions that are not just integrable but also have jumps, because images have jumps, edges. It turns out this space is far too large: images have jumps, but there are many other crazy things with jumps that don't look like images at all. So that space is too large; it's not adapted to understanding what's really going on. Thank you.

[Question] You mentioned that entering deep learning research is easy in the sense that there's a low barrier: there's open-source software, you can immediately train your network and start having good results. But in my experience, entering bona fide deep learning research is much, much harder than that. It seems like the narrative of deep learning research is very much owned by the large labs: if you're from a certain lab in Canada, or a certain lab in New York, you may have a much easier time publishing your results than if you're at another lab. Isn't this a problem?

Yes. Maybe I should put this in a slide, though it would be a very delicate slide to put up. You're right: once you've crossed the first barrier you can start producing results, but you haven't crossed the second barrier, where you are really saying things that are deep, profound, and between these two barriers you have a lot of people. And it's true that among those people, if you come from a good lab, you're better off. The question is: do we have this bias only in deep learning, or everywhere? It would be awesome if we didn't have it, and we do have tools for it: double-blind review is actually the best thing we have; you submit to a conference, the reviewers don't know who the authors are, the authors don't know who the reviewers are, and that should mitigate the problem. But it's true that when you go to conferences, well, it would be naive to think there's no bias like that. I think we have to leave it at that; it's a good point.

[Question] Thank you for the talk. One question on how much data deep learning actually needs. One of the criticisms is that you need large datasets to train these models. Do you think that's really inherent to how much data is needed to fit many layers, or is it more a problem of the algorithmic complexity that would be required to learn from much smaller datasets? I've heard both opinions.

I would say it's not a question of deep versus shallow; it's really a question, and I come from a department where we have all these discussions, of parametric versus nonparametric models. Nonparametrics is a very important field in statistics which says that your model should grow with the size of your data; in a sense it would be a bit naive not to do so: why would you keep the number of parameters fixed when you have more and more data available for learning? Of course, if you are in a regime with a very small dataset, you cannot train a model whose number of parameters is overwhelmingly large compared to the dataset; what you need is a strategy to limit the capacity of your model. And we have such strategies: we have ridge regression, we have dropout, we have many ways to regularize the learning problem and reduce the capacity, whether you are using one layer or ten. It's true that this is a criticism people typically make of deep learning, but it often comes from people who are not dealing with these models themselves. The fact is that there are ways to train neural networks with a very small number of parameters, with a very small number of training examples; typically they are not the best-behaved, there are other models that do better, but it's not that we are using them wrongly. It's really something intrinsic to the learning problem: they are not very nonparametric, if that was your question.
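The capacity-control point in that last answer, that regularization such as ridge regression lets you fit sensibly even when you have barely more samples than parameters, can be sketched numerically. The toy data, dimensions, and regularization strength below are my own illustrative choices:

```python
import numpy as np

# Toy illustration of capacity control: with barely more samples than
# features and sizable noise, ordinary least squares has huge variance,
# while ridge regression (L2 regularization) limits the effective
# capacity of the model and generalizes better.
rng = np.random.default_rng(0)
n_train, n_features = 60, 50
w_true = np.zeros(n_features)
w_true[:5] = 1.0                               # only 5 informative features
X = rng.normal(size=(200, n_features))
y = X @ w_true + 2.0 * rng.normal(size=200)    # noisy targets
Xtr, ytr = X[:n_train], y[:n_train]
Xte, yte = X[n_train:], y[n_train:]

def ridge_fit(X, y, lam):
    # Closed form: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(w):
    # Held-out mean squared error of a weight vector.
    return float(np.mean((Xte @ w - yte) ** 2))

w_ols = np.linalg.lstsq(Xtr, ytr, rcond=None)[0]   # unregularized fit
w_ridge = ridge_fit(Xtr, ytr, 10.0)
print(mse(w_ols), mse(w_ridge))    # ridge should generalize better here
```

The same trade-off is what dropout and other regularizers buy in deep networks: they shrink the effective capacity so that a large parametric model can still be trained on modest amounts of data.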