I would like to thank you for this kind invitation. It is a privilege to be hosted here, in such a prestigious house, and a pleasure to be with you. I am here to talk about inference in probabilistic graphical models, in particular higher-order models, and the optimization methods that make this inference possible. This is joint work with Nikos Komodakis, with Sebastian Viego, with a colleague who is now an assistant professor in London and was also a PhD student of mine, with Chaohui Wang, who is an assistant professor at the University of Paris and also a former student of mine, with Xianbo, a research engineer at Amadeus, and with Enzo Ferrante, a post-doctoral fellow at Imperial College London. I am grateful to our sponsors. First I would like to say that I am not an AI person. It is quite striking how, within the past five years, we ended up having so many experts in artificial intelligence and data science. This did not exist back in 2011 or 2012, and now everybody claims to be an AI expert. I am a computer vision person doing applied mathematics — my background is more applied mathematics — and I do not consider myself to be an AI person yet. So here is the outline of my presentation. I am going to start by explaining how we usually solve mathematical problems through inverse modeling. I will introduce the foundations of probabilistic graphical models and how these models can be used to address inference in a number of different domains. Then I am going to talk about the most challenging part, which is how, on these models, we can still do inference when we want to introduce very complex interactions and account for data of different nature and models of different complexity — I am going to talk about optimization. Finally, I will show some examples of how these methods can be used to efficiently solve very challenging problems in computer vision, which I think is one of the domains that currently makes the heaviest use of machine learning and AI methods. The way mathematicians usually approach a given problem is fairly standard: we are given some data and a task, which is to extract something from this data. The most common way of solving this is to parametrize what we would like to estimate: we introduce a model whose parameters are the quantities we want to estimate. Once we have introduced this model, the next step consists of associating the model with the data, that is, the observations. Then our objective is fairly standard: find the instance of the model which best explains the observations we have, given the model we have chosen. So this is a fairly standard mathematical procedure: you have your problem, you define your parameter vector, you associate the parameter vector with some magic cost function F, and then you minimize it through some kind of traditional procedure. If we look at the current trends in the field of machine learning or data science, there are two different ways of looking at this problem. The first is what we call data-driven methods — and I think there will be a lecture on this at some point this afternoon — and data-driven methods do not assume a model.
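Before moving on to the data-driven view, it may help to write the model-based procedure just described compactly (my own notation, not the speaker's slides):

```latex
% Generic inverse-problem / model-fitting formulation (illustrative notation)
\hat{\theta} \;=\; \arg\min_{\theta \in \Theta} \; F\big(\theta;\, D\big)
```

Here θ is the parameter vector of the chosen model, D denotes the observations, and F is the "magic" cost function measuring how well a given model instance explains the data.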
So the objective is not to interpret your data through some kind of model; the objective is to be able to reproduce the behavior of the data. You are given a set of data, most likely associated with some annotations, and your objective, given this data and the annotations provided by the experts, is to define a kind of black-box strategy — a sequence of operators — such that once you have passed your observations through this kind of network, you are able to reproduce the behavior of the data. This is what we call data first, and it consists of putting together highly complex processing pipelines which will not allow you to understand the data to the extent of getting some reasoning out of it, but will be able to reproduce its behavior. So once you have learned this kind of model, what you should be able to do is reproduce this behavior on your data. This is the major trend, the major hype, in computer vision, artificial intelligence and data science: deep convolutional neural networks. They work extremely well — that is, I think, the reason for their huge success — and the idea is data-driven methods, data first, everything is in the data. The second approach, which is more traditional, is what we call generative probabilistic models, and in this case the focus is quite different. The idea is not only to be able to reproduce the behavior of your data, but also to be able to explain the data. What you would like is a physical model which, given a set of observations, should be able to explain why the observations behave the way they behave. So this is a model-driven approach, where the objective is fairly different: it is not only about explaining what you are observing, but about having some kind of physical interpretation of your solution. This is something that was used a lot before — model-based approaches. The aim is fairly similar to the data-driven case; the difference is that at the same time you now have to come back with a plausible model which explains your observations. So you have a distribution, which can be any complex distribution: you have some likelihood for the different solutions, you have some prior, and then you would like to maximize the posterior. This is what we call model-based, generative approaches. To be more concrete, let me give you an example, which concerns all of us: heart attacks are the leading cause of death in developed countries. What we would like to do is estimate the volume of blood that your heart sends out at every cardiac cycle. This is your left ventricle. What we would like to do, to be able to measure or estimate the probability of a risk of heart attack, is to estimate this difference of volume. So this is a 3D scan of your heart, and my model will be a three-dimensional surface, which you can see here. The parameterization of the surface — the unknown parameters — are all these points. With these points I can build a mathematical model which is just a simple interpolation of these points.
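Stepping back for a moment, the generative recipe mentioned above can be written as follows (a hedged sketch in my own notation):

```latex
% Generative / MAP view: likelihood times prior, maximised over the model parameters
p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta),
\qquad
\hat{\theta}_{\mathrm{MAP}} \;=\; \arg\max_{\theta}\; p(D \mid \theta)\, p(\theta)
```

The likelihood scores how well a candidate model explains the observations, the prior encodes plausibility of the model itself, and maximizing the posterior balances the two.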
Then my inference consists of taking this three-dimensional surface and mapping it to the observations, deforming the surface such that in the end it delineates the left ventricle. Since inside the left ventricle we have blood and outside we have tissue, what we expect at the boundary are intensity discrepancies. So my cost function can be just a simple integral which measures the strength of the intensity gradient along the surface. That is one example. These kinds of approaches suffer from, or have to address, what I call the four curses — four highly important challenges. The first is what we call dimensionality. If you would like to explain very complex behaviors, usually you have to increase the number of parameters of your model. The more parameters your model has, the more degrees of freedom you introduce and the more flexible your model will be, which is natural: more parameters, better expressive power. However, whenever you increase the number of parameters, you also increase the complexity of inference: estimating this model from your data is going to be more complicated than with a simple model with very few parameters. The second concern is what we call nonlinearity, which means that the relationship between the parameters we have to estimate and the objective function we define is nonlinear. There is no direct way of measuring the impact or the fitness of a model on your observations, and when you have nonlinear functions, from an optimization viewpoint this becomes fairly complex. Assuming you have addressed these two concerns, the next one is what we call nonconvexity. In the most general setting you are going to have to deal with a huge number of variables to estimate — not as huge as in neural nets, but still very large — and your cost function is most likely going to be highly nonconvex, which means it will have a lot of local minima. Depending on how you optimize or how you initialize your solver, you may end up with what we call a non-optimal solution: you are not able to solve the problem properly and you get just one of the local solutions. Last but not least, and very important, is what we call non-modularity. If you design a cost function specific to a given problem, most likely whatever progress you make will not be applicable to another problem. That is one reason why I think deep neural networks are so successful: once you have a pipeline with very simple optimization procedures, you just change the network and keep applying it. So let us try to define the basics of what we call discrete probabilistic graphical models. A probability here is a function which measures how well a given set of observations is explained with respect to a given set of parameters. We are going to assume a very simple probabilistic model, which states that among all your parameters, the only interactions you consider are what we call pairwise. That means that, given your unknown vector X, the interactions within this huge unknown vector space are constrained to pairwise interactions only.
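In symbols (my notation, not from the slides), the pairwise assumption just stated amounts to a factorization of the form:

```latex
% Pairwise (Markov / conditional random field) factorisation over a graph G = (V, E)
p(\mathbf{x}) \;=\; \frac{1}{Z}\,
\prod_{p \in V} \psi_p(x_p)\,
\prod_{(p,q) \in E} \psi_{pq}(x_p, x_q)
```

where Z is a normalization constant, the nodes V carry the unknown variables, and the edges E encode which pairs are assumed to be correlated.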
That means that only two variables at a time are assumed to be correlated. You may have plenty of interactions, but the only thing you assume is that they are pairwise — that you do not have what we call higher-order interactions. Once you have this very simple probabilistic model, you can express it as a graph, where the nodes of the graph correspond to the parameters you have to estimate and the connectivity of the graph corresponds to all the constraints you have about correlations between variables. So this can be seen as a probability map where you only have constraints or correlations between pairs of variables. What you can do to simplify things even further is to discretize. The idea is that you are not looking for a solution in the continuous space of the probability function you have defined; you are looking for a discrete solution. If you have some prior on where the solution space lives, you discretize it, and you assume you just have to choose one solution among the ones you considered when discretizing the solution space. Once you have defined this very simple probabilistic graphical model involving only pairwise correlations, a very popular model in computer vision and beyond is what we call Markov Random Fields, or Conditional Random Fields. The idea is that, up to a normalization factor, maximizing the posterior probability becomes equivalent to minimizing a cost function which involves two terms. The first is the unary term: for every node and for every label, it estimates how well this node taking this label can be explained by the observations, by the data. The second, pairwise term takes all the pairs of variables which are connected and tries to impose consistency on the solution you are choosing, according to the correlations you have put in the model. So you map your probabilistic graphical model to a graph, you discretize your search space to a discrete label set, and you define a cost function in which one term depends only on a node and a label and measures how well the data explain that label, and the second term imposes the correlations or constraints you have between variables. This is something that has been around for almost 30 years now and has been used in a number of fields. Now, if you have this kind of very simple cost function — assuming that the graph can be huge and can have arbitrary connectivity, meaning you can have as many connections as you like — the question is how you solve it. When you look at this equation from a mathematical viewpoint, it looks very simple: you have two terms, one sum and a second sum, so you might expect it to be really easy to solve. It turns out that it is not as easy as it looks, and the complexity of the solution depends on how you define the pairwise interactions: how you define the connectivity of the graph and how you measure whether or not two labels are consistent.
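To make the two-term cost concrete, here is a minimal sketch (my own illustration, not code from the talk; the function name and array layout are assumptions) of how such a pairwise MRF energy is evaluated for a candidate labelling. Minimizing this energy over all labellings is the hard part; the evaluation itself is cheap.

```python
import numpy as np

def mrf_energy(labels, unary, pairwise, edges):
    """labels:   (N,) integer labelling, one label per node
    unary:    (N, L) array, unary[p, l] = cost of giving label l to node p
    pairwise: (L, L) array, pairwise[a, b] = cost of labels (a, b) on an edge
    edges:    iterable of (p, q) node-index pairs = graph connectivity"""
    e = unary[np.arange(len(labels)), labels].sum()             # data (unary) terms
    e += sum(pairwise[labels[p], labels[q]] for p, q in edges)  # consistency (pairwise) terms
    return float(e)

# e.g. a 3-node chain with 2 labels and a Potts-like penalty:
# mrf_energy(np.array([0, 1, 1]), np.zeros((3, 2)), 1 - np.eye(2), [(0, 1), (1, 2)])
```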
If you have a linear function — very simple interactions, which essentially never happens — then there are algorithms which guarantee that you get the optimal solution, which is great. When you go to something more complex, namely metric functions, you can still get a good approximation of the global solution. But in the most general case, for arbitrary graphs and arbitrary problems, you have what we call arbitrary interactions, arbitrary pairwise terms. In that case you have no guarantee on the quality of the solution, and it becomes really hard to optimize your problem. So what we would like to do is move as far as we can along the x-axis while staying as low as possible on the y-axis: we want to be able to solve graphical models involving pairwise interactions as well as we can, while keeping the ability to introduce very complex interactions, which no longer have to be pairwise. If we look at the literature — as I mentioned, this is a problem that has been around for 30 years in vision, and even longer in networks — at the very beginning people tried to solve these problems using what we call local iterative methods; these were the solutions back in the 80s. They were extremely efficient in terms of computing time, because at every step you only do a local update of your solution, but they did not provide any guarantees on the quality of the solution. Then a major breakthrough came to the community in the late 90s through the introduction of two algorithms that were known in other fields, which is kind of funny: we got the max-flow/min-cut principle from the network community, which had been known since 1965, and we got another algorithm, belief propagation, from physics. These are two algorithms that can solve, very efficiently from a computational viewpoint, highly complex graphical models, and at the same time they provide good guarantees on the quality of the solution. So we can consider that if your models are pairwise, you can usually do a very good job. However, in reality life is more complex, and what we would like, to do what we now call artificial intelligence, is to be able to introduce arbitrary sets of interactions. Given an arbitrary graph, we would like to have what we call hypercliques, which means extending the strength of probabilistic graphical models to any imaginable interaction between nodes. Every hyperclique can be an arbitrary collection of nodes, and the function that evaluates whether the correlation within this collection is satisfied can be an arbitrary function. So this is a hypergraph, and it is exactly the same idea as before. These are the parameters you have to estimate, the unknown vector x; we again have a discretization of the space with a label set L; but now, instead of having only simple pairwise interactions, you introduce any subset of variables interacting together. At the same time, we have a similar cost function: the first term measures how well a node can be explained by a label, and the second term, for any subset of variables considered to be a hyperclique, looks at whether or not the correlation is satisfied. It is exactly the same thing as before, but we are extending the notion of connectivity to hypercliques and to arbitrary interactions — that is the idea. Now, from what I explained before, the complexity of a graphical model depends essentially on the pairwise terms, and we said that with nonlinear functions things are already complicated. Here you are adding higher-order terms and arbitrary functions, so it becomes a much, much harder optimization problem. Once you have defined this problem, there are two ways of solving it in the community. The first is what we call reduction methods, which consist of taking a higher-order clique and mapping it into a number of pairwise constraints. You introduce artificial nodes into your graph, and then you express the higher-order constraint as a sum of pairwise constraints involving consistency variables, which impose that a node appearing in more than one pairwise clique gets the same label. So reduction methods take your graph and map it to a pairwise graph. This can be done for a number of hypercliques, but not for every hyperclique. If the reduction is possible, you can use any existing pairwise algorithm: you map your problem to a pairwise one and then apply the pairwise methods we discussed before. However, this explodes the number of nodes in your graph, because you have to introduce artificial nodes to satisfy the constraints, and it is not applicable to every problem — it depends on how you define the hypercliques. So now I am going to introduce what we call dual decomposition and how these methods can be used to solve optimization on an arbitrary graph with arbitrary hypercliques. The idea of decomposition is not new; it is a very successful technique in the field of optimization, and the underlying principle is fairly simple. Instead of solving a highly complex problem, which most likely you are not able to solve, how about decomposing it into a number of subproblems — much smaller problems with fewer variables — solving every subproblem independently, and then coming up with a strategy which guarantees that all the solutions converge to the same, unique, optimal solution? You take your original problem, which you do not know how to solve, you map it to an arbitrary number of subproblems, you solve each subproblem any way you like, and then you try to introduce some kind of consistency on the decisions taken for the subproblems, so that the global solution is consistent. That is the idea. To be more explicit before going to the math, I will use something fairly didactic: this is my problem, this is my master — the mathematical framework that tells me how the subproblems should be solved and how the solutions should be updated to get convergence — and every subproblem is then solved independently, with whatever optimizer you like.
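For reference, the higher-order energy these decomposition ideas target can be written (my notation) as:

```latex
% Higher-order MRF energy: unary terms plus terms over hypercliques c (arbitrary subsets of nodes)
E(\mathbf{x}) \;=\; \sum_{p \in V} U_p(x_p) \;+\; \sum_{c \in \mathcal{C}} H_c(\mathbf{x}_c),
\qquad x_p \in L
```

and the decomposition idea is to split E into sub-energies, E(x) = Σ_i E_i(x), each of which is easy to minimize on its own.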
So the idea of decomposition consists of defining a rigorous mathematical framework which tells you how to cleverly combine the solutions you get from the different subproblems, such that in the end you obtain the optimal solution. If this process converges, you have a guarantee that you are getting the global optimum of your energy. That is the idea. Let us introduce some very basic mathematics. Assume you are given a sum of functions, the φ's here, and assume that for every such function you know how to solve it: you know how to find the x that gives the minimum value of that function. The fact that you know how to minimize every function individually does not tell you much about the sum, because most likely the x that minimizes the sum has nothing to do with the individual x's that minimize each instance of φ. So here is what we do. The first step is to introduce replicas of the unknown vector x: we use multiple copies x^i, where i indexes the φ function, of the original variable. We replicate the unknown vector as many times as the number of subproblems — here the subproblems are the φ's. Once you have done that, your optimization problem no longer consists of finding the optimal x only; you want to find the optimal x and the optimal x^i's, because we have augmented the variable space, such that the sum of the φ_i(x^i) is minimal. But if I do only that, there is a trivial solution: for every φ I take the x that gives its minimal value. So, on top of that, I impose the constraint that, once the problem is solved, every x^i must be equal to x. If you can minimize this function with respect to the x^i's and x such that in the end every x^i equals x, it is equivalent to solving the original problem. So solving the original unconstrained problem is equivalent to solving this constrained problem, where we have multiplied the number of unknown variables by the number of slaves, or subproblems. That is the idea. How can we solve this constrained optimization problem? We do something that is also very well known in mathematics: a Lagrangian relaxation. It consists of taking the objective function I had before, which was here, and augmenting it by penalizing the distance between the individual solution of every subproblem and the solution of the entire problem. You want to minimize your cost function as before, and a way of imposing the constraint is through Lagrange multipliers, which penalize deviations of the individual solutions from the global one. The multiplier λ^i depends on the subproblem: for every subproblem, depending on how far you are, you have a Lagrangian term which tries to impose that, in the end, the local solution converges to the same global solution. That is the idea. Now, if I do some very simple algebra, expanding the multiplier term, I am going to get two terms.
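In symbols, a hedged sketch of the construction just described (notation mine):

```latex
% Original problem and its consensus reformulation with per-subproblem copies x^i
\min_{\mathbf{x}} \sum_i \varphi_i(\mathbf{x})
\;\;=\;\;
\min_{\{\mathbf{x}^i\},\,\mathbf{x}} \;\sum_i \varphi_i(\mathbf{x}^i)
\quad \text{s.t.} \quad \mathbf{x}^i = \mathbf{x} \;\;\forall i
% Lagrangian relaxation of the coupling constraints:
L\big(\{\mathbf{x}^i\}, \mathbf{x}, \{\boldsymbol{\lambda}^i\}\big)
= \sum_i \Big[\varphi_i(\mathbf{x}^i) + \boldsymbol{\lambda}^{i\top}(\mathbf{x}^i - \mathbf{x})\Big]
= \sum_i \Big[\varphi_i(\mathbf{x}^i) + \boldsymbol{\lambda}^{i\top}\mathbf{x}^i\Big]
\;-\; \Big(\textstyle\sum_i \boldsymbol{\lambda}^i\Big)^{\!\top}\mathbf{x}
```

The last line shows the two terms discussed next: a per-subproblem (local) term and a global term coupling all the multipliers to x.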
There is one term which depends only on φ_i, x^i and λ^i. This is a local term: it requires no knowledge of what is happening in the remaining slaves, the remaining subproblems. And there is a global term, which needs the λ's of all slaves, multiplied by x. Since x is otherwise unconstrained in my Lagrangian, this term must vanish — the sum of the λ's must be zero — because otherwise there is a trivial solution with x going to infinity. So what I get is that minimizing the Lagrangian consists of minimizing functions that depend only on local variables: for every subproblem, if I know the λ that corresponds to its constraint, I can work independently. Why is this great? Because it gives me a way of updating the subproblems from one iteration to the next such that convergence is enforced over time. So the problem is now decoupled. This is our cost function, and for every subproblem i we have a truly decomposed function — a function that depends only on the current slave — and the objective of the master is to optimize this over all the λ's. That means the master, the global problem, must find the right λ^i for every subproblem. Every subproblem needs as input only its own multipliers, and the job of the master is to find the best multipliers such that, overall, the solutions converge to the same global optimum. You can do that by applying a subgradient method on this function g of the λ's, which gives a rule for updating the multipliers of every subproblem. I will skip the details, but the principle is the following. At the very beginning there are no constraints: every subproblem is solved independently, so you produce one solution per subproblem. These solutions go back to the master, which measures the discrepancy between them, and the update of each λ depends on how far the corresponding solution is from what we call the mean (or median) solution. At every iteration, this Lagrangian subgradient method dynamically changes the subproblems such that the distance between each subproblem's current solution and the mean solution over all subproblems is reduced. At every step you push the solution of every subproblem closer to the mean solution of all subproblems, which acts as the global solution. That is the idea, and it applies to any decomposition, any graph structure. This is the illustration I was describing before: at the very beginning the λ's are set to zero — no constraints; then you solve the individual minimizers with any solver you like (it does not have to be a discrete method, it does not have to be a continuous method); once you have solved the individual problems, you send the solutions, the x̄^i here, back to the master, and the master measures the discrepancy with respect to the mean solution and updates every λ.
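A minimal, illustrative sketch of this master/slave subgradient loop (my own simplification, not the speaker's code; it assumes each slave exposes an oracle that minimizes its sub-energy plus a linear term in the multipliers):

```python
import numpy as np

def dual_decomposition(slaves, n_vars, n_labels, n_iters=100, step0=1.0):
    """slaves[i](lam) must return the minimising labelling (int array, shape (n_vars,))
    of sub-energy i augmented by the linear term sum_p lam[p, x_p]."""
    lams = [np.zeros((n_vars, n_labels)) for _ in slaves]   # one multiplier set per slave
    mean = np.full((n_vars, n_labels), 1.0 / n_labels)
    for t in range(n_iters):
        # 1) every slave solves its own (small) subproblem independently
        sols = [slave(lam) for slave, lam in zip(slaves, lams)]
        # 2) the master forms the "mean" solution (as 0/1 label indicators)
        ind = [np.eye(n_labels)[s] for s in sols]
        mean = np.mean(ind, axis=0)
        # 3) each slave's multipliers are pushed according to its disagreement
        step = step0 / (1.0 + t)                             # diminishing step size
        for i in range(len(slaves)):
            lams[i] += step * (ind[i] - mean)                # subgradient update (sums to zero)
        if all(np.array_equal(s, sols[0]) for s in sols):    # all slaves agree: consensus
            break
    return np.argmax(mean, axis=1)                           # consensus labelling
```

In practice the slaves would be, for example, tree-structured sub-MRFs solved exactly by dynamic programming, but any minimizer works.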
If the current solution of a subproblem is far from the mean solution, the update of λ will be large, which means you strongly enforce the constraint; if it is close, the update is essentially zero. You keep iterating until convergence, and here you can find the papers. That was the generic principle. Let us now go back to what we had before, the probabilistic graphical model, where there are no obvious subproblems. What we can do is re-express the problem as what we call a linear integer programming problem, which consists of introducing artificial variables — the linear programming variables. Here x_p(l) is a zero-one variable meaning that node p is assigned label l, and there are similar zero-one variables for pairs. This formulation is equivalent to the previous one, but we have to introduce some constraints. You do not have to follow every detail — it is a somewhat technical construction — but the idea is that you can take any arbitrary graphical model and express it as a linear integer programming problem by introducing these artificial variables, together with constraints ensuring that the solution of the integer program is the same as before. Once you have done that, you have your linear integer programming problem and you can apply the principle I explained before. That is the idea. Any questions? Perfect, so I can skip that. Now let us see the interest of these methods. Assume you are given an optimization problem of this type where, in simple words, red means good and black means bad: I know how to solve the red part of the problem very well, but I have no idea how to solve the black one. If you try to optimize the whole problem at once, most likely you will not manage, because the black parts dramatically increase the difficulty of the problem. So we do the following. We take the two parts which are complex and make inference problematic, and we separate them: these become two slaves corresponding to the complex parts. At the same time, we take the part we know how to solve and make it another slave. Since we know how to solve that part — it is a simple MRF — we can apply any method we like, any method you can imagine. The complex parts can be attacked with very sophisticated inference methods, and this is now possible because the size of each subproblem is significantly smaller than the original one; and you can do the same thing here. So the idea is that you take the challenging parts of your optimization problem, you turn them into what we call subproblems, or sub-cliques, and you solve them with powerful methods — now feasible because of the reduced dimension — while the regular subproblems are handled with methods you already know. Once you have done that, you can get a solution which most likely will be much better than what you would obtain by attacking the original problem with any single optimizer. And then you can think about how to decompose your problem: the way the decomposition is done will affect the convergence rate and the tightness of what we call your relaxation. That is the idea. Okay, let us now look at some examples. The first thing I am going to talk about is what we call image completion. This is a computer vision problem that is considered to be solved. Given an image, you introduce a mask — here, a person you would like to remove from your image, because it can be your ex-boyfriend or ex-girlfriend, or your ex-father-in-law — and using this mask you would like to re-introduce content that makes the image look natural. The same problem appears in what we call texture synthesis: given a small sample, you would like to create a much larger image in which the sample is repeated and assembled consistently, and people now do this for videos and many other applications. What is the idea of image completion? The idea is to look at the existing content and try to find information which you can map consistently onto the borders between what you have to fill in and what you keep — so you look for similar content in your image domain. How do we cast it as a graph? Assuming this is the area you would like to fill in, we look all over the image domain for patches which we will position inside this area where there is no content. The way we position these patches is fairly elementary. When you position a patch over an area where content already exists, you want a measurement which tells you that whatever you are overlapping makes sense: the content you bring is consistent with the content that is already there. And when you bring information into areas where there is no existing content, like here, you want the patches you are superimposing to be consistent with each other. The nodes correspond to the pixels where information has to be filled in, and the labels correspond to these patches. The problem is that you need millions of patches to fill in the information, so it is a really high-dimensional problem, and you can solve it very efficiently with what I explained before. If you do that — and here are some examples and publications — assuming you do not like elephants, you can get rid of them; you can remove this person too, and you can keep doing it for palm trees, even for the father-in-law again, or for bridges. It is a fairly standard procedure now; you can even find it as a plugin in Adobe products and, I think, in PowerPoint. But it is a good illustrative example. A second example, where higher-order connectivity is very important, is understanding brain tumor behavior. Glioblastoma-type tumors are among the most aggressive brain tumors; the life expectancy after such a diagnosis is usually 12 to 24 months. The problem with these tumors is that for a while they have a rather slow, linear behavior — the tumor hardly progresses — and then, at some point, the growth becomes exponential, and within a few weeks you end up with a huge tumor where not much can be done. So what we tried to do is understand how these tumors are correlated and how they are positioned in the brain. We took a lot of training examples where we had the tumors, and we tried to see whether there are correlations between tumor characteristics — texture, tumor volume (size), tumor aggressivity (curvature) — and we defined a high-dimensional graph where every node encodes how similar two tumors from different individuals are. The problem is that with this high-dimensional graph and all these connections there is not much you can infer directly. So we defined a higher-order graphical model whose objective was to take all these connected nodes and come up with a much simpler, compact graph which explains how glioblastomas appear in the brain and how they evolve depending on tumor characteristics such as texture, volume and aggressivity (curvature). It was a higher-order optimization problem which we were able to solve, and we ended up with something consistent with anatomy. First of all, we observe symmetry: there is no preference for the left or right hemisphere of the brain. We also observe that most of the tumors appear in the frontal lobes, which we know from anatomy, and we observed that adding more data did not change this network. So this is a first kind of functional map explaining how glioblastomas appear in the brain, and it can be very helpful: if you want to detect such tumors, you can use this network as a strong prior, which helps you detect them early, even when they are not aggressive and not clearly visible, and then keep following them. Once the tumor has been identified, follow-up is much easier, because you can do regular exams focused on these areas and detect when the exponential growth is starting. Another problem we worked on, with similar ideas, is lung nodule detection for lung cancer. Again, the prognosis is poor — roughly 80% of patients do not survive long, because the disease is usually detected at a very advanced stage. The nodules are tiny, tiny structures that are very hard to see, so radiologists will often not be able to see them, and these are the results we are getting. What we do is combine multiple anatomies and transfer learning: the idea is to take a lot of patients, map them all into a common space, and once you have done that, learning is much easier and works much better. Then, with partially labeled data, you can get performance that is actually better than humans. Radiologists do not do as well as automated software, for a very simple reason: this is a 512 × 512 × 400 volume, which means the radiologist has to check every slice of the volume, and even a very efficient radiologist will get tired at some point and miss information. Let us move to another example we worked on, again with the idea of higher-order graphs: the goal was to guide a surgeon during surgery.
Usually, for surgery, you have pre-operative data — highly annotated, high-quality data such as MR and CT scans — and during surgery, once the patient has been opened, the anatomy changes, so very often what you do is ultrasound imaging, which allows you to position the data you are acquiring with respect to what you used for planning the surgery. We put in place an algorithm which was able, in real time, to find within the pre-annotated, high-resolution data the corresponding slice, in order to help guide the surgery. I will skip the details because we do not have time and go to something more visual. Of course, the results, like in any other paper, were great. Next, I am going to talk about higher-order graph matching. This is a problem used a lot in what we call image and multimedia retrieval, and also a lot in cinematography. The idea is that you are given two shapes, related through what we call non-isometric deformations, and you would like to find, in an automatic way, the correspondences which associate every node of the first shape with a node of the second shape, under non-isometric deformations. The problem can be cast as a very interesting higher-order graph optimization problem. If I take my two shapes, I can map them to a single plane, just to facilitate the understanding of the method. What we are looking for are connections between a given node of the first graph and a given node of the second graph, meaning that these two nodes correspond — that is the idea. How can we cast this as a higher-order optimization? First, we introduce Boolean variables between every node of the first graph and every node of the second graph: zero means no correspondence, one means correspondence. Given these Boolean variables, you can take geometric information and measure whether it is similar; this gives an idea of whether two nodes correspond. The second thing you can do is take pairs of nodes and, given these pairs, measure a Euclidean distance, or any other distance you can imagine; the constraint you introduce is that the distance in the first graph should be similar to the distance in the second graph once you have estimated the right correspondences. And you can keep going to more complex interactions: you put more and more constraints, looking at subsets of nodes and imposing that they have the same characteristics in the first graph and in the second graph. If you do that, you can get these results in a fully automatic way, and you can go even further and obtain these dense, detailed correspondences. The last problem I would like to talk about, which puts together machine learning and reinforcement learning, is work we did on parsing images with grammars. I guess all of you have been to Paris at some point. This is a typical Haussmann building, which accounts for about 65% of Paris buildings; they are amazing, very beautiful buildings. It turns out that architects use a notion of grammar, meaning that the way they design the buildings is always consistent. There is a generic way of describing these kinds of buildings — this is something we learned from architecture books — which is the following. At the very beginning, you split your image into the shops and the rest; then you split the rest into sky and floors; then you keep adding floors, and then you keep adding windows. So the grammar is a very simple thing: at the very beginning you split into three parts, then you introduce the notion of floors, then the notion of windows, then the notion of balconies. Once you have done that, you can generate any building in Paris. If you do that, given a footprint — that is the only thing we have — you are able, with a very simple grammar, to obtain this very complex building, and this is used a lot in computer graphics: people use what we call procedural grammars to generate buildings. If you apply the same thing to a small district, you get a typical Parisian district; the only things you need are the footprints and a very small number of rules. That is the synthesis viewpoint; what we would like to do is the opposite: given an image, which is shown here, find the grammar which is able to explain this image. Given just a photo taken by any user, you would like to figure out which derivation of your generic grammar is able to produce this kind of building. This is a very complex problem, for three reasons. First of all, the number of unknown variables is not fixed: it depends on how many derivations you apply to your grammar, and this defines the number of variables — that is the first difficulty. The second is that it is a dynamic problem: the choices you made in the previous iterations massively affect the final outcome. And the last one is that you have at the same time discrete variables — the derivation sequence — and continuous variables, things like the window positions or the size of the balconies. So what we put in place was an inference method which tries to find the right grammar such that, when we use this grammar to generate the artificial building, it looks as similar as possible to the one we have in the observation, and this is what is shown here. There is a log-likelihood cost, shown here, which tries to optimize the rules of the grammar and the derivation sequence such that, once the derivation is applied, the final outcome is a generation as close as possible to what I am observing. So this is a typical AI problem — the same kind that was used for the game of Go, in the match between DeepMind's program and the famous Korean player. It is what we call reinforcement learning: you try to optimize the final outcome by taking local decisions. Here are some results. This is a typical Parisian building; this is fully automatic — the grammar has been defined and the inference is optimized. Of course it is robust to illumination changes, to missing parts and trees. Here you can see that even when there are things that are not visible to humans, because you are imposing this grammar you obtain global consistency. And of course, if you do not like Paris, you can go to other cities. This is New York; the grammar there is nothing advanced — it is very simple, just split and put windows — so it works pretty well. It is more complicated when you look at more ornate buildings, and this is Paris. So let me now try to wrap up. I have worked a lot on what we call discrete graphical models for computer vision. The idea is to take a probability function and a model, and map them to a graph. At the very beginning we studied very simple graphs: we assumed that the parameters are only correlated in pairs, which is shown here. If you do that, you know how to solve it; this is very popular. What is becoming more and more useful, and more and more interesting, is what we call higher-order graphs, where both the modeling and the inference become really challenging. The idea is again to have a probability function, which is now decomposed over sub-cliques, such that every sub-clique can impose some kind of consistency, and we demonstrated how this can be optimized and used to solve different problems in computer vision and medical imaging. Just to conclude and sum it up: graphs are really great tools, because you can learn them — depending on the problem, you may be able to find the right graph structure and reduce its size so that you can still express your problem — so you do not have to worry too much about dimensionality. Regarding nonlinearity, graph-based optimization methods do not use any derivatives; they are discrete methods in general, so even if you have nonlinear functions, that should not be a problem. Regarding nonconvexity, what we know is that discrete methods generally do much better than continuous methods and are less dependent on the initial conditions, so you can usually guarantee better optimality properties with respect to nonconvexity. And non-modularity is not an issue anymore, because once you have defined the optimizer, the same optimizer can be used for another graph, with arbitrary cliques. What is the future?
Well, the future belongs to more data — that is nothing novel — but we are also going to need more algorithms. People are convinced that if you get more and more data, you should be able to solve all your problems. That is not true: we have to develop efficient algorithms. There are plenty of applications where black-box solutions will find it really hard to gain acceptance. For example, if you are using a deep learning algorithm which works reasonably well, but you do not know when it fails and you cannot explain why it fails, in, let us say, medicine, that is going to be very problematic. In physics it is also going to be very problematic: if you have a nuclear reactor, it is better to have 80% performance and know when and why it fails than to have 90% performance with no clue about why it is not working. So at some point, in a number of domains, we are going to need what we call human-interpretable models. That means combining the power of prediction with the ability to reason about and interpret these models, which means we will need better optimization and better learning methods; and since we are going to have more and more data, in order to optimize and solve these problems we will have to go to parallel architectures. In the long run — ten years from now — I think that, given some data and an arbitrary graphical model, we are not going to need scientists like myself anymore: you will just feed your data to a black box with a fully connected graph, and the model will learn which graph is right for your problem, what the right structure is and what the right parameters are. So we still have a job for ten years, let us say — that is the take-home message. Thank you for your attention. So, are there questions? Question: In your graphical model, how do you get the parameters theta, how do you define the cost function? It depends on the application you are solving: we try to figure out the best way to explain the solution space from your data. So it is not generic, it is application-dependent. But I think at some point — and that is the last point of my conclusions — given annotated data and a fully connected graph, you should be able to obtain automatically the right way of connecting the data with your labels and the nodes between them. For now we do it by taking into account the specific context of the application and using the data we have as observations. Question: I remember that in neural networks we have these restricted Boltzmann machines, where not everything is connected to everything else, which is basically what makes it possible to train them. But you are saying that you can do away with this restriction, that you can have everything connected to everything? So, the reason why neural networks are very popular is that they work really well, and at the same time it is exactly what you said: you are applying a very simple optimization algorithm — back-propagation, mainly — where at every step you just have to know what happened in the previous iteration. It works really well, but from the optimization viewpoint there is nothing exciting or advanced in the way we optimize them; and because it works really well, it seems that we do not need these long-range interactions. The main philosophy of deep neural networks is to keep adding more and more layers instead of looking for long-range interactions: they say, just keep adding layers, and by adding layers you do not need these long-range interactions. The methods I presented are designed to deal with what we call hypercliques: the idea is that you want to model really long-range interactions directly, and of course the longer the interaction, the more complex the inference becomes. It is not the same thing: in deep neural networks the philosophy is to keep adding layers and apply simple optimization; here the idea is to put the strength in the model itself, to try to understand which variables are supposed to be correlated and then learn these correlations, whereas deep neural networks, step by step, just try to figure out what the best outcome is. Question: How do these methods compare with neural networks, for images for example? It depends on what you would like to do. If your objective is classification — deciding, for a given pixel, the best possible label with respect to some training set — you are not going to be able to beat neural networks. If your objective is, for example, to find the best correspondences between two exams, where you need to introduce anatomy, to guarantee continuity, where you have anatomical constraints and functional constraints, then these methods are definitely the ones to use. So when you have problems where you need structure and you need human-interpretable solutions, that is where these methods should be used; for independent classification decisions, it appears that deep neural networks are, at least for now, the best possible option. Question: The work seems quite tedious, because for every problem you need to define a higher-order graphical model; you need to do this every time, and depending on the size it takes time — how do you do it? That is a very good question: can the model be learned from data? I will give you an example and then a more generic answer. For the Haussmann buildings, what we did first was to look at architecture books and define the grammar manually: a student spent six months reading books and then said, okay, this is how Haussmann designed his buildings. That was our first paper; we were very happy, it went to PAMI, a fantastic job. The second paper asked: how about trying to learn these rules from data? What you can do at the very beginning is assume very simple grammars — only binary grammars, meaning that every time you apply a very simple split operator. If you do that and you have ground truth, you end up with a huge tree, but it does the job. The problem is that the grammar is not efficient, because for every decision you are creating two nodes. So the idea, once we had that, was the following: how about taking all these trees from all the buildings we have and looking for consistent subtrees? We started with a very simple grammar, ended up with a huge number of trees, and then we did a second training step which consists of taking these very simple trees and mapping them to hyper-trees. This was automatic, and we got the second paper. If I go now to the more generic answer: I think that in the long run, if we have data, you can imagine graphs that are fully connected with an arbitrary number of hypercliques, and then, given ground truth, you try to automatically determine which subset of nodes among these highly connected graphs best explains your observations. This is what we are doing now: for any random problem, for any random graph, you will have a huge number of subproblems — since you are going to decompose it, that is not an issue — and then, from your data, you decide which ones you should keep and with what weight. The advantage of this approach is that you can encode context: the way you define the hypercliques is the context. Maybe you will have millions of hypercliques, but the way you design them carries the context, and it will be learned through training. Question: Just from a personal point of view, do you think your method for determining the building structure could also be pushed to the texture of the building? This has a strong application for us in terms of propagation, to understand whether buildings are concrete, steel or whatever. Yes. Actually, I did not have time to explain the method in detail — I think I had too much content in the presentation — but at the final derivation level of the grammar you have the texture itself. So there is a layer where you are adding texture, and you can imagine, instead of having a single texture option — we know that Haussmann buildings almost always have the same texture — introducing another layer with multiple texture options, and then fitting it directly from your observations. Question: So you can reconstruct a complete 3D model, with the whole texture? With a single image — a frontal view. You need to have the grammar, so it will not work for an arbitrary building. It will work very well for New York, because there is not much imagination there; it will work very well for about 60% of Paris. Then you will need another grammar if you have other kinds of buildings; the idea is that you put the intelligence of your method into your model. Question: What kind of pictures do you need — Google Maps, things like that? With Google Maps you can reconstruct everything; even with a good photo of the facade of the building taken with your cell phone you should be able to get the 3D model and the derivation. Okay, are there more questions? Okay, if we are done, then let us thank you.