So I have to make a terrible confession: I'm not a mathematician, I'm more like an engineer. But I'll talk about a little bit of work, some of which is mine, but most of which is not; it's work by people at Facebook AI Research and perhaps other locations as well. Since this is about graphs, I thought I would talk about graph embeddings and how they are used as a way to do content understanding. At Facebook and other places there are many different projects on this in various locations. And at the end I'll talk about a somewhat different topic that doesn't have much to do with the previous one, although it has some connections, called self-supervised learning, which is what I've focused my research on mostly over the last several years.

Okay, so let me start. One of the things that's happened over the last few years is what people now call deep learning, which is the idea that you can train a network of very simple, parameterized computing elements, with surprising efficacy, to approximate any function you want from a whole bunch of samples. That's called supervised learning. It's very effective for things like classification, but also for other types of applications. So basically you have a parameterized function. Again, reflecting the fact that I'm an engineer and not a mathematician, I put knobs instead of parameters in the formula here, but it's the same thing. You feed this parameterized function, let's say, an image, and you want to train it to turn on the red light when it's seeing a car and turn on the green light when it's seeing an image of an airplane. And of course an image is just a bunch of pixels, so you have to combine the values of those pixels in some way to figure out the function that classifies cars from airplanes.

The idea of supervised learning is that you feed it lots of examples of images of cars and lots of examples of airplanes. Every time you show an example of a car, if it doesn't produce the right output, you correct the output by adjusting the parameters using a gradient-descent-type algorithm. Very simple. You do this repetitively with thousands and thousands, if not millions, if not billions of examples. And the magic of this is that if you parameterize this function in a complex enough way, it's going to be able to classify not only the images you showed it, but also images of cars and airplanes it's never seen before. That's what's called the generalization property, which is something like interpolation or extrapolation if you want.

This works surprisingly well, if you build those things appropriately and if you have big enough computers, for recognizing speech, which essentially amounts to mapping speech signals to words or word sequences; for mapping images to categories like cars and airplanes; portraits to names for face recognition; photos to captions, where the output is not just a discrete variable but can be something complicated like generated text; or for classifying the topic of a text, and so on. And what deep learning is all about is this: there's nothing really deep about it, other than the fact that the function that maps inputs to outputs is represented by a composition of individually parameterized functions, each of which has some non-linearity to it.
And by cascading multiple non-linear functions, you can show, or at least intuitively argue, that there are certain kinds of functions that can be efficiently represented that way. I'll come back to why you need multiple layers. From a mathematical point of view, you can actually approximate any function you want with only two layers: basically one layer of linear operations, one layer of non-linear point-wise operations, and another layer of linear operations. So from the purely mathematical point of view the extra depth is not necessary, but from a practical point of view it is very useful because of the efficiency of the representation. I'll come back to this in a moment.

So the next question you might ask is: what do you put in those boxes? And there is a surprisingly simple way of representing a rich family of parameterized functions, which consists in representing any signal you have as a vector, or some sort of multidimensional array of some kind: a matrix, or what engineers call a tensor, although it's not really a tensor in the proper sense. Anyway, a multidimensional array. Then you apply a linear operator, a matrix if the input is a vector. You get another vector, perhaps of a different size, usually larger, at least at the first layer. And then you take each component of this vector and pass it through a non-linear point-wise function, such as the ReLU, which is zero for negative inputs and the identity for positive ones. It's not differentiable at zero, but that's okay for now. You can think of each of these units as performing an elementary classification if you want: if the weighted sum, corresponding to the dot product of one row of the matrix with the vector, is larger than zero, then the output is positive, basically the identity function; if it's smaller than zero, then the output is zero. That's a kind of elementary switch that determines whether something belongs to a category or not, if you want. And then you stack another layer of linear, non-linear, linear, non-linear; you can stack many layers of those.

The way you train this is basically just high-dimensional gradient-based minimization, a very simple process. You show an image, you run it through the function, and you compare the output you get with the output you want through some measure of distance, say the Euclidean distance if you want. Then you simply compute the gradient of that objective function with respect to all the parameters in the system, which are all the coefficients in all the matrices in this network, and you take a step along the negative gradient. You do this on the basis of one single sample. Then you show another sample, again compute the gradient of the function for that sample, and take a step along the negative gradient. Very simple gradient descent. That's called stochastic gradient descent, because you don't actually get an accurate estimate of the gradient of the function you want to optimize, which is the average of this error over thousands of images; you estimate it on the basis of a single image and make an update. So it's a very noisy estimate of the gradient, but experimentally this converges much faster than computing the full gradient before each update, for reasons I'm not going to get into. And the way you compute the gradients is nothing more than the chain rule. That's the most sophisticated mathematical concept that is really used in deep learning: the chain rule.
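To make this concrete, here is a minimal sketch, in plain NumPy, of exactly the procedure just described: a linear layer, a point-wise nonlinearity (ReLU), another linear layer, trained one sample at a time by stochastic gradient descent, with the gradients worked out by hand using the chain rule. All the sizes and the toy data here are illustrative assumptions, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 16, 32, 2               # made-up layer sizes
W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))     # first linear layer
W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))    # second linear layer
lr = 0.01                                       # learning rate

# toy data: the class depends only on the sign of the first input component
X = rng.normal(size=(1000, n_in))
labels = (X[:, 0] > 0).astype(int)

for step in range(5000):
    i = rng.integers(len(X))                 # ONE random sample: "stochastic"
    x, t = X[i], np.eye(n_out)[labels[i]]    # input and one-hot target
    h_pre = W1 @ x                           # linear
    h = np.maximum(h_pre, 0.0)               # point-wise nonlinearity (ReLU)
    y = W2 @ h                               # linear
    # squared Euclidean distance to the target, and its gradient
    grad_y = 2.0 * (y - t)
    # chain rule, layer by layer (backpropagation)
    grad_W2 = np.outer(grad_y, h)
    grad_h_pre = (W2.T @ grad_y) * (h_pre > 0)
    grad_W1 = np.outer(grad_h_pre, x)
    W2 -= lr * grad_W2                       # step along the negative gradient
    W1 -= lr * grad_W1

# after training, check accuracy on the toy data
pred = np.argmax([W2 @ np.maximum(W1 @ x, 0.0) for x in X], axis=1)
print("accuracy:", (pred == labels).mean())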
And it's just a little complicated to wrap your head around this when you want to implement it in very complex networks of functional modules, all of which are parameterized; the one I showed is just a very simplified version. It turns out there are, in terms of computer implementation, very easy ways of representing those networks of parameterized modules that give you the gradient of any function with respect to all the parameters of the system automatically, without you having to write a single line of gradient code. You just write the function that computes the output of your cost function, in Python or whatever your favorite language is, and there is some mechanism, an automatic differentiation system, that will automatically give you the function that computes the gradient of that function with respect to anything you want.

The idea of training neural nets like this goes back many years, to the mid-80s essentially. And back in the late 80s, people like me and others figured out that you can't just take a whole image, which at the time was tens of thousands of pixels and is now millions of pixels, and multiply all the pixels, as if they were a vector, by a gigantic matrix. It's just too expensive. So you have to make some assumptions about what kind of linear operator you're going to use: make it a sparse matrix of a particular kind. That's the idea behind convolutional networks, which are universally used today for anything from image recognition to text understanding, translation, and all kinds of other things that I might mention later.

The idea of convolutional networks is to make the operation performed by those linear operators be convolutions. For example, imagine you have an image here. You take a patch of pixels and you compute the dot product of those values with a window of coefficients of the same size. That gives you a kind of weighted sum, if you want. You pass that through a nonlinearity, you put the result in the corresponding location, and then you swipe this over the entire image, which essentially amounts to doing a convolution, a discrete convolution in this case, and you record the results in the output. The linear operator now has only a few parameters. A convolutional net is one in which you start with the image and apply multiple convolution kernels, each of which you can think of as extracting local features of different types. Then there is another operation, called a pooling operation, which basically aggregates the responses of the filters within an area and reduces the resolution. It's very similar to the averaging in wavelet transforms, or the coarse-graining in some physics models, renormalization group theory and things like that: it basically reduces the resolution, and I'll show you the result in a minute. Then you repeat the process: you apply convolutions to each of those feature maps with different kernels, sum up the results, pass them through a nonlinearity, and then do pooling and sub-sampling again, and you keep going this way. So this is an example of one of those convolutional nets in action: that's the input, that's the first layer, that's after pooling, the third layer which is another set of convolutions, pooling again, convolutions again. I don't show the output here, but you can use this to recognize characters.
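Here is a minimal, purely illustrative sketch of those two operations, again in NumPy: slide a small kernel over the image and take dot products at every location, pass the result through a nonlinearity, then pool to reduce the resolution. The hand-made edge kernel is an assumption for the example; a real convolutional net learns its kernels.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D sliding-window filtering: dot product of the kernel with each patch."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Aggregate filter responses in each size x size area, reducing resolution."""
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    return feature_map[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))

image = np.random.rand(28, 28)
edge_kernel = np.array([[1., 0., -1.]] * 3)           # a hand-made oriented-edge filter
features = np.maximum(conv2d(image, edge_kernel), 0)  # convolution + nonlinearity
pooled = max_pool(features)                           # pooling / sub-sampling
print(features.shape, pooled.shape)                   # (26, 26) (13, 13)
```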
This was widely deployed in the mid-90s for character recognition; it was one of the big successes of neural nets back at the time. But then there was a dark period of about 10 years where people in the machine learning community kind of stopped being interested in those things, for reasons that I don't fully understand.

Back in the early 2000s we realized we could use this for all kinds of stuff, not just character recognition, but also things like driving robots around, by using these systems to classify whether a piece of an image is something a robot can drive over or not, just from lots of examples. The system passes over the entire image, produces its estimate of whether a piece of terrain is traversable or whether it's an obstacle, puts the result in some map, and then it can do planning in this map to reach a particular goal, and it works very well. The nice thing about this one is that you don't need to manually label all the pieces of the image, because there is a second vision system, based on stereo vision, that can figure out whether something is traversable or not, but only for objects that are nearby, so the system uses that to train itself. And then you can get the robot to drive regardless of whether there are annoying grad students trying to stop it; they're entitled to do it, because they wrote the code and they trained it.

You can also use those convolutional nets to label all the pixels in an image by looking at a context around them. You basically slide a window of this convolutional net over the image, and you can train it to classify every pixel: whether it's road or building or sky or tree or a person or a car or something like that. We had a system like this working around 2010, also implemented on a special piece of hardware that could run it at about 20 frames per second. This was before these things became popular, and it caught the attention of a number of companies that started using similar techniques for self-driving cars: one company called Mobileye, an Israeli company that eventually got bought by Intel, and also NVIDIA, and they used this in self-driving cars. So if you are lucky enough to have a car that can drive itself on the highway, chances are it's using a convolutional net to do this, at least the recent ones.

What happened at the end of 2012, early 2013, is that our friends from the University of Toronto implemented one of those convolutional nets very, very efficiently on GPUs, which are those cards designed for graphics rendering that turn out to be very good for numerical computation. They implemented a convolutional net that was much bigger than anything we had done before and entered a competition called ImageNet, with 1.3 million training images belonging to something like 1,000 different categories. The state of the art on this dataset was something like 26% error up to 2011; this is measured on a separate test set with the same categories but different images than the ones you train on. What this convolutional net did was bring that error rate down to 16.4%, and all of a sudden the machine learning and computer vision communities, which had been ignoring these methods until then, started paying attention. What happened over the next several years is that people switched to those techniques and made them better and better, simultaneously making them deeper and
deeper, with more and more layers if you want, and got the error rate down to less than 3%. It's actually still getting better, to the point that it's not really interesting anymore to work on this problem, because it's now below the error that an average human would produce on this dataset. The networks now routinely used in industry are very large and very deep, with 50 layers or so, and they use all sorts of tricks to make them work.

So one question we can ask ourselves is: why are multiple layers good for analyzing natural signals like images? It's probably because the world is compositional. There is some compositionality about the world that physicists of course know about, but even the perceptual world is compositional, in the sense that a scene is composed of objects, objects are composed of parts, parts are composed of motifs, and motifs are composed of elementary features like oriented edges, down to pixels. So there is this natural compositional hierarchy which is well represented by those multi-layer architectures, which learn to detect combinations of features that become more abstract as you go up the layers. So there is a match between the type of functions that can be represented by those deep architectures and the functions that seem to be useful for analyzing natural signals. It seems a bit like a conspiracy, but it just works.

Just quickly, on the recent progress in computer vision. Using techniques similar to what I just showed, convolutional nets that are swept over an image at multiple scales, trained to produce not just a category but also a mask of the object they recognize at every location, we can now do things like this: the outline of each person appearing here, including the dog at the bottom, the baseball bats, the wine glasses (we're in France after all), the people that are partially occluded, the table, the computers in the back. You can count sheep, you can detect backpacks, and so on. It's amazing how well it works. If you had asked people in computer vision a few years ago how long it would take before we could do this, they would have said a very long time; in fact it happened really quickly. So quickly that now we can figure out the key points of a human body and estimate its pose in real time, running on a smartphone. This is open source, by the way, distributed by Facebook, and there are various versions and applications of it for all kinds of stuff. But that's technology.

This can also be used for translation and text and all those things, because it turns out you can represent text by a sequence of vectors, and that's what I'm going to talk about next. And this is where we start talking about graphs, because so far I haven't talked about graphs at all. So one very interesting question here is how to embed graphs using machine learning, perhaps deep learning. And one question we can ask ourselves is: can we embed the world? In other words, if you are a company like Google or Facebook, what you would like is to associate a vector with every object you manipulate, whether it's a person, a user, a piece of content, anything in the world that you have to deal with or manipulate. It would be nice if you could represent it by a vector, because if you could do this, there would be a lot of operations on vectors that you could use to figure out, for example, if one person is
going to like a particular piece of content, or if two people are likely to become friends, or something like that.

A very simple version of this (ooh, that's not very visible) is matrix completion for graph embedding. This has been around for a long time with various techniques, but people really realized how well it worked with the Netflix competition that took place several years ago. Basically, imagine you have items on the columns of a matrix; those items, for example, are movies. You have a bunch of people on the rows, and for some of the entries you have the information that this person liked this movie, or that other person didn't like that movie, symbolized by plus and minus signs. This is essentially a graph, a bipartite graph really, where you have two sets of nodes, people and items, and then you have edges between them that carry values of some kind. What you'd like to do is complete this matrix: there are some entries you don't know, and some entries you'd like to predict. Is this person going to like that movie or not? You could try to reason by similarity: this person looks kind of like that person over there, so maybe we can impute this missing value as being similar to that person's value, because those two people are similar.

But a really systematic way of doing this is through matrix factorization. Essentially you say: I want to associate a vector with every person and a vector with every item, in such a way that when I compute the dot product of this vector and that vector, I get an estimate of the value, say plus one for plus and minus one for minus, something like that. And you can do this very easily through stochastic gradient descent again. You start with random vectors, and then you ask: how should I change those two vectors so that their dot product gets closer to the value I want? So here I'm going to bring those two vectors closer together, keeping them normalized, so that the dot product gets closer to one; and there I'm going to push this vector and that vector a little away from each other, so that the dot product becomes more negative, and so on. It's a form of matrix factorization; there are tons of different objective functions and optimization algorithms you can use for this, and it can be done at a very large scale. A particular special case of this is essentially just SVD, but there are other methods that are slightly more appropriate for other types of data.

So that's the simplest way of doing matrix completion, and it basically comes down to predicting the edges in a graph. You have a graph whose edges you want to predict: imagine the items are other people, or maybe the same people, and you know who is friends with whom, and you'd like to know whether those two people are likely to become friends, whether this is someone you may know. Facebook actually has a service that does this; it's called PYMK, People You May Know. Okay, not very visible.
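Here is a hedged sketch of that matrix-factorization scheme, in NumPy, with made-up sizes and a toy handful of observed ratings; a real system would have millions of entries and would add regularization and bias terms, all omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items, dim = 100, 50, 8
P = rng.normal(0.0, 0.1, (n_people, dim))   # one learned vector per person
Q = rng.normal(0.0, 0.1, (n_items, dim))    # one learned vector per item

# observed entries of the matrix: (person, item, +1 liked / -1 disliked)
ratings = [(0, 3, +1.0), (0, 7, -1.0), (5, 3, +1.0)]   # many more in practice

lr = 0.05
for epoch in range(200):
    for u, i, r in ratings:
        pred = P[u] @ Q[i]              # dot product predicts the rating
        err = pred - r
        # SGD step: pulls the two vectors together (r=+1) or apart (r=-1)
        gP, gQ = err * Q[i], err * P[u]
        P[u] -= lr * gP
        Q[i] -= lr * gQ

# to fill in a missing entry, just take the dot product of the learned vectors
print(P[0] @ Q[3])   # close to +1 after training
```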
A very successful form of this, to some extent, is in the context of what's called unsupervised learning or self-supervised learning: finding vector embeddings for words and text. There is pioneering work on this going back to 2003, with Yoshua Bengio's work on using a neural net to predict the next word in a text, and then some more recent work by Collobert and Weston, who also used a neural net to predict words in a text, to do various things with text.

Basically, there's a form of this that looks like this. You feed in a text, and you can think of a word as essentially a very long vector, of size say 100,000, where one component is one and the other ones are zero. That's called a one-hot encoding, and the component that is one is the component corresponding to the index of the word in the dictionary. Very simple: you have a dictionary of 100,000 words, you have a vector of 100,000 dimensions with one component set to one. You multiply this by a matrix, which basically amounts to a table lookup: it selects which column of the matrix is activated, and that column is the word embedding. So that's the first layer: this one-hot vector is multiplied by this matrix and produces a vector. Then those vectors are combined in some way by some neural net, perhaps of the type I showed you, and you try to predict one of the words in the sequence, perhaps the middle word that you didn't show the system, or perhaps the last word, if you want to predict what the next word is going to be, as in a language model.

It turns out there is a very simple form of this called word2vec, where those functions are essentially linear. You can't really see the two models here, but they're very similar to this: you take the words surrounding a particular word, and you compute embeddings for those words in such a way that when you sum them up, the sum essentially predicts the embedding of the middle word. And you can learn all those embeddings, all those vectors for every word, using basically stochastic gradient descent. There's another form here that's similar; anyway, it's popular. Those things work amazingly well, at the word level, and even at the level of groups of characters or groups of words. There's a method called fastText, which is extremely popular, produced by some of my colleagues at Facebook, that produces embeddings not just for single words but also for common phrases. So that allows you to map words or phrases or even sentences to vectors, which you can then manipulate.

And there is this amazing property that just emerges when you do this kind of training, where you can think of text as a graph, because essentially what this does is try to predict whether a word appears in the context of other words. So think of text as a graph: take a giant piece of text and build a graph where every word is a neighbor of all the words that surround it, say five before and five after, and then try to compute a vector representation for each word in such a way that you can predict whether two words are ever going to be neighbors in a long collection of text. You train the system on a very, very large text corpus, and this is the kind of embedding you get. The amazing thing is that the embeddings you get have this compositional semantics property: if you take the vector for Tokyo, the vector for Japan, the vector for Berlin, and the vector for Germany, and you subtract Germany from Berlin, or Japan from Tokyo, you get more or less the same vector, not exactly, but more or less. So the system has understood what the relationship between a country and its capital is, essentially just by reading tons of text and figuring out that the name of a city generally appears next to the name of the corresponding country, and that just pops up naturally in the embedding.
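Here is a toy-scale, hedged sketch of the word2vec idea in its continuous-bag-of-words form: sum the embeddings of the surrounding words and train that sum to predict the middle word. Real systems use tricks like negative sampling and train on billions of tokens; the tiny corpus and all the names here are illustrative only.

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, dim, window, lr = len(vocab), 16, 2, 0.05

rng = np.random.default_rng(0)
E_in = rng.normal(0.0, 0.1, (V, dim))    # the lookup table of word embeddings
E_out = rng.normal(0.0, 0.1, (V, dim))   # output-side embeddings for prediction

for epoch in range(300):
    for t in range(window, len(corpus) - window):
        context = [idx[corpus[t + o]] for o in range(-window, window + 1) if o != 0]
        center = idx[corpus[t]]
        h = E_in[context].sum(axis=0)       # sum of the context embeddings
        scores = E_out @ h                  # score every word in the vocabulary
        p = np.exp(scores - scores.max()); p /= p.sum()   # softmax
        grad = p.copy(); grad[center] -= 1.0              # cross-entropy gradient
        gh = E_out.T @ grad
        E_out -= lr * np.outer(grad, h)
        for c in context:                   # update each context word's embedding
            E_in[c] -= lr * gh

# at real scale, analogies like vec(Berlin)-vec(Germany) ~ vec(Tokyo)-vec(Japan)
# emerge from this training; this vocabulary is far too small to show that.
```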
You can exploit this to build a question-answering system. This is a relatively old system, and there's been a lot of progress on this in the last five years; it was built by my colleague Antoine Bordes and his collaborators just before he came to Facebook, and he kept working on it at Facebook. The idea is that the graph you're trying to represent is a knowledge base. A knowledge base is a bunch of entities, say George Clooney, and relationships: is married to, is male, they were married in Honolulu in 1987, blah blah blah. So you have this knowledge base somewhere; there's a free one, Freebase, which now belongs to Google. And what you can do is a kind of one-hot encoding of a question. Take a question: who did Clooney marry in 1987? You could have a vector for each of those words, but in fact, in this case, they don't even compute a vector for each word; they just represent the sentence by a single vector of the size of the dictionary, where each component is set to one if the corresponding word appears in the sentence. So this is a very sparse vector, with only a few ones and many zeros, in very high dimension. Then there is essentially a linear mapping, a matrix, that you multiply this vector by, which computes a vector that represents the question; we're going to learn this matrix.

Simultaneously, we take a subgraph that contains the entities that appear in the question, because the rest of the graph is not going to be relevant. So you take a big chunk of graph containing the relevant entities, and here you do the same: you encode this subgraph by a vector that indicates which entities are present, and again we learn a matrix that maps this linearly to another vector. And we train this in such a way that when we compute the dot product of the representation of the question and the representation of the correct answer, which is a particular arc in this graph if you want, we get a high score; and when we compute the dot product of the question with something that's irrelevant or is not the correct answer, we get a low score, a low dot product. You train this on Freebase, and then you can basically do link prediction on Freebase and answer questions that may not necessarily be in the database in the first place, because you have some sort of compact representation of which answer is relevant in the subgraph, using this linear map. This is a very simple thing; much more sophisticated approaches to this problem have appeared more recently, using complicated neural nets, but this can be used to answer simple questions, with a list of potential answers, and so on.
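Here is a minimal sketch of that scoring scheme, assuming NumPy. Everything is made up for illustration: a bag-of-words vector for the question, a bag-of-entities vector for the candidate answer, two learned linear maps, and a margin-ranking update that pushes the correct answer's dot product above a wrong one's. It is a sketch of the idea, not Bordes et al.'s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_entities, dim = 1000, 500, 64   # vocabulary / entity counts (made up)
lr, margin = 0.01, 0.5
W_q = rng.normal(0.0, 0.1, (dim, n_words))     # learned map for questions
W_a = rng.normal(0.0, 0.1, (dim, n_entities))  # learned map for answer subgraphs

def score(q_bow, a_ent):
    """Dot product of the embedded question and the embedded candidate answer."""
    return (W_q @ q_bow) @ (W_a @ a_ent)

def train_step(q_bow, good_ent, bad_ent):
    """Margin ranking update: the correct answer should outscore a wrong one."""
    global W_q, W_a
    if margin - score(q_bow, good_ent) + score(q_bow, bad_ent) > 0:
        diff = good_ent - bad_ent
        grad_Wa = np.outer(W_q @ q_bow, diff)   # raises score(good), lowers score(bad)
        grad_Wq = np.outer(W_a @ diff, q_bow)
        W_a += lr * grad_Wa
        W_q += lr * grad_Wq

# a question as a sparse bag of words, answers as bags of entities (all made up)
q = np.zeros(n_words); q[[3, 17, 42]] = 1.0
good = np.zeros(n_entities); good[7] = 1.0
bad = np.zeros(n_entities); bad[99] = 1.0
train_step(q, good, bad)
```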
Okay, so what can we do with these graph-embedding representations for language; can we use them for translation? Every translation system now widely deployed, by Google, Microsoft, Facebook, and so on, uses neural nets, and uses those representations, those embeddings of words as vectors essentially, feeding them to neural nets that produce the translation in another language. The problem is that this requires a lot of data: you have to have a very large corpus of parallel text in the source language and the target language. That's okay if the languages you handle number in the dozens or so; you may have a lot of parallel text for, say, French and English, or German and Swedish, things like that. But what about Swahili and Urdu? There is probably not a huge amount of parallel text in those two languages. People use something like 6,000 languages on Facebook, and there is no way we can have translators for every pair; that would be 36 million. That's just too many, and we just don't have the data.

So what we can ask is: is there any way that the structure of language can be used to produce translation systems that don't require parallel text? One idea, of course, is to use an internal representation of text that is independent of language. You show the system a sentence in English, let's say. The first step is to turn it into a sequence of vectors, and then you run it through a neural net that produces a vector, or a sequence of vectors of some kind, that represents the meaning of that sentence; then you can run that through another neural net that produces the translation into the target language you want. The problem is how to produce those representations.

One idea that popped up very recently, in work done at Facebook AI Research in Paris by Guillaume Lample and his collaborators (there has been a series of papers on this), is to exploit the fact that all the languages in the world talk about more or less the same things, so the structure of language is actually very similar across languages. What you do is find word or phrase embeddings for one language, these are the red dots if you want, from a corpus of that language. The embedding is done by trying to predict a word from its neighbors, something like that, using the fastText technique, on short phrases, groups of words that frequently appear together. You do the same for the other language, and then you ask: is there a simple transformation that will essentially turn the blue vectors into the red vectors? A very simple transformation, without too much distortion, like an affine transformation for example. And you might find something that maps more or less, just because similar concepts are manipulated in the various languages. So there's a cloud of points for one language and a cloud of points for another language, and they may have similar shapes; they're just arbitrarily transformed relative to each other by the algorithm that finds the embeddings.

Now there are two other tricks we can use. One trick is to use a language model, which you can think of as a kind of smoothing. A language model essentially says: I know that only a particular set of words can follow the sequence of words I just observed. So if I'm doing a translation and a word comes out that is not predicted by the language model for my language, it's probably wrong. That gives you a little bit of smoothing and error correction, if you want. The other trick is that you can translate in both directions: to figure out whether two sentences correspond to each other, you take a sentence in one language, transform it into the other one, then take a similar sentence that is nearby in the embedding space of the second language and translate it back; and if you require those two things to be similar, you can adjust the positions of all those embeddings so that the whole thing works out.
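Here is a hedged sketch of the alignment step, under a strong simplifying assumption: that we already have a seed dictionary of matched word pairs. In that case, finding the best rotation carrying one embedding cloud onto the other is the orthogonal Procrustes problem, with a closed-form SVD solution. Lample et al. manage without the seed dictionary, using adversarial training plus refinement, which this sketch omits entirely; the synthetic data below just simulates two clouds related by an unknown rotation.

```python
import numpy as np

def procrustes(X, Y):
    """X, Y: (n_pairs, dim) embeddings of translation pairs in the two languages.
    Returns the orthogonal W minimizing ||X W^T - Y|| (Procrustes solution)."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

rng = np.random.default_rng(0)
dim, n_pairs = 32, 200
src = rng.normal(size=(n_pairs, dim))                  # "blue" cloud (source language)
true_rot, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # unknown transformation
tgt = src @ true_rot.T + 0.01 * rng.normal(size=(n_pairs, dim))  # "red" cloud + noise

W = procrustes(src, tgt)
mapped = src @ W.T
# the mapped source vectors now sit on top of the target cloud; nearest
# neighbours in the target space act as candidate translations
print("max alignment error:", np.abs(mapped - tgt).max())
```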
And the beauty of this is the following. People in translation measure the performance of a system with something called the BLEU score, which I'm not going to explain; it's imperfect, but it's the best we have. What they've done here is compare the BLEU score of a completely supervised system, as a function of how many training sentences you use to train it in supervised mode, with the BLEU score of the completely unsupervised system produced this way. If you have fewer than a few hundred thousand parallel sentences (this is English to French, I believe), then you're better off using the unsupervised system; if you have more than a few hundred thousand correctly, manually translated parallel sentences to train on, then the supervised system is better. So what that means is that for languages with little parallel data you can use things like this, and that's basically using graph-embedding techniques. There's a nice video that Facebook people put together to demonstrate the process, but it doesn't help you understand much more once I've explained it.

Okay, now for something with some more mathematically interesting concepts, if you can read it. Okay, I'm not sure how this slide can be fixed; I guess it can't. Someone suggests using a neural network to denoise the image; yes, exactly, to restore the image. Sorry, I didn't hear: Control-R? What is that going to do? Unfortunately it's a pixel image I grabbed from a PDF, including the text. Okay, anyway, I'm going to explain.

There is this idea that a lot of the graphs you want to represent encode hierarchies, hierarchical structures, and using Euclidean distance to embed the nodes of such a graph, when the graph is more or less a tree, is very inefficient: you need a high-dimensional space to place the nodes in a Euclidean space in such a way that the Euclidean distance between points corresponds to the graph distance, if you want. However, as I think Mikhail Gromov showed, you can embed trees in two dimensions very easily if you use a hyperbolic metric. Okay, in two dimensions it's only approximate; if you really want an accurate representation you need more than two dimensions, but you can already get pretty close with two. So why not use a hyperbolic metric between vectors instead of the Euclidean metric, and try to compute embeddings using that metric directly? It's basically the idea that the geometry of the data you're trying to represent is not necessarily Euclidean. You can always embed things in Euclidean spaces, but you might require too many dimensions, and then you lose the regularity and the advantage of using the embedding, the generalization ability it might give you. You can't read it, but the slide says Gromov 1987.

So the idea that my colleagues Maximilian Nickel and Douwe Kiela had is to use the Poincaré ball as a model of hyperbolic space. I hope you can read the formula. You define the distance between two vectors u and v inside the unit ball as d(u, v) = arcosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))); you can't see the squares on the slide, but there is a square here and here. In that metric, a geodesic between two points near the boundary is curved, it looks like this, and the center is pretty close to every point, which is why, if you have two points far away from each other near the boundary, the geodesic between them passes pretty close to the center.

So now the problem we have is this: we have a collection of objects, and we know the
pairwise distances between some of those objects, and what we want to do is find vectors associated with each of those objects such that the Poincaré distance, within the Poincaré ball, predicts or approximates the distance we have in our graph. You can do this with stochastic gradient descent, a technique similar to the one I showed earlier for matrix factorization. You say: let me compute the hyperbolic distance between those two guys; if it's not right, if it's too large, I'm going to bring them closer, and if it's too small, I'm going to move them apart. What you have to do is compute the gradient of the objective function, which depends on the difference between the distance you want and the distance you get, and multiply it by the inverse of the metric tensor of the hyperbolic metric, which in this case is just a scalar, a scaling factor. It says: when you're close to the boundary, scale the step down by a large number, because near the boundary, vectors that appear close to each other in Euclidean terms are in fact very far apart in the hyperbolic metric; so you scale by a factor depending on how far the points are from the origin, essentially. And then they use an exponential map to do the learning, for various reasons, and you just do stochastic gradient descent and update the points this way. This is, in fact, the update rule, which you can't read.
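In code, the two ingredients just described look roughly like this; a minimal sketch only. Nickel and Kiela's actual method uses the proper exponential-map update and a full training loop over sampled graph edges, both omitted here; the simple projection step below is an assumption standing in for the retraction.

```python
import numpy as np

def poincare_dist(u, v):
    """d(u,v) = arcosh(1 + 2||u-v||^2 / ((1-||u||^2)(1-||v||^2))), inside the unit ball."""
    num = 2.0 * np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + num / den)

def riemannian_sgd_step(x, euclidean_grad, lr=0.01, eps=1e-5):
    """Rescale the Euclidean gradient by the inverse metric, (1-||x||^2)^2 / 4,
    take a step, and project back inside the unit ball if needed."""
    scale = (1.0 - np.sum(x ** 2)) ** 2 / 4.0   # small near the boundary: tiny steps
    x = x - lr * scale * euclidean_grad
    norm = np.linalg.norm(x)
    if norm >= 1.0:
        x = (1.0 - eps) * x / norm
    return x

u, v = np.array([0.1, 0.2]), np.array([-0.5, 0.4])
print(poincare_dist(u, v))
```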
One of the things they applied this to is embedding the concepts of WordNet. WordNet is a dataset containing a very large number of words of the English dictionary, organized as a hierarchy: it says, you know, a house cat is a feline, a feline is a mammal, a mammal is an animal, and so on, with the whole hierarchy going all the way up to "entity", which is sort of the root: physical objects versus concepts, and organisms versus artifacts versus natural objects and locations and things like that. What they did was build a graph which is basically the transitive closure of the WordNet hierarchy: they put an arc whenever something is a first, second, or n-th order child of another node. Then, using this Poincaré embedding, they try to find an embedding of the nodes of this graph, so that the Poincaré distance between two points predicts whether one is a descendant of the other. In two dimensions you get this sort of embedding, and it kind of makes sense, right?

You have food here; you have all kinds of material things here, wood, hardwood, and so on; here you have concepts like rate and metabolic rate; here you have types of plants and stuff like that, and abstractions of various kinds, and physical entities, organisms, artifacts, plants, herbs; scientists right here next to herbs, mathematicians here too: yeah, Gauss was a mathematician, okay. So you get this hierarchy reflected quite naturally, and you can measure how good this representation is at predicting the edges in the graph. Those numbers are supposed to be the good numbers, the best numbers in the table, so you don't need to look at the other ones. I really apologize for this; I didn't realize the projector was going to wash everything out.

You can of course use this for social graphs as well, something Facebook is really interested in. And, I hope you're going to be able to see this, not much: this is for historical linguistics. Here you take a bunch of languages, some existing, some extinct, and for most of those languages linguists have identified cognates, which are words that are essentially the same, or very similar, in two languages. So if you have the graph of cognates, you can try to figure out whether there are hierarchical relationships between languages. What you don't see here is that at the center you have Indo-European and Vedic Sanskrit, and here you have all the Romance languages, the Celtic languages, Irish and Welsh and probably Breton and things like that, and these are the Slavic languages, the Germanic languages, Greek, Persian, Armenian, and so on. So they sort of mapped out the languages in their kind of temporal affiliation.

Now, all of the things I talked about so far are methods not really to map objects to vectors, but to pre-compute a vector for every object that you already know. With the word embedding techniques I showed you, you have to know all the words in the dictionary before you can start working; you have to know them all in advance. So what you get at the end is not a function that maps words or sentences or whatever to vectors; you just have a table, essentially. But you can't do this for images. It's not like you can have a table of every possible image in the world and then look up the corresponding vector for each image; there's no way you can use those techniques. What you have to do instead is train one of those neural nets, perhaps a convolutional net if you're applying it to images, to map images to vectors in such a way that those vectors capture semantic similarity, or whatever it is you're interested in. Similarly for text: you'd like to feed a text into a neural net and have something come out that captures the meaning of that text, or its topic, similar to what I was talking about for translation earlier, but not just for a word: it would be an entire text. That would be really interesting, because with this we could try to embed the entire world: have texts and images and concepts and videos and things like that be nearby in this space when they talk about the same thing.

One application of this that has been quite successful at Facebook for the last four or five years is face recognition. The idea there uses what's called a Siamese network, or metric learning, and it's the idea that you have a neural net,
one of those convolutional nets, that maps images to vectors, and what you have is a graph of similarity over all the images in your database. For example, faces: you have a large collection of face images, and you know that this group of 10 images are images of the same person, and this other group of 10 images are images of a different person. So now you have a graph that says those 10 images are similar to each other, but dissimilar from those other 10 images, which are themselves similar to each other. What you'd like is to compute an embedding of this graph in such a way that every face image is associated with a vector, and the Euclidean distance between vectors corresponds to identity: two images of the same person are mapped to two vectors that are very close to each other, and two images of different people are mapped to two vectors that are far away from each other. That's basically what this does, using machine learning. You show two images of the same person, run them through this neural net, it produces two vectors, and you tell the system, in its objective function, to bring those two vectors closer together: tweak the parameters of the network so that those two vectors get closer to each other. Conversely, if you only do this, then the whole system collapses: the network ignores the input and produces the same vector for every image. So you have to have a contrastive term that says: here are two images of different people, and now I want you to move those two vectors away from each other, up to some distance, for example with a set margin of some kind. That's called a Siamese net. The idea goes back to the early 90s, actually; the first papers on this were in 1993, but the practical application to things like face recognition is very recent, around 2014.
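A minimal sketch of that objective, with the embedding network stood in for by a placeholder function (in reality it is a convolutional net, and the two branches share the same weights, which is what makes it "Siamese"):

```python
import numpy as np

def contrastive_loss(z1, z2, same_person, margin=1.0):
    """z1, z2: embedding vectors produced by the SAME network (shared weights).
    same_person: True if the two images show the same identity."""
    d = np.linalg.norm(z1 - z2)
    if same_person:
        return d ** 2                    # pull the two vectors together
    return max(0.0, margin - d) ** 2     # push apart, but only up to the margin,
                                         # which prevents the collapse mentioned above

# toy usage with a stand-in embedding function
f = lambda img: img.mean(axis=0)         # placeholder for a convnet
img_a, img_b = np.random.rand(8, 4), np.random.rand(8, 4)
print(contrastive_loss(f(img_a), f(img_b), same_person=False))
```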
You can use techniques like this for other things, like figuring out whether two images are similar. People post pictures on Instagram, for those of you who either use Instagram or have children or grandchildren who use Instagram, and whenever you post a picture on Instagram you also type hashtags, which correspond a little bit to the content. So you can try to find embeddings that predict the hashtags, and if you do this you get an embedding of images that looks like this. This is a reduction of that embedding to two dimensions, even though the original embedding has much higher dimension, more like 4,000. In one area of the space you have dogs; in another area you have food, essentially people taking pictures of meals; this is the flower corner; there are people, and that's events, those are fireworks, and those are buildings, and so on for every concept.

More recently, people at Facebook have tried something a little bolder, which consists in training a fairly large convolutional net with 3.5 billion images from Instagram, trained to predict the set of hashtags that someone would type for a particular image. 3.5 billion images. And if you take this network, which has a classification layer on top that predicts the hashtags, train it, remove that top layer, and then just retrain one or two layers on top to classify images from ImageNet, the dataset I was telling you about earlier, what you get in the end is the system that has the best performance on ImageNet; it holds the record. So what this tells you is that if you work at a very large scale, if you exploit the regularity that exists in the world in a way that is not task-specific, using as much data as you can to train a system to produce vectors, and then use those vectors as inputs to a system that does an actual task, like recognizing object categories, it works better than if you had trained for the task directly, because you never have enough data.

Okay, the last thing I want to talk about, and I'm not going to get to self-supervised learning I'm afraid, is perhaps more interesting for this crowd. It's somewhat disconnected from the previous things, but again has some connections. It's the idea of generalizing the concept of neural nets, or convolutional nets, to input data types that are not multi-dimensional arrays but are functions on graphs. You can think of a multi-dimensional array, say an image, which is a 2D array of pixel values, as a grid graph, where each pixel is linked to its nearest neighbors and where the value on each node is just the value of the pixel; and we use convolutional nets on this graph.

So what is a convolution? There is the low-level definition of convolution, just the formula: you have a convolution kernel that you correlate with the signal you want to convolve, at all shifts, with a minus sign in the shift. And there is a more abstract definition: a convolution is a diagonal operator in the eigenspace of the graph Laplacian. Essentially, think of an image as a 2D grid graph where nearby pixels are linked by an edge with, say, value 1, and there are no connections anywhere else. Now compute the graph Laplacian of this graph, and compute the transformation of a function on this graph into the eigenspace of that graph Laplacian; that's the definition of the Fourier transform. I mean, that's actually how Fourier came up with the Fourier transform, because of heat diffusion. In Fourier space, a convolution is a diagonal operator, a point-wise multiplication, and so basically a convolution is a diagonal operator in the eigenspace of the graph Laplacian.

Now, you can define this transformation for any graph. Take a graph of any shape, with any number of nodes, compute its graph Laplacian, and compute the operator that transforms a function on this graph into the eigenspace of the Laplacian, which is basically a big matrix whose dimension is the number of nodes squared. And if you multiply the transformed function by a diagonal matrix, it's as if you were doing a convolution on the original graph. So you can define convolutions on functions on irregular graphs.
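Here is a minimal sketch of that spectral construction on a tiny hand-written graph (a 5-cycle), assuming NumPy; the diagonal filter here is fixed, whereas a spectral network would learn those diagonal coefficients.

```python
import numpy as np

# a small graph given by its adjacency matrix (symmetric, 0/1 edges: a 5-cycle)
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))        # degree matrix
L = D - A                         # (unnormalized) graph Laplacian

eigvals, U = np.linalg.eigh(L)    # eigenvectors = the graph "Fourier" basis

signal = np.random.rand(5)        # a function on the nodes
spectrum = U.T @ signal           # graph Fourier transform: O(n^2), the expensive part
g = np.exp(-eigvals)              # a diagonal filter (a low-pass one here)
filtered = U @ (g * spectrum)     # multiply point-wise in the eigenspace, map back
print(filtered)                   # = the "convolved" signal on the graph
```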
There was an original paper on this, called Spectral Networks, by Joan Bruna; I was a co-author on it. Joan Bruna was a student of Stéphane Mallat years ago and is now a professor at NYU. This spurred a whole bunch of activity and various ideas on how to exploit this in various ways. This is not the article I was hoping to refer to, but there is a review paper, that's a review paper that a bunch of us wrote about this idea of geometric deep learning: basically, how to apply neural nets and convolutional nets to data that doesn't come to you in the form of a multi-dimensional array. In particular, you can use this to apply neural nets to regulatory networks, social networks, functional networks, 3D shapes. Say, for example, your data comes in the form of a 3D mesh, and you'd like to feed that 3D mesh to a neural net and have it recognize whether it's a hand or a leg or something, so that you can match one 3D mesh to another, or do other applications of this type in all kinds of domains.

Now, the idea of using the eigenspace of the graph Laplacian has limited practical value, because this transformation operator is very expensive: it's n squared, where n is the number of nodes in your graph. So it's not going to be practical for images, for example. You have to cut corners, and the review paper I mentioned goes through a whole bunch of methods that make this practical, either through approximations or through other ways of defining convolutions on non-Euclidean domains.

People have been applying this in many places, and probably one of the most interesting applications in recent years is in chemistry. You can represent a chemical compound as a graph, or rather as a function on a graph, where the value on each node corresponds to the identity of the atom, perhaps with some characteristics like energy levels, number of electrons, and so on. Then you can feed, say, two molecules to a neural net that trains itself to compute embeddings that predict whether the two molecules will stick to each other, which would be a good way of figuring that out. Alán Aspuru-Guzik in Toronto, he just moved from Harvard to Toronto, was using this to figure out how to synthesize molecules. There's a lot of really interesting work in this area; Stéphane Mallat has been working on similar things, also using neural nets on graphs. So you can have systems with regular grid graphs, that's a classical convolutional net; fixed irregular graphs, which is basically the spectral networks I mentioned earlier; and then dynamic irregular graphs. The application to chemistry is a typical situation where the graph changes for every new data point: one molecule does not have the same graph as another molecule, so you get a different structure every time. There was a workshop at IPAM that I co-organized earlier this year with a whole bunch of interesting talks on this topic.

Okay, so I'm out of time, so I'm going to end with two things; I'm not going to talk about self-supervised learning, I'm afraid. First, a question, and the question is: what are we missing? What I just talked about, deep learning, is what people now call AI; when you hear people talking about AI, it's all about deep learning. But it's not like we know how to build truly intelligent systems, and I'm asking myself: what is missing from the type of learning I talked about here, supervised learning, that would allow machines to learn a little more like humans and animals? And that's self-supervised learning, but that's not what I'm going to talk about.

But then there is a bit of a more cosmic conclusion, given the background of a lot of people here in the audience. It's been the case in the past in science, very often, not always of course, but quite often, that new areas of science appeared after people came up with artifacts, and those new areas of science were basically derived from the need to explain how those artifacts worked, or to figure out what their limitations were. The telescope was invented long before optics was figured out. The steam engine was invented before thermodynamics, and thermodynamics was basically derived to explain
Carnot cycles and things like that, right; entropy, I mean, it all came up because of the steam engine, essentially, and that became one of the most fundamental intellectual constructions of science, thermodynamics. Aerodynamics kind of existed before the airplane, but not the really tricky parts like stability and lift and drag and things like this. Certainly calculators were developed before computer science became a science, and it was also the case for telecommunication: the artifacts were invented before information theory was developed. So one question I have, a scientific question that I'm really interested in, is: is there an equivalent of thermodynamics for intelligence? Is there a sort of general theory of intelligence, or of intelligent signal processing, data analysis, deriving knowledge from data, whatever you want to call it, that plays the same role as thermodynamics played for thermal machines? That's my scientific program for the next several years, until my brain turns to white sauce. Thank you very much.

When you were mapping languages onto each other, did you find, I mean, an underlying universal language, if I understood properly? Can you say something about that: did you find some universal properties of languages, in terms of structures? I mean, you are assuming that there is a universal...

You're assuming that there is one, to the extent that it can produce translations of the type that current translation systems produce, which are very far from perfect. It's not like the translation problem is solved; those systems make a lot of mistakes. They're very useful, but they don't understand anything about how the world works, other than through the statistical relationships between words, essentially. So they will make very, very stupid mistakes when they translate. So I don't think we can really answer that question.

I mean, if you analyze the data, did you find some universal, Chomsky-style universal structure of language?

This basically throws out the window everything that Chomsky said, in the sense that there is nothing like grammar, syntax, anything like this, other than the statistical neighboring relationships of words in a corpus. There's no hardwired grammar, there's nothing like that, there's no parsing; all of this is purely data driven. So if there is some structure, it should appear in the internal representations. I don't think there's been a huge amount of analysis; I'm not a co-author on that work, it's work by others at Facebook. But clearly there is some property there, some sort of interlingua, as linguists call it, which is a common internal representation that doesn't depend on the particular language you express yourself in. I think there are probably years of work there to really analyze what's going on and to make it better; people are more interested in making it better than in figuring out how it works.

I also had a question about language. You said how one finds the vectors for words; then, when one associates a vector to a sentence, does one just take the sum, or what do people do?
There are tons of different methods for that.

Okay. And then, does anybody find any regularities, like the one before, France minus Paris is equal to Spain minus Madrid, at the level of phrases?

I don't know, actually; natural language processing is not my application specialty. But this relational structure certainly appears when you assume that the vector corresponding to a word is obtained by summing up the vectors associated with the neighboring words. So when you have a linear prediction from the embeddings, you get this compositional structure; you also get it in other types of learning. The general form is: you have words that are turned into vectors using a lookup table that you train, and then you can feed these not just to a linear combination but to a complicated neural net that does something with them to predict the next word, or the middle word, or whatever. There's an increasingly large body of work on different architectures for this, like transformer networks, transformer architectures and the more recent versions of them; there are memory networks, those things that have association and attention and all that stuff, which I didn't get into. Which of these will win out in the end is not clear. I think what people are interested in at the moment, which is really exciting, is trying to figure out representations of sentences that are not task specific, so they would work not just for translation, not just for text classification, but for everything you want to do with text: question answering, information retrieval. A little bit like the hashtag-prediction stuff I showed for images, where you can specialize a generic representation for a particular task and it works better than if you had just trained for that task. This idea of transfer learning and meta-learning is really important.

Thank you for this interesting talk. I had a question: how can topology make CNNs better, apart from using spectral graph CNNs and different distance measures? Other types of representations, or definitions of convolutions that would not go through the spectral eigenspace? Can the interaction with mathematical topology make CNN-like models better?

I see. There are a bunch of things that are interesting in topology that are not connected with this. I don't know, actually; I don't know whether topology can help in determining the appropriate architecture. Obviously there are only a finite number of topologies that make sense. There may be cheap ways to compute convolutions for certain topologies that we might be able to exploit, the way fast Fourier transforms can be exploited when you have a grid graph. That also works if you have a toroidal topology, where the Fourier transform can be defined easily; and if you have a sphere, a Fourier transform works too, that's spherical harmonics. But for other topologies, I have no idea.

Now, there is something else, a class of models that people are really interested in that I didn't get to talk about, called adversarial networks, generative adversarial networks, or generative networks more generally. The problem those try to solve is the opposite of "here is an object, find the embedding"; it's the other way around. What you want is to learn a function such that, given a point sampled in a low-dimensional space, say a sphere or a cube, you pick a point there, map it through a function, a neural net, and out comes an image, a natural image. And when you smoothly move that point in the original space, you smoothly
change the image in ways that make sense, so that every point in the sphere that you run through the neural net generates an image of a face, or a bedroom, or something like this, and by going from one point to another in that sphere you change the gender of the person, or the age, or the amount of hair, or the skin color, or whether the person has glasses or not. So there's very interesting work in this area that I didn't get to talk about; it's very exciting. And there I think topology might be relevant, because if you take the set of all images of a face from all possible angles, it's topologically equivalent to a sphere, but an empty sphere, and if you use a latent space that doesn't have the right topology, I don't know what happens; it's probably not going to work. So there are probably interesting things to do there.

I think we have five talks after the coffee break, so we should keep to time. Thank you.