So welcome everyone to this lecture on graph convolutional networks. Here is the outline. First I will go quickly over traditional ConvNets and their architecture, then I will introduce graphs and recall the definitions of convolution so that we can extend them to graphs. Then I will present two classes of graph ConvNets: the first is what I call spectral graph ConvNets, and the second is spatial graph ConvNets. I will talk a little bit about benchmarking graph neural networks, and finally I will conclude.

Okay, so let's start with traditional ConvNets. We all know ConvNets were a breakthrough in computer vision. For the ImageNet image classification task, when a ConvNet was first used in 2012, it decreased the classification error by almost a factor of two. It was basically the end of handcrafted features, and the paradigm shifted to learning the features end to end. And for this very specific task, we now reach superhuman performance. ConvNets are also a breakthrough in speech and natural language processing: at Facebook, when you want to translate, you are also using ConvNets.

So ConvNets are powerful architectures for solving high-dimensional learning problems. We all know about the curse of dimensionality. If you have an image of, say, 1000 by 1000 pixels, you have one million variables, so an image can be seen as a point in a space of one million dimensions. If you sample each dimension with only 10 values, you get 10 to the power of one million possible images. These spaces are really huge, and the question is how you find the needle of information in this big haystack. ConvNets are really powerful at extracting the best possible representation of your image data to solve such problems. Of course, we do not yet understand everything about ConvNets; it is a kind of miracle how good they are, and that is also quite exciting, because it opens many research areas to understand them better and to develop new architectures.

When you use ConvNets, you are making an assumption. The main assumption is that your data, images, videos, speech, is compositional, meaning it is formed of patterns that are local. This is the contribution of Hubel and Wiesel: a neuron in this layer is connected to a few neurons in the previous layer, not to all of them. This is the local receptive field assumption. Then you also have the property of stationarity: some patterns are similar and shared across the image domain, like the yellow patches and the blue patches, which are all similar to each other. The last property is that the data is hierarchical: low-level features are combined to form medium-level features, and these are combined again to form higher and higher-level abstract features.

Any ConvNet works the same way. The first part of the architecture extracts these convolutional features, and the second part solves your specific task, like classification, recommendation, and so on. This is what we call an end-to-end system.
The first part learns the features, and the second part solves your task.

Okay, let's look more precisely at the data domain. If you have images, volumes, or videos: take this image, for example, and zoom in; what you have is a 2D grid. That grid is the structure of the domain of the image, and on top of this grid you have features. In the case of a color image, you will have three features per node, which are red, green, and blue. Now, for natural language processing, with sentences, you have a sequence of words, and you can see that as a 1D grid; on top of this grid, each node carries a word, which can be represented by an integer, for example. The same for speech: what you see here is the variation of the air pressure; the support is again a 1D grid, and for each node of the grid you have the air pressure value, which is a real number.

So I think it is clear: we use grids all the time, and grids are a very strong, regular spatial structure. For this structure, we can mathematically define the key operations like convolution and pooling, and in practice they are also very fast to compute. So everything is good.

Now let's look at new kinds of data. For example, social networks, where your task could be advertisement or recommendation. For a social network, if you take two nodes, say user i and user j, and all the others, you see that this is not a grid: the pairwise connections between users do not form a grid, they have a very particular pattern of connections. This is basically a graph. How do you define the graph? You look at the connections between users: if user i and user j are friends, you have a connection. To store this, you use what we call an adjacency matrix, which simply records the connection or non-connection between all pairs of nodes in your social network. And on top of the network, for each user, you have features: messages, images, videos, which form a feature vector in a d-dimensional space.

In neuroscience, in brain analysis, we are really interested in understanding the fundamental relationship between the structure and the function of the brain; they are deeply connected. This is also very important if we want, for example, to predict the different stages of a neurodegenerative disease. For this we need to understand the brain, and the brain is composed of what we call regions of interest. If you take one region of interest, it is not connected to all the other regions in the brain; it is only connected to a few other regions. So again, you can see this has nothing to do with a grid.
These connections between the different regions of the brain can be measured with structural MRI, and you get an adjacency matrix where the entry for region i and region j is a strength of connection, which depends on how many fibers connect region i and region j. On top of this graph, each region i carries activations, a functional activation, which is the time series you can see here, recorded with functional MRI.

The last example I want to show you is quantum chemistry, where the task could be to design new molecules for drugs and materials. Again, the connections between atoms have nothing to do with a grid; it really depends on how the atoms are connected, and that gives you the molecule. The connections between atoms are called bonds, and there are different kinds of bonds: single bonds, double bonds, aromatic bonds. You can also have other edge features, like energies and many other quantities from chemistry. The nodes of the graph are the atoms, and again you may have multiple node features: the atom type (hydrogen, nitrogen, and so on), the 3D coordinates, the charge, and so on.

And the list of graph domains goes on. You have computer graphics with 3D meshes; transportation networks where you want to analyze the density of cars or trains; gene regulatory networks; knowledge graphs; relationships between users and products where you want to do recommendation; scene understanding, where you want to give more common sense to your computer vision system by understanding the relationships between objects; and, for example, the detection of high-energy physics particles, where the detectors are not structured as a regular grid. For all of these, you see there is a common denominator: you can represent all these problems as graphs.

And here is the common setting, I would say the common mathematical setting, for all these problems. A graph, let's call it G, is defined by three entities. The first is the set of vertices; usually you index the vertices from 1 to n, where n is the number of nodes in your graph, so this would be index 1, 2, 3, and so on. Then you have the set of edges, which are the connections between the nodes. And finally you have the adjacency matrix A, which gives you the strength of the connection on each edge.

Then you have graph features. For each node i or j, you can have node features, basically a vector of dimensionality d_v. In the same way, you can have edge features, a vector of dimensionality d_e for each edge. For molecules, for example, the node feature might be the atom type and the edge feature might be the bond type, to give you an example. And finally, you can also have a graph-level feature, a vector of dimensionality d_g for the whole graph; in the case of chemistry that might be the molecule energy. So this is, I would say, the general definition of graphs.
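To make this setting concrete, here is a minimal sketch of how such a graph could be stored with NumPy. The names (A for the adjacency matrix, H for node features, E for edge features) and the toy dimensions are just illustrative choices, not a fixed convention from the lecture.

```python
import numpy as np

# A toy molecule-like graph with n = 4 nodes.
n = 4
edges = [(0, 1), (1, 2), (1, 3)]                  # set of edges

# Adjacency matrix A (n x n): symmetric for an undirected graph;
# the entries could also be real-valued connection strengths.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

d_v, d_e = 5, 3                                    # node / edge feature dimensions
H = np.random.randn(n, d_v)                        # one d_v-dim feature vector per node
E = {e: np.random.randn(d_e) for e in edges}       # one d_e-dim feature vector per edge
g = np.random.randn(7)                             # optional graph-level feature vector
```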
Okay, so now I am going to talk about convolution, and the question of how we extend convolution to graphs. First, let me remind you of the classical convolutional layer used on grids, when we use ConvNets for computer vision. Let's say I have this image, or maybe some hidden feature map at a layer l. I am going to convolve it with some pattern, or kernel, which of course will be learned by backpropagation, and I get some activations: the features at the next layer.

To give you some dimensionalities: n1 and n2 are the numbers of pixels in the x and y directions, and d is the dimensionality of each pixel. If this is a color image, d is three, for the three colors; if this is an intermediate hidden feature map, maybe you have 100 dimensions. For the kernel, you usually take small kernels because you want a local receptive field, so that might be 3 by 3 pixels, or 5 by 5, and of course depth d, to respect the dimensionality of your input features. For this example, you convolve the image with a filter oriented in this direction, so you will basically detect lines in this direction of the image. That was just an example.

And we use padding right now, right? So we keep the same dimensionality? Yes, absolutely, you can use padding, so you do not reduce the size of your image.

Okay, so how do we mathematically define convolution? The first definition is to see convolution as template matching. Here is the mathematical definition: you take your template w, you take your image h, and you sum over the indices j in the whole image domain Omega of w_j times h_{i-j}; this is an inner product between the vector w_j and the vector h_{i-j}. This is the pure definition of convolution. What we usually do in computer vision is take a plus instead of a minus, and then we get the definition of correlation, which is exactly template matching. It does not change anything for learning whether you use i minus j or i plus j, because the only difference is that the kernel is flipped up-down and left-right, and when the kernel is learned this makes no difference. Strictly speaking, though, this is the definition of correlation, and it really is template matching. So from now on I will keep this i plus j notation.
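As a small illustration of convolution written as template matching (with the plus sign, i.e. correlation), here is a minimal NumPy sketch for a single-channel image and a 3 by 3 kernel; it is just the sum above written as an explicit loop, not an efficient implementation.

```python
import numpy as np

def correlate2d(h, w):
    """Template matching: score[i] = sum_j w[j] * h[i + j] for a small kernel w."""
    H, W = h.shape
    kh, kw = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i1 in range(out.shape[0]):
        for i2 in range(out.shape[1]):
            patch = h[i1:i1 + kh, i2:i2 + kw]      # image patch under the template
            out[i1, i2] = np.sum(w * patch)        # scalar product <w, patch>
    return out

h = np.random.randn(8, 8)                          # toy image (or hidden feature map)
w = np.random.randn(3, 3)                          # 3x3 template / kernel
scores = correlate2d(h, w)                         # matching score at each location
```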
Something very important: when we do convolutional layers, we use kernels with a compact support, like 3 by 3, a very small support. When we do that, we do not sum over the whole image domain, we only sum over the neighborhood of pixel i. This matters a lot, because suddenly the sum is not over all pixels, only over the neighborhood, and the complexity of the convolution becomes of the order of the number of nodes, that is, the number of pixels in your image. The complexity is easy to work out: you slide your pattern over the image, which gives n positions, and at each position you compute a scalar product over the 3 by 3 elements, each of dimension d. So the cost is n times 3 times 3 times d, which is linear in n. And everything can be done in parallel on a GPU, because the computation at this location is independent of the computation at that location. So everything has linear complexity. At the end of the day, convolution as template matching just means computing this scalar product between your template and your image patch.

Now, something very important to see, in the case where the grid is the graph, that is, for standard convolution in computer vision. Look at the template here: I give the nodes an ordering j1, j2, j3, and so on up to j9. This node ordering is actually very important, because a given node, say j3, is always positioned at the same location: it is always at the top right corner of the pattern. Why is that important? Let me go to the next slide. When I do the convolution, the template matching, I take my pattern and slide it over the image domain, say to position i, and then to position i prime. When I do the template matching between the kernel and the image, the index j3 will always be matched against the information in the image at the corresponding index. So on a grid, the node ordering, the node positioning, is always the same whatever the position in the image, and when you do the template matching between index j3 and this index in the image, you always compare the same kind of information: you always compare the feature at the top right corner of your pattern with the top right corner of the image patch. These matching scores are computed between the same pieces of information. That is very important.

Now let's look at what happens for graphs. The question is: can we extend this definition of template matching to graphs? And there are two main issues.
The first issue is that on a graph you do not have any ordering of the nodes; there is no given position for the nodes. Say I have this graph template, with four nodes and these connections, and this vertex here. For this vertex, I know nothing about its position; the only thing I know is its index, maybe index number 3. Now, if I want to use the template matching definition, I need to match this index against other indices in the graph domain. So here is my graph, this is node i and these are the neighbors of node i, and this label here carries the same index j3. But how can I match this information against that information, when I do not know whether they correspond to each other? On a graph, you have no ordering of the nodes: you do not know whether this node corresponds to the top right corner or to anything else. There is no notion of up, down, left, or right. So when you match this feature vector against that feature vector, the matching generally has no meaning: you do not know what you are comparing. And again, the index is completely arbitrary; it could be 3 here, but it could just as well be 2 or 12, it carries no useful information. So because you have no ordering of the nodes on a graph, you cannot use the template matching definition directly; we will need to do something else.

The second issue with template matching for graphs is what happens when the number of nodes in your template does not match the number of nodes in the neighborhood in your graph. For example, here I have four nodes in the template and four nodes in the neighborhood, fine, maybe I can find a way to compare the two sets of nodes. But here I have seven nodes; how am I going to compare seven nodes to four nodes? That is also an open issue.

So the first mathematical definition was to use template matching to define convolution. The second definition is to use the convolution theorem. The convolution theorem from Fourier analysis says that the Fourier transform of the convolution of two functions is the point-wise product of their Fourier transforms: the Fourier transform of w convolved with h is the Fourier transform of w, point-wise multiplied by the Fourier transform of h. Then, if you take the inverse Fourier transform, you get back your convolution. So we have a very nice formula to compute the convolution of w and h. The thing is, in the general case, doing the Fourier transform has n squared complexity; we will come back to that. However, if your domain, like the image grid, has a very particular structure, then you can reduce the complexity to n log n by using the fast Fourier transform.
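Here is a tiny NumPy check of the convolution theorem on a regular 1D domain, assuming circular boundary conditions: convolving two signals directly gives the same result, up to numerical error, as multiplying their Fourier transforms point-wise and transforming back, which is the n log n route that the FFT gives us on grids.

```python
import numpy as np

n = 16
w = np.random.randn(n)
h = np.random.randn(n)

# Direct (circular) convolution: O(n^2) in general.
direct = np.array([sum(w[j] * h[(i - j) % n] for j in range(n)) for i in range(n)])

# Convolution theorem: FT(w * h) = FT(w) . FT(h), then inverse FT.  O(n log n) with the FFT.
spectral = np.fft.ifft(np.fft.fft(w) * np.fft.fft(h)).real

print(np.allclose(direct, spectral))   # True
```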
So the question is: can we extend this convolution theorem to graphs? That means asking how we redefine the Fourier transform for graphs, and how we make it fast. Remember that template matching on a grid has linear complexity, so how do we get a fast spectral convolution, in linear time, for compact kernels? That is the open question. Basically, we are going to use these two definitions of convolution to design two classes of graph neural networks: template matching will give us the spatial graph ConvNets, and the convolution theorem will give us the spectral graph ConvNets. And that is the next part, which I am going to talk about now.

Okay, so let's talk about how we do spectral convolution. There is a book that I like very much, the book by Fan Chung on spectral graph theory. It has everything nice: harmonic analysis, graph theory, combinatorial problems, and optimization. I really recommend it to people who want to know a lot more about these questions.

So how do we perform spectral convolution? We are going to use four steps. The first step is to define the graph Laplacian. The second step is to define the Fourier functions. Then we will do the Fourier transform, and finally apply the convolution theorem.

So what is the graph Laplacian? The graph Laplacian is the core operator in spectral graph theory. Remember how we defined a graph: a set of vertices, a set of edges, and the adjacency matrix. If the graph has n vertices, the adjacency matrix is an n by n matrix. We simply define the Laplacian, which is also an n by n matrix, as the identity minus the adjacency matrix, where the adjacency matrix is normalized using the degree of each node: L = I − D^(−1/2) A D^(−1/2), where D is a diagonal matrix whose diagonal entries are the node degrees. This is called the normalized Laplacian, and I would say this is the default definition of the Laplacian that we use for graphs.

So A was the matrix with basically all zeros, and the ones representing the connections between nodes, right? Yes. For Facebook, for example, that would be exactly the definition: if user i is a friend of user j, the adjacency value A_ij is one, and if two users are not friends you get the value zero. But sometimes you have real values for A: for example, for the brain connectivity graph, the value A_ij is the degree of connection between the two regions, basically the number of fibers connecting region i and region j. So it can be binary, and it can also be continuous. And it is also symmetric if the graph is non-oriented? Yes, usually it is symmetric, and you want the symmetry for mathematical reasons. But you can have other choices: this is the normalized Laplacian, but if you take the random-walk Laplacian, it is non-symmetric; it is a different definition of the Laplacian. That is what is interesting about the Laplacian: in the continuous setting there is only one definition, the Laplace-Beltrami operator, but in the discrete setting there are multiple definitions, and you can make your own depending on the assumptions you want to use. I understand, thank you.

Okay, so we can interpret the Laplacian: the Laplacian is nothing else than a measure of the smoothness of a function on a graph.
You can see this as follows: I apply the Laplacian to a function h on the graph and look at what happens at vertex i. If I expand the definition, I get the value h_i minus the mean value of h over the neighborhood of i. So if your signal is smooth, if it does not vary much, this difference will be very small; but if your signal varies a lot, if it oscillates a lot, the difference will be large. The Laplacian is nothing else than a measure of the smoothness of a function on a graph.

All right, now let's define the Fourier functions. Take the Laplacian matrix and do a little bit of linear algebra: the eigendecomposition of the graph Laplacian. When you do the eigendecomposition, you factorize your Laplacian into three matrices, Phi, Lambda, and Phi transpose. The matrix Phi, of size n by n, has the Laplacian eigenvectors as its columns, and the Laplacian eigenvectors are called the Fourier functions, the famous Fourier functions. Of course, this is an orthonormal basis: the inner product of two eigenvectors is one if they are the same and zero if they are different, and Phi is an invertible matrix. The other matrix, Lambda, is the diagonal matrix of the Laplacian eigenvalues, lambda_1 to lambda_n, and we know that for the normalized Laplacian these values are bounded between zero and two; two is the maximum value you can get. The Laplacian eigenvalues are known as the spectrum of the graph: if you take a graph, here with 27 nodes, compute the Laplacian eigenvalues and plot them, you get a signature of the graph, called the spectrum of the graph, which is different for each graph. And here is the decomposition again in its eigenvector form: if you take your Laplacian matrix and apply it to an eigenvector phi_k, you get the eigenvalue lambda_k times the same vector phi_k; this is the definition of the eigendecomposition. So you see that the Fourier functions are nothing else than the Laplacian eigenvectors.

Let me illustrate these Fourier functions. We actually already know them: if you take a 1D grid and compute the Fourier functions, you get phi_0, then phi_1, which is smooth, phi_2, which is a little less smooth, phi_3, and so on. These are the well-known cosine functions and sinusoids, and we use them for image compression: if you project an image onto the Fourier functions, the transform is sparse, so you only keep the largest coefficients and you can compress. This is something we have used for a very long time. For the graph domain, it is quite interesting: here is a graph, and I compute its first Fourier functions. You see that for phi_1 you still have oscillations between positive and negative values, and the same for the next ones. What is interesting is that these oscillations depend on the topology of the graph; they are related to the geometry of the graph, like communities, hubs, and so on.
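A minimal NumPy sketch of these first two steps, assuming an undirected graph given by its adjacency matrix: build the normalized Laplacian, then eigendecompose it to get the Fourier functions (the columns of Phi) and the spectrum (the eigenvalues).

```python
import numpy as np

# Toy undirected graph given by its adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)                                 # node degrees
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt    # normalized Laplacian

lam, Phi = np.linalg.eigh(L)                        # eigenvalues (in [0, 2]) and eigenvectors
# The columns of Phi are the graph Fourier functions; lam is the spectrum of the graph.
```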
And we know this: for example, if you want to capture k communities in a graph, a very good algorithm is to apply k-means to the first k Fourier functions. If you do that, you get what we call spectral graph clustering, and there is a huge literature on it; if you want to know more, there is a very nice tutorial by von Luxburg on spectral clustering that uses all these notions of Fourier functions.

Okay, now let me introduce the Fourier transform. For this, I am going to use the Fourier series. The Fourier series is nothing else than taking a function h defined on your graph and decomposing it over the Fourier functions: I take my function h, I project it onto each Fourier function phi_k, and I get the coefficient of the Fourier series, a scalar, multiplied by the function phi_k, which has size n by 1. Projecting my function onto the Fourier functions gives me the Fourier transform: the Fourier transform is just the vector of coefficients of the Fourier series, nothing else. Then h is a linear combination of the Fourier transform coefficients times the Fourier functions. I can rewrite everything in matrix-vector form, and this quantity, Phi times the Fourier transform, is the inverse Fourier transform.

Let me summarize this. If I project h onto the Fourier functions, I get the Fourier transform: I take the matrix of Fourier functions transposed and multiply by h; this is n by n times n by 1, so n by 1. Now, if I take the inverse Fourier transform of the Fourier transform, I get Phi times Phi transpose times h, and since the basis is orthonormal, Phi Phi transpose is the identity matrix, so I come back to h. The inverse Fourier transform of the Fourier transform is h, obviously.

One thing you can observe is that the Fourier transform and the inverse Fourier transform can each be done in one line of code: you take your vector h and you multiply by this matrix, and the same for the inverse Fourier transform, you take your signal and multiply by that matrix. It is just a linear operation, multiplying a matrix by a vector. And this is how you do the Fourier transform and the inverse Fourier transform on graphs.
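In code, reusing the same toy graph as above, the graph Fourier transform and its inverse really are one matrix-vector product each:

```python
import numpy as np

# Same toy graph and normalized Laplacian as in the previous sketch.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
lam, Phi = np.linalg.eigh(L)

h = np.random.randn(len(A))      # a signal: one value per node

h_hat = Phi.T @ h                # graph Fourier transform: one matrix-vector product
h_back = Phi @ h_hat             # inverse Fourier transform: Phi is orthonormal, Phi Phi^T = I
print(np.allclose(h, h_back))    # True
```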
Okay, now let's do the convolution theorem. Again, the convolution theorem says that the Fourier transform of a convolution is the point-wise product of the Fourier transforms of the two signals. So let's say I have w convolved with h. First I take the Fourier transform of w, which is a vector of size n by 1; then I multiply it point-wise by another vector, the Fourier transform of h. We get these Fourier transforms simply as Phi transpose w and Phi transpose h. Then I apply the inverse Fourier transform to go back to the spatial domain, so I multiply by the matrix Phi, which is n by n. That is what I write here: Phi times w hat, point-wise multiplied by Phi transpose h.

What is this line? Shouldn't there be a Phi transpose before w hat? No, the inverse Fourier transform is Phi. You take Phi and you multiply by the Fourier transform of w, which is Phi transpose w, and which I call w hat; I am going to use that notation a lot, and I will come back to it. And then here you have the Fourier transform of h, which is just Phi transpose h.

Now, this w hat is actually what we call the spectral function, the spectral filter. It is a vector of size n by 1, and I write it here: a vector of n elements, the spectral function evaluated at the eigenvalue lambda_1, which is this point here, then w hat of lambda_2, which is this value here, and so on. Then I rewrite this by putting the vector on a diagonal: diag of this vector, which creates an n by n matrix, and I put it back here. So I replace the point-wise multiplication of two n by 1 vectors by a matrix-vector multiplication, a diagonal matrix containing these values multiplied by this vector; these two lines are exactly the same. Why do I do that? Because I want to get rid of the parentheses, so that I just have matrix multiplications. Then I use a known property: when you apply a function to the eigenvalues, with an orthonormal basis you can move Phi and Phi transpose inside the function. And this quantity, Phi Lambda Phi transpose, is precisely the eigendecomposition of the Laplacian. So what I have is the spectral function applied to the Laplacian operator, which is an n by n matrix, applied to the vector h of size n by 1, which gives an n by 1 vector at the end.

So this is important: if you want to convolve two functions w and h on a graph, you take the spectral function of w, you apply it to the Laplacian, and you multiply by h. This is the definition of spectral convolution. And the thing is, it is very expensive to do in practice. Why is it expensive? Because the matrix Phi is a full matrix: it contains the n Fourier functions, and they are not zero, so it is a dense matrix, and you pay the price of n squared. And you have no FFT, because there is no FFT for general graphs. That is a lot, because n is the number of nodes in your domain: if you have a big graph, for example the web with billions of nodes, then n squared is a huge computation, and you cannot really do it.
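Here is a rough sketch of this spectral convolution, assuming Phi and the eigenvalues come from an eigendecomposition as above, and with an arbitrary low-pass choice for the spectral function w hat; the dense n by n matrix Phi is exactly what makes this O(n squared).

```python
import numpy as np

def spectral_conv(Phi, lam, w_hat, h):
    """Convolution theorem on a graph: w * h = Phi diag(w_hat(lam)) Phi^T h."""
    return Phi @ (np.diag(w_hat(lam)) @ (Phi.T @ h))

# An arbitrary low-pass spectral filter: damp the high graph frequencies,
# e.g. to denoise a signal living on the graph (this w_hat is just an example).
w_hat = lambda lam: np.exp(-5.0 * lam)

# Phi, lam: Laplacian eigenvectors / eigenvalues from the previous sketches;
# h: a signal with one value per node.  Cost is O(n^2) because Phi is dense.
# h_filtered = spectral_conv(Phi, lam, w_hat, h)
```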
Can I summarize? So h is a function defined over every vertex of the graph, right? And w is going to be like a kernel as well, or is it, in this case? W hat is a spectral function: you are working in the frequency space, and in the frequency space you work with this spectral function. For example, if you know a little bit of image processing: if you want to do image denoising, you know that the noise usually lives in the high-frequency part of your signal, so you can design a spectral filter which is zero for the high frequencies and preserves the low frequencies, to preserve your geometry. It is just filtering the frequencies contained in your signal. But the w without the hat would still be a small guy, right? It would be a small filter? Exactly, w without the hat is the spatial filter, the small one: on a grid, w would be the 3 by 3 patch, for example. I see, okay, thanks. Yeah, sure.

In the context of graphs, a small property to know is that you lose shift invariance. On a grid, if you use the convolution theorem to move your function around, say a Gaussian, you do not change the shape of the function. But on a graph, because the structure is irregular, if you move your Gaussian around it will take different shapes. So this is something you lose when you go to graphs. In practice it has essentially no effect, so it is not really important; it is just a mathematical property that you lose when you go to graphs.

There is another question I got here: can you remind us what is actually the overall goal here, the goal of defining these convolutions, or this spectral correspondence, over graphs? Maybe we can remind everyone. Yes, so the goal of the lecture is to define graph convolutional nets, and for that I need to redefine convolution in the case of graphs. There are two ways to define convolution: with template matching, or with spectral graph theory. What I am doing here is redefining convolution using spectral theory, and then I am going to use this definition of convolution to define graph convolutional nets. My goal is just to define convolution on graphs so I can design graph convolutional nets. Yeah, sounds great.

Okay, so the first part was defining spectral convolution; now I am going to use spectral convolution to define GCNs. The first model, what I call the vanilla spectral GCN, was introduced by Yann LeCun and his collaborators, Joan Bruna, Wojciech Zaremba, and Arthur Szlam, in 2014; I think it was at one of the first ICLR conferences. The simple idea was: let's define a graph spectral convolutional layer. We know what a standard convolutional layer is: the activation at the next layer, l plus 1, is a nonlinear activation, for example a ReLU, applied to the spatial filter, the template W at layer l, convolved with the activations H at layer l. So this is written in the spatial domain, the graph domain, and then we replace the convolution by its spectral definition.
And remember what I just defined: doing this convolution in the spectral domain means taking the spectral filter applied to the Laplacian and multiplying by H at layer l. I can decompose this as the Fourier matrix Phi times the spectral function applied to the eigenvalues times Phi transpose H at layer l. And this is my spectral filter: I do not work directly with the spatial filter, I work directly here, with the spectral one. What I am going to learn is this function, w hat of lambda_1 up to w hat of lambda_n: I learn the spectral filter by backpropagation, so I do not need to handcraft it. That was really a great idea, and it was the first spectral technique.

But it has some limitations. The first limitation is that there is no guarantee of spatial localization of the filters. Remember that we want the local receptive field, because it is a very good property for extracting multi-scale features, multi-scale patterns, from the signal, and here we do not have that guarantee. The second issue is the number of parameters: you need to learn n parameters, w hat number 1 to w hat number n, per layer. If the graph is large, like the web or Facebook, that is billions of parameters to learn, and this is for each layer, so it is really huge. And the learning complexity is n squared, because Phi is a dense matrix. So we need to improve this.

So Yann and his collaborators improved two properties. The first question was: how do we get localized spatial filters? What they proposed is that, to get localized spatial filters, you need to compute smooth spectral filters, something very smooth like this. Why do you want smooth spectral filters? Because if you are smooth in the frequency domain, you are localized in the spatial domain. In physics, this is the Heisenberg uncertainty principle, and you can see it with the Parseval identity. Say k is equal to one: then you have the first derivative of the spectral function, and if you want this to be small, you want a smooth function; and for k equal to one, this quantity here is the variance of your spatial filter, so if it is small, your spatial filter has a small, compact support. So if you are smooth in the frequency domain, you are localized in the spatial domain. You need smoothness. So how do you get smoothness for a spectral filter?

You can also think about the transform of the Dirac delta, right? If we have a Dirac delta in the time domain, then in the frequency domain we get a completely flat transform. So that is maybe another way to see it, if someone does not know the Parseval identity. Yeah, exactly.
Right. So how do you get a smooth spectral filter? The idea is that we can simply decompose the spectral filter as a linear combination of smooth kernels. The smooth kernels were chosen to be splines, because splines are nice: they have compact support and they are smooth. So the idea is: let's learn a vector of K coefficients over K smooth kernels, and learn these coefficients by backpropagation. Suddenly everything is nice, because you have localization in space, and the number of parameters to learn is K. For example, K equals nine: remember that a standard 3 by 3 convolution kernel has nine parameters, so it can be the same; you learn a combination of nine spline functions, and that's it. So you have a constant number of parameters to learn per layer, which is nice. But we still have the Phi matrix, so the learning complexity is still quadratic.

So the question is: how do we learn in linear time, linear with respect to the graph size n? The quadratic complexity comes directly from the use of the Laplacian eigenvectors. The thing that is annoying in this spectral convolution is not the diagonal matrix, and not the vector; it is Phi, because it is a full, dense matrix with n squared elements, and that is the price we have to pay. So we know that if we want to avoid the quadratic complexity, we need to avoid the eigendecomposition.

And we can avoid the eigendecomposition by directly learning a function of the Laplacian. This is what we proposed in 2016: the spectral function is simply a monomial function of the Laplacian, a sum of parameters w_k, learned by backpropagation, times the Laplacian to the power k. When we do that, the first good thing is that the filters are exactly localized in a k-hop support: if you have the Laplacian to the power k, the spatial filters are exactly supported on the k-hop neighborhood. What is the one-hop neighborhood? Say you have this graph and I put a heat source here, so the value is one at this node and zero at all other nodes. If I apply the Laplacian to this heat source, the support of the signal grows by one hop: every node that can be reached with one jump. If you do two jumps, you reach the two-hop neighborhood, the orange nodes here; so if you apply the Laplacian twice, that is the support, and if you apply the Laplacian k times, you get the k-hop support. So you control exactly the size of your spatial filters. That was the first point.

The second point: let me show you that you get linear learning complexity. Again, you have your convolution of w with h, you have your spectral convolution definition, and I am using monomials of the Laplacian as the spectral function.
Then I replace the Laplacian to the power k times the vector h by a vector x_k, where x_k is given by a recursive equation, and recursion is always good, right? The recursion is x_k equals the Laplacian times x_{k-1}, and x_0 is simply the original function h. When I do that, you see that the sequence x_k is generated by multiplying a matrix, the Laplacian, by the vector x_{k-1}. The complexity of doing that is the number of edges, and you do it K times, so the cost is K times the number of edges. And the thing is, real-world graphs are basically all sparse, because sparsity is structure. Remember the web: it has billions of web pages, but each page is connected on average to about 50 other pages, and 50 compared to a billion is nothing. The same for the brain, which is highly sparse, and the same for transportation networks: every natural graph is usually sparse, because sparsity is structure. So the number of edges is some constant times n, and at the end of the day you have linear complexity for sparse, real-world graphs.

And you see that here I use the Laplacian, but I never do any eigendecomposition of the Laplacian. There is a bit of confusion I sometimes see: I call this a spectral GCN, but the name may be misleading, because I do not perform any spectral operations; I never use the eigendecomposition of the Laplacian, I never touch the eigenvectors or the eigenvalues. At the end of the day, even if we used spectral theory to define this GCN, the computations are all done in the spatial domain using the Laplacian. So even if we call it a spectral GCN, in practice we never use the spectral decomposition. That was just one comment.

The last comment I want to make is that graph convolutional layers are, again, just linear operations: you multiply a matrix by a vector, so this is GPU friendly. The issue is that here you are doing sparse linear algebra, and existing GPUs are not optimized for that. I think this is one of the limitations of graph neural networks today: we need specialized hardware for graph neural networks, hardware that adapts to the sparsity of these operations, and we do not have it today. If we want to go very far with graph neural networks, we need this specialized hardware. What about TPUs, do you know whether TPUs can handle this? That's the same: they are optimized for full, dense linear operations, full matrices; if you want to do sparse linear algebra, you need specialized hardware for that. Gotcha, thanks.
So how do we implement this? We have a signal, a function defined on the graph, where n is the number of vertices and d is the dimensionality of the features, so each node carries a feature vector of dimension d. We have the vectors x_k, and we are just going to reshape things so that everything becomes plain linear operations. The x_k are arranged in a matrix X bar of size K times nd: each x_k is reshaped to 1 times nd, and stacking K of them gives K times nd. Then we multiply by the vector of parameters that we learn by backpropagation, which has size K; that operation gives 1 times nd, and you reshape back to n times d. This is how I implemented it, in PyTorch or TensorFlow it would be the same, and this is how you do this spectral convolution.

So again, the properties: the filters are exactly localized, you have a constant number of parameters to learn, the K parameters learned by backpropagation, and the learning complexity is linear. The thing that is not good is that I am using the monomial basis here, the Laplacian to the power zero, one, two, three, and so on, and monomial bases are unstable for optimization, because the basis is not orthogonal: if you change one coefficient, you change the whole approximation of your function. You need orthogonality if you want to learn with stability. So you can use your favorite orthonormal basis, but your favorite orthonormal basis must have a recursive equation; that is the only requirement, because the recursion is the key to the linear complexity. So we used Chebyshev polynomials, which are very well known in signal processing. We approximate the spectral convolution with Chebyshev functions applied to h, which again can be represented by vectors x_k, where x_k is given by a recursive equation. It is a little more complex than before, but in practice it is again just a multiplication of the Laplacian by a vector, so at the end of the day the complexity is still linear, nothing changes, and this time you have stability during the learning process.

As a sanity check, we tested on MNIST. For MNIST the graph is the standard grid, built as a k-nearest-neighbor graph, and you see that you get linear complexity with respect to the number of vertices. For the accuracy, we get around 99 percent, comparable to a standard LeNet-5. So ChebNet is basically a ConvNet for arbitrary graphs, with the same linear learning complexity. Of course, the complexity constant is much larger than for a standard ConvNet, something like 20 or 30 times, so it is much slower, but you get a ConvNet for any arbitrary graph.
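Here is a rough PyTorch sketch of this idea, written with the simpler monomial basis for clarity (ChebNet swaps in the Chebyshev recursion, but the reshaping trick is the same). The only graph operation is a sparse Laplacian times a dense feature matrix, so the cost stays linear in the number of edges; the names and the toy Laplacian below are illustrative, not the exact code of the paper.

```python
import torch

def graph_conv_monomial(L, H, w):
    """
    Spectral graph conv with monomials of the Laplacian, computed spatially:
      out = sum_k w[k] * L^k H,  using the recursion X_k = L X_{k-1}, X_0 = H.
    L: sparse (n, n) Laplacian, H: (n, d) node features, w: (K,) learned coefficients.
    """
    n, d = H.shape
    Xk = H
    Xs = [Xk.reshape(1, n * d)]                   # each X_k flattened to (1, n*d)
    for _ in range(1, w.shape[0]):
        Xk = torch.sparse.mm(L, Xk)               # O(|E| * d): sparse matrix times dense matrix
        Xs.append(Xk.reshape(1, n * d))
    Xbar = torch.cat(Xs, dim=0)                   # (K, n*d), the reshaping trick from the lecture
    return (w @ Xbar).reshape(n, d)               # (1, n*d) -> back to (n, d)

# Toy usage; the sparse Laplacian entries here are placeholders.
n, d, K = 5, 3, 4
idx = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
L = torch.sparse_coo_tensor(idx, torch.randn(idx.shape[1]), (n, n))
H = torch.randn(n, d)
w = torch.randn(K, requires_grad=True)            # K coefficients learned by backpropagation
out = graph_conv_monomial(L, H, w)                # same shape as the input signal H
```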
Another limitation is that this is an isotropic model. So let me talk a little bit about isotropy versus anisotropy. If you look at a standard ConvNet, it produces anisotropic filters, like this one: you see that this filter is anisotropic, it is oriented in a particular direction. We can get anisotropic filters with standard ConvNets because we are using a grid, and on a grid we have directions: we know where up, down, left, and right are, or, remember, we know the ordering of the nodes on a grid. This is different for graphs: we have no notion of direction, we do not know where up, down, left, or right is. So the only thing we can compute at this point is isotropic filters. Isotropic means that the value of the filter is the same in all directions, the same over circles of the same radius. That is all we can get with ChebNet, because we have no notion of direction on arbitrary graphs. I will come back to isotropy versus anisotropy a bit later.

What we also did, very quickly, because, wow, I need to speed up a little bit, is to extend this spectral convolution from one graph to multiple graphs. You can do that; it is like extending from 1D signal processing to 2D image processing, and the extension is mathematically straightforward. We did it, for example, for recommender systems, because there you have a graph of users and a graph of movies. And, as I said before, you can also use your favorite orthogonal polynomial basis: we used CayleyNets, because ChebNet is unstable at localizing frequency bands of interest, which are basically the graph communities, so CayleyNet uses more powerful spectral functions.

Okay, so now let me move to the other class of graph ConvNets, which I call spatial graph ConvNets. For this class, I go back to the template matching definition of convolution: how do we do template matching for graphs? Remember that the main issue with template matching on graphs is that you have no node ordering or positioning for your template. The only thing you have is the index of each node, and the index is not enough to match information between nodes. So how can we design template matching that is invariant to node re-parametrization? You have a graph, the index of this node is, say, six, but that is completely arbitrary; it could just as well be 122. I want to be able to do template matching independently of the index of the node. How do I do that? The simplest thing you can do is to have only one template vector for the matching: you do not have w_j1, w_j2, w_j3, you just have one vector w, and you match this vector against all the other features on your graph. This is the simplest template matching you can do that is invariant to node re-parametrization, and this property is actually used by most graph neural networks today. So here is the mathematical definition: I take the inner product between the template vector w at layer l, which is of size d by 1,
and the feature vector at node j, which also has dimensionality d by 1, and I get a scalar. This is for a single output feature; of course you will want more features, so instead of a vector d by 1 you use a matrix d by d, and this way you get d output features for each node i. So this is the representation at node i, and I can put everything in matrix-vector form: the activation at layer l plus 1 is defined on the graph of n vertices and has d dimensions, and it can be written as the adjacency matrix A, which is n by n, times the activation at layer l, which is n by d, times the template that I learn by backpropagation, which is d by d. The product is again n by d.

Based on this template matching for graphs, I am now going to define two classes of spatial GCNs: isotropic GCNs and anisotropic GCNs. Let's start with the isotropic GCNs, which actually have quite some history. The simplest formulation of spatial GCNs was introduced by Scarselli and his co-authors in 2009, before the deep learning revolution, and then more recently by Thomas Kipf and Max Welling, and also by Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus, in 2016. This is what I call the vanilla graph ConvNet. It is exactly the same definition as before, except that I include the degree matrix so that I take the mean value over the neighborhood; otherwise it is exactly the equation I used before.

And you see that this equation can handle the absence of node ordering: it is completely invariant to node re-parametrization. Again, if this index is six and I change it to 122, it changes nothing in the computation of h at the next layer. It can also deal with neighborhoods of different sizes: whether the neighborhood has four nodes, ten nodes, or a hundred nodes, nothing changes. You have the local receptive field by design with graph neural networks: you just look at the neighbors, and that's it, it is given to you. You have weight sharing, meaning that for all features you use the same W whatever the position on the graph; this is the convolution property. The formulation is also independent of the graph size, because all operations are done locally: you only use local information to compute the next layer, so the graph can have ten nodes or ten billion nodes, it does not matter, and everything can be done in parallel. But this is limited to isotropic filters: the W is the same for all neighbors, so it is an isotropic model, giving the same weight to all neighbors.

At the end of the day, this model can be represented by this figure: the activation at the next layer is a function of the activation at the current layer at the node i itself and over the neighborhood of node i. The only thing we will change, basically, is the instantiation of this function: you get the whole family of graph neural networks just by choosing a different function here, but everything is based on this equation, with the central node and its neighborhood determining the activation at the next layer.
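A minimal PyTorch sketch of this vanilla, isotropic graph conv layer, with mean aggregation over the neighborhood, one shared weight matrix, and a ReLU; it is written with a dense adjacency matrix for readability, and the exact normalization varies between papers.

```python
import torch
import torch.nn as nn

class VanillaGCNLayer(nn.Module):
    """h_i^{l+1} = ReLU( W^l * mean_{j in N(i)} h_j^l )  -- isotropic: same W for all neighbors."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)     # template, learned by backpropagation

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)   # node degrees
        H_mean = (A @ H) / deg                          # mean of the neighbors' features
        return torch.relu(self.W(H_mean))

# Toy usage on a 4-node graph with 8-dimensional node features.
A = torch.tensor([[0., 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
H = torch.randn(4, 8)
layer = VanillaGCNLayer(8, 16)
H_next = layer(A, H)    # (4, 16): invariant to any re-indexing of the nodes
```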
I am running out of time, so I will not spend too long on this, but what you can show is that this vanilla GCN I just presented is actually a simplification of ChebNet: if you truncate the ChebNet expansion to the first two Chebyshev polynomials, you end up with the same equation. So that is the relationship between the two.

One interesting GCN is GraphSage, which was introduced by William Hamilton, Rex Ying and Jure Leskovec. Let's go back to the vanilla GCN and suppose that the adjacency matrix has the value one on the edges. In that equation, the central vertex i and its neighborhood are treated with the same template weight. But I can differentiate them: I can have one template W_1 for the central node and another template W_2 for the one-hop neighborhood. By doing that, you already improve the performance of your graph neural network a lot; you go from the first formulation to this one, with one template for the central node and one for the neighborhood. But this is still an isotropic GCN, because you are treating all the neighbors with the same weight. Here the aggregation is the mean, but you can take the sum, the max, or something more elaborate like an LSTM.
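Here is a hedged sketch of that GraphSage-style variant, with one template for the central node and one for the mean of its neighbors; a simplified illustration under the same dense-adjacency assumption as above, not the authors' reference code:

```python
import torch
import torch.nn as nn

class SageLikeLayer(nn.Module):
    """h_i^{l+1} = ReLU( W1 h_i^l + W2 * mean_{j in N(i)} h_j^l ), still isotropic."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim, bias=False)  # template for the central node
        self.W2 = nn.Linear(in_dim, out_dim, bias=False)  # template for the neighborhood

    def forward(self, A, H):
        # A: (n, n) adjacency with value 1 on the edges, H: (n, d) node features
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        H_neigh = (A @ H) / deg            # mean aggregator; sum, max or an LSTM also work
        return torch.relu(self.W1(H) + self.W2(H_neigh))
```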
Now, more recently, people have tried to improve the theoretical understanding of GCNs. The Graph Isomorphism Network, GIN, was introduced by Jure Leskovec and co-authors in 2018. The idea is: can we design an architecture that can differentiate graphs that are not isomorphic? Isomorphism is basically a measure of equivalence between graphs. These two graphs are isomorphic to each other, and of course you want to treat them the same way; but if two graphs are not isomorphic, you want to treat them differently. So there is a graph neural network built on this definition, but it is still an isotropic GCN.

So now I am going to talk about anisotropic GCNs. I go back to what I said before: a standard ConvNet can produce anisotropic filters because there is a notion of direction on grids, so you can have a filter oriented along a given direction. GCNs like ChebNet, CayleyNet, the vanilla GCN, GraphSage and GIN compute isotropic filters: the filters you learn during training are isotropic. But we know that anisotropy is very powerful, so how do we get anisotropy back into graph neural networks? You can get it naturally if you have edge features: for example, in chemistry the bond features of a molecule can be different, they can be single, double or aromatic bonds, so naturally you get an anisotropic GCN. And if we want to design a mechanism for anisotropy, we want this mechanism to be invariant with respect to node re-parameterization.

To do that, we can use, for example, edge degrees, as proposed in MoNet, edge gates, as we proposed in GatedGCN, or an attention mechanism, as in GAT. The idea is what I put here as an illustration. Here you are treating all your neighbors in the same way, with the same template, but you want to treat them differently: if this neighbor is j1, you want a different weight than if it were j2. Why would you want that? For example, if you analyze graphs, you know that you have different communities of people; in politics, say, Republicans and Democrats, and you don't want to apply the same analysis to different groups of people. So anisotropy is quite important for graphs.

The first model to deal with anisotropy was MoNet, introduced by Federico Monti, Michael Bronstein and their co-authors. The idea was to use a GMM, a Gaussian mixture model: they have k Gaussian kernels, and they learn the parameters of the mixture by using the degrees of the graph. Then there was GAT, developed by Petar Veličković, Yoshua Bengio and their co-authors, which uses the attention mechanism developed by Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio to introduce anisotropy in the neighborhood aggregation function. This is what you see here: you concatenate the heads of a multi-head architecture, and the weights are a softmax over the neighborhood, so some neighbors will be more important than others, as determined by the softmax. With Thomas Laurent, in 2017, we used a simple edge-gating mechanism, which is a sort of soft attention process compared to the sparse attention mechanism just mentioned. We also used edge features explicitly, and we recently discovered that this is very important for edge prediction tasks: if you have an explicit edge prediction task, it is important to keep explicit edge features. So this is the model that we used.
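To give a feel for how anisotropy can enter the update, here is a simplified, dense sketch of an edge-gating idea in the spirit of what was just described: each edge (i, j) gets a learned gate in [0, 1] that rescales the contribution of neighbor j. This is my own illustration, not the exact GatedGCN or GAT equations:

```python
import torch
import torch.nn as nn

class EdgeGatedLayer(nn.Module):
    """Anisotropic update: each neighbor j of node i is weighted by a learned gate eta_ij."""
    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)   # template for the central node
        self.W2 = nn.Linear(dim, dim, bias=False)   # template for the gated neighborhood
        self.gate = nn.Linear(2 * dim, 1)           # gate computed from the pair (h_i, h_j)

    def forward(self, A, H):
        n = H.size(0)
        Hi = H.unsqueeze(1).expand(n, n, -1)         # Hi[i, j] = h_i
        Hj = H.unsqueeze(0).expand(n, n, -1)         # Hj[i, j] = h_j
        eta = torch.sigmoid(self.gate(torch.cat([Hi, Hj], dim=-1))).squeeze(-1) * A
        agg = (eta.unsqueeze(-1) * Hj).sum(dim=1) / eta.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.W1(H) + self.W2(agg))
```

Because the gate depends on the pair (h_i, h_j), different neighbors of the same node can now receive different weights, which is exactly what the isotropic models above cannot do.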
Now, if I take the transformer and write down the equation of its graph version, this is what we get. You recognize the value, the query, the key and the softmax, but the softmax is computed over the one-hop neighborhood. And here I want to make a connection with the transformer of Vaswani and his collaborators. What is a transformer? A standard transformer is actually a special case of graph convolutional nets when the graph is fully connected. In a fully connected graph, any node i is connected to all the other nodes of the graph, including itself. So if you look at the equation I just wrote, and the neighborhood is this time not the one-hop neighborhood but the whole graph, then you get the standard equation that anyone doing NLP with transformers will recognize directly. We saw this last week, exactly, so that is a nice connection, and a nice transition. You have the concatenation for the multi-head, you have the softmax, the query, the key and the value, and then the weights of the multi-head; the only thing I change mathematically is that the neighborhood now uses all the connections.

And when I do that, there is a question: what does it mean to do graph convolutional nets on fully connected graphs? I think in this case it becomes less useful to talk about graphs, because when each data point is connected to every other data point, you no longer have any specific graph structure. What is really interesting with graphs is the sparsity structure, like the brain connectivity or the social networks; what is interesting is not that everything is connected to everything else, it is that you have sparse connections between the nodes. So in this case it would be better to talk about sets than about graphs, and we know that transformers are set neural networks. In some sense, instead of looking at a fully connected graph with features, what we should look at is a set of features, and transformers are really good at processing sets of feature vectors.

There is a lab that I put here. The lab is based on the GCN model I proposed and uses DGL, the Deep Graph Library, developed at NYU Shanghai by Professor Zheng Zhang. This is the link to the lab: if you click on it, you will go directly to the notebook, which runs on Google Colab, so you just need a Gmail account to access it and you will be able to run it on the Google cloud. I only put the most interesting functions that you need to develop a GCN, with some comments on the code, and it also helps to understand how DGL works; tomorrow the class will go over everything, including how DGL works.

So let me now move towards the end and talk a little bit about benchmarking graph neural networks. Recently we wrote a paper on benchmarking graph neural networks. Why did we do this benchmark? Because if you look at most published GCN papers, most of the work uses small datasets, like Cora or the TU datasets, and only one task, classification. And when I started running experiments on them, I realized that whether you use a GCN or no GCN at all, you get statistically the same performance, because the standard deviation is very high on these small datasets. So we cannot identify good GCNs this way; we need something else. Also, there have recently been new theoretical developments for GCNs, and the question is how good they are in practice. It is important to have good mathematical justifications for GCNs, but we also need to be able to show that they are useful in practice. And I think benchmarking is essential to make progress in many fields, like deep learning with ImageNet by Fei-Fei Li.
But what I observe is that people are not always willing to give credit to benchmarks. Anyway, we introduced this open benchmark infrastructure. It is on GitHub, and it is based on PyTorch and DGL. We introduced six new medium-scale datasets for the four fundamental graph problems: graph classification, graph regression, node classification and edge classification. I think that if you cover these four fundamental graph problems, you already know quite a lot about the performance of your GCN.

Can you spend a few more words on these four fundamental graph problems? I think we haven't mentioned them so far.

Yes. What I presented is basically the first part of any convolutional net: how do you extract powerful features. The rest is quite easy: if you want to do regression, you just use an MLP, and if you want to do classification, you use an MLP with a cross-entropy loss. I could take more time on this, but I think what I presented is more interesting; if you give me another hour, we could go through it.

I was making the point that I think I understand now how we can build a representation of a graph, so you have a feature per node, but how would you go from this feature per node to the final task? Maybe we can mention this.

Sure. So, for example, you extract a convolutional feature per node, and suppose you want to do graph classification. You apply some aggregation function to these node features; the most common one is the average, so you take the average of all the node features, and on top of that you use an MLP, and that classifies your graph.

And would this be for graphs that always have the same structure, or for different structures, like different numbers of nodes?

If you use the mean, it is completely independent of the number of nodes, and the same holds for the sum or the max; you have many operators that are independent of the number of nodes.
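As a small illustration of that readout step (illustrative names, assuming the node features already come out of the GNN layers):

```python
import torch
import torch.nn as nn

class GraphClassifier(nn.Module):
    """Readout: average all node features, then classify the whole graph with an MLP."""
    def __init__(self, node_dim, hidden_dim, num_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(node_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, H):
        # H: (n, d) node features from the GNN layers; n can differ from graph to graph
        h_graph = H.mean(dim=0)        # mean readout, independent of the number of nodes
        return self.mlp(h_graph)       # class scores for the graph

model = GraphClassifier(16, 32, 10)
print(model(torch.randn(7, 16)).shape)   # a 7-node graph  -> torch.Size([10])
print(model(torch.randn(50, 16)).shape)  # a 50-node graph -> torch.Size([10])
```

Replacing the mean by a sum or a max keeps the same property of being independent of the graph size.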
And these medium-size datasets are actually enough to statistically separate the performance of graph neural networks. We also make it easy for new users to add new graph models and new datasets, and this is the link to the repo.

So let me now explain the graph neural network pipeline. A standard graph neural network pipeline is composed of three parts. The first part is an input layer, which makes an embedding of the input node and edge features. Then you have a series of graph neural network layers. Finally, you have a task layer, which is the prediction layer for graph-level, node-level and edge-level tasks. Let me describe each of these three parts in detail.

For the input layer, we have the input node and edge features, which come from the application: for a recommender system, for example, they would be features of your products. What you do is take these raw features and apply a linear embedding, which gives you a vector of d dimensions per node. We can do the same if we have edge features: we embed the input edge features and get a d-dimensional vector per edge. So the output of the embedding layer is, for the nodes, a matrix of n nodes by d feature dimensions, and, for the edges, a matrix of the number of edges by the number of feature dimensions.

This output of the embedding layer is the input of the graph neural network layers. There we apply our favorite graph neural network layer L times: the node and edge representations at layer l go through the GNN layer and give the representations h and e at the next layer, and we repeat this L times. This gives us the output of the graph neural network layers, which is again a matrix of n nodes by d dimensions for the nodes, and a matrix of E edges by d dimensions for the edges.

Finally, there is the task-based layer. If we are doing a prediction at the graph level, we take the output of the graph neural network layers and compute the mean over all the nodes of the graph; this gives us a d-dimensional representation of the graph. We pass it through an MLP, a multilayer perceptron, which gives us a score: a scalar if we are doing graph regression, like chemical property estimation, or class scores if we are classifying molecules into classes. We can also do node-level prediction: we take the node representation at the output of the graph neural network, give it to an MLP, and we get a score for node i, which can be a scalar for regression or a k-dimensional vector for classification. And we can do edge-level prediction: for a link between node i and node j, we concatenate the graph neural network representations of node i and node j, give that to an MLP, and again we get a score for the link, for regression or for classification.
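For the edge-level head just described, here is a minimal sketch along the same lines (again with illustrative names), scoring a link from the concatenation of its two node representations:

```python
import torch
import torch.nn as nn

class EdgePredictor(nn.Module):
    """Score a link (i, j) from the concatenation of the two node representations."""
    def __init__(self, node_dim, hidden_dim, out_dim=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * node_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),  # out_dim = 1 for regression, k for classification
        )

    def forward(self, H, src, dst):
        # H: (n, d) node features; src, dst: (m,) endpoint indices of the edges to score
        pair = torch.cat([H[src], H[dst]], dim=-1)   # (m, 2d)
        return self.mlp(pair)                        # (m, out_dim)

H = torch.randn(5, 16)                               # 5 nodes out of the GNN layers
src, dst = torch.tensor([0, 1, 3]), torch.tensor([2, 4, 0])
print(EdgePredictor(16, 32, 2)(H, src, dst).shape)   # torch.Size([3, 2])
```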
Okay, so quickly, because I am running out of time, the tasks. First the graph regression task; this is for molecules, where we want to predict the molecular solubility. Here you have the table: this is a graph-agnostic GCN, which does not use any graph structure, and the lower the score the better; these are isotropic GCNs, and these are anisotropic GCNs. For most experiments you will see that anisotropic GCNs do a better job than isotropic GCNs, because they use directional properties. So this is for graph regression. This is for graph classification: you have superpixel nodes of images, and you want to classify the image into one of the classes. You also have edge classification: here it is the combinatorial optimization problem of TSP, the travelling salesman problem. You have a graph, and you want to know whether an edge belongs to the solution: if it belongs to the solution, it is class one, and if it does not, it is class zero. And we see that here you need explicit edge features: the only model that does a better job than the naive heuristic is the one using explicit edge features.

I am using this combinatorial example to make a workshop announcement. Next year we are organizing, with Yann, and also Peter Battaglia, Stefanie Jegelka, Andrea Lodi, Stan Osher, Oriol Vinyals and Max Welling, a workshop on combining deep learning and combinatorial optimization, which I think is a very interesting direction of research.

Okay, conclusion. We generalized ConvNets to data on graphs. For this, we needed to redesign the convolution operator on graphs. We did it with template matching, which leads to the class of spatial GCNs, and we did it with spectral convolution, which leads to the class of spectral GCNs. We have linear complexity for real-world sparse graphs. We have GPU implementations, although they are not yet optimized for the GPUs we have today. We have universal learning capacity, which is the subject of recent theoretical works. And we can do this for multiple graphs and also for graphs that change dynamically.

Applications: I am happy that I no longer need to justify to anybody why we are doing graph convolutional nets. There are more and more applications; at this week's ICLR conference, the keyword with the largest growth was graph neural networks, and you now have workshops and tutorials on graph neural networks at many of the top deep learning and AI conferences. We organized probably the first tutorials on graph deep learning, at NeurIPS in 2017 and at CVPR. And if you want more material, we organized an IPAM workshop in 2018 and a follow-up in 2019, and the video talks are available if you want to know more.

I would like to thank my collaborators: Yoshua Bengio, Michael Bronstein, Federico Monti, Chaitanya Joshi, Vijay Dwivedi, Thomas Laurent, Arthur Szlam, Ron Levie, Michaël Defferrard, Pierre Vandergheynst and Pat Kalman. Thank you.

Thank you, it was really impressive. I think everyone here was stunned by the quality of the slides and of your explanations; we really enjoyed it, and I am getting so many private messages, everyone is very excited. I actually have a few questions, if you have some time left. We haven't talked about generative models. Do you have any words about how we can, for example, generate new proteins, say for figuring out whether we can find a cure for this COVID right now? A current question for the current world.

Yes, absolutely. The community is also working on graph generative models. You have two directions. The first direction is to do it in a recursive way: you create your molecule atom after atom. You start with an atom, then you have a candidate for the next atom and also for the next bond between the two atoms.
And you can do that in a kind of LSTM style. The second direction is to do it in one shot: you need a network that can predict the size of your molecule and then what the connections are. So you have these two directions, recursive or one shot; they are different, and the community is more interested in the recursive way today. I have a paper on the one-shot approach, and basically they perform about the same, I don't see a big difference. The main challenge is that your molecules can have different sizes, so how do you deal with different sizes? We have different options to do that.

I see.

One thing that is very interesting, related to chemistry, is that graph neural networks are in some sense too flexible. When you use a standard ConvNet, the grid is very structured, so you can get a lot of information from the structure of the grid, but you don't have this with graphs: again, you lose the node ordering and everything. So we need to find a way to put more and more structure back into graph neural networks. One way to do that is through the architecture: for example, in chemistry you would like to combine the Schrödinger equation, the Hamiltonian energy, with the network, and people are doing that to better constrain their graph neural networks. So graph neural networks are in some sense too flexible, and you need to find a way to add more universal constraints.

Actually, about universal constraints, I have a question here: what do you mean by universal learning capacity?

Yes, so this refers to recent work on graph neural networks. In some sense, people are trying to classify neural networks: there are many publications on graph neural networks, so how do you classify them? You need to find mathematical properties, like isotropic versus anisotropic behavior, and more recently there has been theoretical work on isomorphism and on the expressivity of graph neural networks with respect to certain classes of graphs. Graph theory started with Euler, more than 200 years ago, so we know a lot about graphs, and we want to classify graphs according to mathematical properties. What I was trying to say is that you can design graph neural networks to satisfy some special mathematical properties.

I see, thank you. Guys, feel free to ask questions; you can also write to me if you are too shy. I mean, I am not shy, I can just read them.

I have a question, and thank you so much for this great lecture. You mentioned that you created benchmark datasets so that people can benchmark their different graph neural networks. But I feel like a lot of those networks also learn some representation of the graph, and a lot of downstream tasks could be in an unsupervised setting, whereas in the benchmarking datasets you are always using accuracy, more or less, so you have ground-truth labels and it is in the supervised setting.
So do you have any thoughts on how we could benchmark the performance of graph networks in an unsupervised or semi-supervised setting, or by measuring their performance on some common downstream tasks or applications? I would like to hear your thoughts on that. Thank you.

I think this is one of Yann's favorite topics, self-supervised learning.

Yes, that's right. As you can tell, I brainwashed the students in the class pretty well.

Yes, that's why I'm asking.

No, of course. One important question is that you want to learn efficiently: you don't want to need too many labels to be able to predict well, and self-supervised learning is one way to do that. You can do it with graphs too: you can hide some part of the information of your graph and then predict this hidden information to learn a representation. It is hard for me to follow all the recent GCN work, but I guess that if you Google it, there are probably already one or two papers on this idea. There is nothing special about GCNs here; you can apply the same self-supervised learning ideas to GCNs. We have not put that in the benchmark yet; it is a good idea, and it is something we could do.

Actually, arguably, all of self-supervised learning exploits some sort of graph structure. When you do self-supervised learning on text, for example, you take a sequence of words and you learn to predict a word that is missing in the middle, whatever it is. There is a graph structure there, and that graph structure is how many times a word appears at some distance from another word. Imagine you take all the words and, within some context, you make a graph between words: a very simplified version would be a graph that indicates how many times this word appears at distance three from that other word, then another graph for distance one, another for distance two, et cetera. That constitutes a graph, and it is a graph that indicates in which contexts two words appear together. You can also think of a text as basically a linear graph, and the neighborhood you take when you train a transformer is basically a neighborhood in this graph. And when you do metric learning, the type of thing that Ishan Misra talked about, using contrastive training where you have two samples you know are similar and two samples you know are dissimilar, that is basically a graph as well: it is a similarity graph. You are telling the system, here are two samples that are linked because I know they are similar, and here are two samples that are not linked because I know they are dissimilar, and I am trying to find a graph embedding, essentially. You can think of those neural nets as learning a graph embedding for the nodes, so that nodes that are linked in the graph have similar vectors and nodes that are not linked have dissimilar vectors. So there is a very, very strong connection between self-supervised learning and this graph view of a training set. I don't think it has been exploited, or even realized, by a lot of people yet, so there might be really interesting things to do there. I don't know what you think about this.

Yes, exactly. This is completely related to the fact that on a graph you don't have any node positioning, and what you are saying is exactly that.
So how do we get a positioning between nodes that is relevant to your particular application? And you want to do it in a self-supervised way, because then you will learn all possible configurations and you don't need labels to do it. If you know how to compare nodes, that is, how to extract positional encodings, then you will do a great job. That is one of the most important questions in graph neural networks, and also for NLP and many other applications.

Great, thank you. A question just arrived here: could you possibly highlight the most important parts of graph attention? I think we maybe went a little fast and someone got a bit lost.

Yes, graph attention networks. The first technique was developed by Petar Veličković, Yoshua Bengio and their co-authors, so that is probably the first work you would like to look at. But you can also take the standard transformer and make a graph version of it; it is quite straightforward to do.

Just by multiplying with the adjacency matrix, right?

Yes, exactly. So you can already do it with the PyTorch transformer: there is a mask.

Exactly, with minus infinity.

Yes, a mask: if you put minus infinity into the softmax, you get zero. So you can already build a graph transformer very easily with PyTorch. The thing is, it is going to be a full matrix, so it is going to use a lot of your GPU memory, because there are many values that you do not need. If you want to scale to larger graphs, then you need something that exploits the sparsity, like DGL or PyTorch Geometric, for example.

Yes, last week we coded the transformer from scratch, so we actually saw all the operations inside, and maybe we can just add one additional matrix there, for this masked part, such that we can retrieve the graph convolutional net from the code we have already written. That would be, I think, a connection for tomorrow.
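To make that connection concrete, here is a minimal single-head sketch (my own, not the course notebook) of attention restricted to a graph by putting minus infinity on the non-edges before the softmax; with an all-ones adjacency matrix it reduces to standard self-attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head attention where each node only attends to its graph neighbors."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, A, H):
        # A: (n, n) adjacency matrix with self-loops, H: (n, d) node features
        scores = self.q(H) @ self.k(H).T / H.size(-1) ** 0.5   # (n, n) attention logits
        scores = scores.masked_fill(A == 0, float('-inf'))     # -inf on non-edges, softmax gives 0
        attn = F.softmax(scores, dim=-1)                       # softmax over the neighborhood only
        return attn @ self.v(H)                                # weighted sum of the values

# with A = torch.ones(n, n), i.e. a fully connected graph, this is standard self-attention
```

The dense mask is fine for small graphs, but it stores the full n by n matrix; for larger graphs a sparse implementation such as DGL or PyTorch Geometric avoids materializing the values that are never used.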
And hold on, there are more questions coming. Is there any application where using ChebNet might be better than a spatial GCN?

So ChebNet is part of the class that I call isotropic GCNs. It will of course depend on your data and on your task. If you have a task where your data is isotropic, then ChebNet will do a very good job for sure. But if you have a task where anisotropy is important, for example social networks, where you do not want to treat all the neighbors in the same way, then it is not going to do a good job. So it really depends on whether isotropy is what matters for your task. If it is, then you should use ChebNet, because ChebNet uses a lot of information about your graph in an isotropic way, whereas the vanilla GCN only uses the first two terms of the Chebyshev approximation.

But we can learn the edges, right? We can learn representations for the edges such that they discriminate between neighbors?

No, no, that one is anisotropic: then you are becoming anisotropic. What I mean is that if you have a purely isotropic graph problem, then you should use ChebNet; otherwise, it is better to use an anisotropic GCN. Of course. More questions, guys?

Hey, I have a question, thanks for the talk. A lot of these methods require an existing adjacency matrix, and for some problems you know that there is a graph structure, but you don't know the underlying connections. Do you know of any work that addresses this problem?

Yes, absolutely. So far, most works focus on the case where you already have the graph structure. But sometimes you just have data, for example a set of feature vectors, and you would like to learn some graph structure. That is very, very hard. There are some works doing that: they try to learn some graph structure and, at the same time, learn the representation. That is promising and interesting, and it is also something I am trying to do now, but I can tell you it is very hard, especially because if you let the adjacency matrix be a variable, then you have n squared unknown parameters to learn, so it is not easy. I would say that there are many natural datasets that already come with graphs, so you do not need to build any graph, and that already gives you a lot of good tools. Now, if you can tell me what kind of application you have in mind, where you would want to learn the graph at the same time, maybe we can talk about it.

I can tell you, and of course Zeming will correct me, but Zeming is actually working on protein function prediction, and the underlying graph would be, for example, the contact map, or the kind of proximity graph of different sites on a protein. And you don't have that; in most cases that is one of the things you have to predict. So you could use it as some sort of latent graph variable for your model. Zeming, maybe you had some other idea in mind.

Yes, actually the more specific problem is that for some of these graphs you know some of the edges, but you don't know the other ones. For example, in protein function prediction, you can imagine two proteins that have similar functions as having an edge between them, but they might not have the same function, so you don't know the edge weights, and you have human labels that are inaccurate. You know that they are connected in some way, but you don't know the edge weights, and you know that there are other proteins that should be connected, but you don't have labels for them. So I guess this is more of a graph completion problem.

Yes, and that one is easier. It is like the semi-supervised graph clustering problem: if you already have a few labels and some structure around them, that is something you can live with. If you have absolutely no structure on the edges and you need to learn the graph, that is very hard.

I see. Okay, thank you.
Hey, I have a question about splitting the data when you are actually training a graph neural network. Can you talk about some of the things you would want to consider when splitting the data into, say, training and validation? For example, you might want all of the nodes in the training data so that the model is exposed to everything that is in the graph data, and you might have a case where different types of edges are imbalanced in the dataset. Can you talk about when that would be important, and what some of the considerations are when splitting the data for training?

Sorry, I am not sure I understand the question. Are you talking about unbalanced training sets?

Yes, and also, if you have a huge relational dataset, what are some of the considerations for splitting the data when you are trying to train a graph network?

For relational datasets, you may have millions of small graphs, and that is fine: graph neural networks are independent of the size of the graph, so there is no issue with learning a good graph representation. Now, for unbalanced datasets, you can probably apply some standard techniques: for cross-entropy, for example, you can weight the loss depending on the size of each class. That may be something you can do, but I have never thought too much about this.
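As a tiny illustration of that standard trick (the class counts below are made up), the cross-entropy can be weighted inversely to the class frequencies:

```python
import torch
import torch.nn as nn

# hypothetical class counts from an imbalanced edge-classification dataset
class_counts = torch.tensor([9500., 500.])
weights = class_counts.sum() / (len(class_counts) * class_counts)  # rarer class gets a larger weight

criterion = nn.CrossEntropyLoss(weight=weights)   # weighted cross-entropy
logits = torch.randn(8, 2)                        # e.g. scores from an edge classifier
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
```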
Okay, thank you.

Any more questions? I am still getting things written here, but you can also speak up.

Yes, I have a question, actually. First of all, thanks a lot for the lecture, especially at this hour for you. How do you deal with cases where the nodes do not have the same feature dimension? For example, if I want to run a simple vanilla graph convolutional network, but my nodes are, even for Facebook, people and then pages, and they have different dimensions. How do you think about a very simple graph neural network in that case?

I don't think this has anything to do with graph neural networks specifically. If you have different dimensions for your vectors, you probably need to put everything into the same dimension and then use an indicator function: one when you have the information, and zero when you don't. This indicator is used during the computation of the loss, so when you backpropagate, if you don't have the feature information, you simply don't use it. But none of this is specific to graph neural networks.

Okay, thank you.

Hold on, you are writing; I am reading a lot here. So maybe I don't understand the question, but I will read it out loud anyway: is there any GCN which can work on multiple adjacency matrices together, for example a bidirectional graph?

I am not sure what this means. If the question is about hypergraphs, where you may have more than one edge connecting your nodes, then yes, there are works about this. It is a natural mathematical extension, so you can do it; there is no limitation in going to hypergraphs, and there are now datasets for this task. So if there is an application, students who are interested could work on it: there are already datasets and papers about this.

Okay, another question: does it make sense to have nodes that are features of a person and do graph classification, or to have the persons as nodes and do node classification?

I don't know. People often ask me the question: can I build a graph given this data? It is really task dependent; it is useful when it captures some good relationships, because a graph is just a collection of pairwise connections. So the question is when that is relevant for solving your task: sometimes it is, sometimes it is not. It is obvious to say, but it really depends on the data and on the task you want to solve.

The student is satisfied with your answer. I think we have run out of questions, unless more are coming my way. Okay. It is starting to get bright outside where you are.

Exactly, I was noticing. The sun is rising.

That's nice. Okay, I think that was it. Thank you so, so much. Those slides were so pretty, and I learned a lot from the way everything fits together. Thank you again for waking up so early.

I think this is a fascinating topic. As you know, I have been involved in it from the beginning, and I think it opens a completely new door to applications of machine learning and neural nets; it is a completely different world. I know your PhD advisor had been working on graph signal processing for a long time, so this was kind of a natural transition for you, I guess. But I think we haven't seen the end of this; we are going to be surprised by what comes out of it. There is already really fascinating work in this area in high-energy physics, in computational chemistry, and in social network applications. And to cite some of the big names, if you are interested in this topic: Jure Leskovec is one of them, in addition to Xavier, obviously, and Joan Bruna, whom you know because he is a professor here and talks about it in this course. Michael Bronstein is also a big contributor; he has made some really interesting contributions to the topic, also with somewhat different methods than the ones discussed today, like using graph neural nets for 3D meshes and for computer graphics, things like that.

I agree. I think this is also a field where there is a back and forth between mathematics and applications. If you look, for example, at this protein work, it is very, very hard, but at the same time we can learn a lot from the mathematical side, and that is very exciting: if you want to make scientific discoveries, you need to be driven by some very hard real-world problems, and at the same time you have these new tools coming from graph theory and neural networks. It is also a way for us to better understand why neural networks work so well. It looks like every day there is a new problem in this direction, so the pie is big for everyone, for the young students who want to come and enjoy this area of research.

Great. Well, thank you again, and enjoy your day.

Yes, thanks again. Thank you so much, guys. All right, bye-bye. Guys, see you tomorrow.