 Yeah, so I will say that I prepared this lecture from scratch and it's been challenging. It's been the second week that I'm doing nothing but preparing and reading and studying and going crazy. But I think it's an improvement. I'm growing. Okay. What do you do research on? How to teach better? This week I've just been reading everything I could about these graph neural nets, graph convolutional networks. I don't know, I read maybe a few tens of publications. I'm a bit drunk, to be honest. Okay. So what are we talking about today? Graph convolutional networks: exploiting domain sparsity, right? Yesterday we saw that Xavier mentioned the three properties of natural signals, which are locality, stationarity, and what he called hierarchy; I call it compositionality. He used the term compositionality to mean all three things together, but I guess it's just jargon; we mean the same thing, right? And so what are these graph convolutional networks? Again, another type of architecture, another way of exploiting the structure of your data, right? So let's actually get there starting from last week's lesson. Last week, quick recap: we talked about self-attention. In self-attention, we had this set of x's, right? So we had x1, x2, and so on up to xt. You stack these x's one next to the other and you get capital X, right? A matrix. Each small x lives in R^n. And then my hidden representation, for whatever x I take into consideration, is going to be a linear combination of the vectors in this set. Okay. And we know, I think from lab number four, that a linear combination of vectors can be written as a matrix-vector multiplication. So we have here that h is equal to this capital X times a, right? And a contains the coefficients that scale these vectors, okay? Then we said that all these coefficients are positive and they have to sum to one. And if exactly one of them is equal to one, then we have hard attention, okay? And this big X is just the collection of x's, okay? But again, it's a set, right? A set means it's not a sequence; there is no order, okay? So far you should be familiar, actually very comfortable, with this kind of notation: a linear combination of columns is just a matrix multiplication, okay? So then I was reading the literature about these graph convolutional networks, and I read and I read, and I'm like, oh, it's actually the same thing. What the heck? So let's get there from this perspective. That is my perspective; again, it might not be the best, but, you know, you have me, so you deal with me. So let's start with these GCNs, not graphical, graph convolutional networks. So my a, this vector on the left here in the attention, contains all the coefficients that are weighting these columns. In this case, I'm going to call it my adjacency vector, okay? And what the heck is this adjacency vector? We have to start introducing a little bit of notation. Here I introduce my first vertex, the red one, which has my given input x and is going to have my hidden representation h, like we were seeing before in the attention part, where we had a generic x and a generic h. So I'm going to keep using this generic notation: I have my generic vertex v, on which I have my generic x and my generic h.
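Just to pin down that self-attention recap before we add the graph structure, here is a minimal sketch, with made-up toy sizes, of the hidden representation as a linear combination of the set of x's written as a matrix-vector product (the names t, n, X, a, h are the symbols from the slide):

```python
import torch

t, n = 6, 3                               # t vectors in the set, each living in R^n
X = torch.randn(n, t)                     # columns x_1 ... x_t stacked into capital X
a = torch.softmax(torch.randn(t), dim=0)  # coefficients: positive and summing to one
h = X @ a                                 # hidden representation: linear combination of the columns
```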
And then, of course, you're going to have all the other vertices, right? I'm going to call them vj, on which you can find the signal, which is going to be xj and hj: the input value and the hidden representation for that specific vertex or node. And then what? Well, you have the whole collection of data points. But there is a difference now: these nodes, these vertices, are actually connected. And so we draw a set of arrows. So now my vector a is going to have components alpha_j which are equal to one whenever there is an incoming arrow from vertex vj to myself, okay? If you think about how we were doing this before for attention, we were computing a as the soft argmax, or just the argmax if it's the hard version of attention, of the scalar products of all those keys, all those rows, with my query, right? So you had all the keys times the query, you get these scores, then you perform soft argmax or hard argmax, and you get these values that tell you who you should look at. In this case, in the graph convolutional network, this structure is given to you already, okay? And so, again, this adjacency vector can be thought of as a vector with ones corresponding to the vertices that have arrows pointing towards myself, the red guy, okay? If you understand this, we're done. The lesson is concluded, right? Because everything else follows automatically. So d is going to be the one-norm of a, which is what? The number of ones I have, right? In my case here, d is going to be two. What is the size of a in this case? Can you tell me, are you following? Are you hearing me? Yeah, the number of nodes, right? The number of nodes in this case is six, and in self-attention we were calling this lowercase t, right? And so the vector a is of course of size t, because you have to combine t vectors: you have t nodes, t vectors, and therefore you need t coefficients, right? So a has size t, and d, the number of ones, is basically going to be the degree, right? I think this can also be written as the zero-norm; for a binary vector the one-norm and the zero-norm are the same. Cool, cool. All right. What next? In self-attention, my hidden representation was this matrix multiplication of my X times a, right? So the columns of X are scaled by the factors inside a. Okay, first issue: if you have multiple ones, this h is going to be larger for vertices that have many incoming connections, right? And if it has, let's say, just one incoming connection, it's going to be small. So this thing is proportional to the number of incoming connections. How can we fix that? Oh, hold on: messages, incoming messages. Yeah, of course, you divide by the number of items, right? And so we multiply by d to the minus one. Cool, cool, cool. So what next? Maybe we want to rotate things, so let's put a rotation matrix there. And then, we haven't considered ourselves, right? This considers all the incoming edges, but we might want to consider ourselves as well, as if there were a self-connection.
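A toy version of that adjacency vector, assuming the six-vertex picture on the slide with two arrows coming into the red vertex (the particular entries of a are made up):

```python
import torch

a = torch.tensor([0., 1., 0., 0., 1., 0.])  # ones where an arrow points towards me
d = a.sum()                                  # degree = number of ones = 2 (the 1-norm of a binary vector)
t = a.numel()                                # size of a = number of vertices = 6

X = torch.randn(3, t)                        # one column per vertex, each in R^3
h = X @ a / d                                # average of the incoming neighbours' x's
```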
So we can add another guy, right? This U-rotated version of the x. Cool. Then, just to make the whole thing look like a neural network, what do we add? Yeah, a nonlinear function, of course, right? ReLU, sigmoid, tanh, whatever. We said that we have several of these vertices, right? We don't just have one vertex, we don't just have one x. We have a set of vertices, a set of inputs, for i going from one to t. And this leads to the matrix notation: you stack multiple h's and you get a matrix; you stack multiple x's; you rotate multiple x's and stack those too. And this corresponds to the attention case, except that the adjacency vector now becomes an adjacency matrix, whose columns tell you where the incoming connections, those incoming arrows, are. And D inverse is the inverse of the diagonal matrix with all the degrees on the diagonal, okay? Finished. That was it, right? Graph convolutional networks. It looks like attention to me, but okay. So what do we do today for the lab? Any questions so far? I mean, are you with me? Have you been following? There is nothing here that we haven't seen last time, basically, except the nonlinearity and the self-connection. Where do features come in? Isn't x a feature? Okay, x is a feature, yes. X is a feature, and the features here are... So there is a graph that tells you which vertices are connected. Each vertex has an x, which is the input, and then it's going to have a hidden value, right? Are the previous hidden vectors used to compute the new one? They're not, here. You can have multiple layers, right? And so the second layer, the next h layer, is going to use the hidden values of the previous layer. It works in the normal way: you stack multiple of these blocks, right? The U term just allows me to also consider my own value, x, okay? So right now, a basically gives me the average of the incoming columns, and then U allows me to perform a rotation of my own vector. So whenever you have a graph, in this case there are two options: either it's you, the red v, or it's the other one, vj, okay? And so here you have two terms: one takes care of the red v and the other one takes care of the vj. Final question. The adjacency matrix does not have self-connections; the adjacency matrix has zeros on the diagonal. If you want to consider the adjacency matrix with ones on the diagonal, you can use the identity plus A, right? Okay, so next slide, which is going to be the thing we are implementing today, okay? Otherwise, did I miss any questions? So alpha is in this a, so the diagonals are all zeros? Yeah, yeah. The adjacency vector here has a one only if vj, my neighbour, is actually connected to me, okay? And since there is no arrow from myself going back into myself, there is no one in the position corresponding to myself. So the adjacency matrix has all zeros on the diagonal, and then ones corresponding to the incoming connections. If you have an undirected graph, then you have a symmetric matrix, because you have the same one in both directions; it's like having an arrow in both directions of the edge, okay? Okay, got it, thank you. Sure.
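Putting those pieces together, the vanilla graph-conv update in matrix form is H = f(U X + V X A D⁻¹); here is a minimal sketch of it with made-up sizes and a random adjacency matrix (the matrix names U and V just follow the slide):

```python
import torch

t, n, m = 6, 3, 4                                # t vertices, input size n, hidden size m
X = torch.randn(n, t)                            # input features, one column per vertex
A = (torch.rand(t, t) < 0.3).float()             # random adjacency matrix
A.fill_diagonal_(0)                              # no self-connections: zeros on the diagonal
D_inv = torch.diag(1.0 / A.sum(0).clamp(min=1))  # inverse of the diagonal degree matrix

U = torch.randn(m, n)                            # rotation of my own vertex
V = torch.randn(m, n)                            # rotation of the averaged incoming neighbours
H = torch.relu(U @ X + V @ X @ A @ D_inv)        # hidden representation, one column per vertex
```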
How is x represented? So how do you represent a node using a vector? X is a vector, right? Of dimension n. And this is your set of vectors, your set of inputs: you have x1 up to xt. This is the set from the self-attention lecture, right? So this is a set. And from this other slide, basically, only some of these x's are connected to other x's. So you have a set of x's, and then you have a connectivity specified between these vertices. So x and h, and the h of the next layer, and the one after that, and so on, are values in a set, but the point is that the elements in this set are connected through these arrows, okay? And that's simply it; there is no magic here, like we were telling you. Suppose you have a graph whose vertices are labelled 1, 2, 3, 4, 5; how do you convert from label 1 to x1? Each of these is just going to be a number, right? Whatever. And you just play with that. You can think of this as a sequence of words in a sentence, or as the pixels in an image, right? It can be just a one-dimensional image, or, you know, a normal image. These are just the signal values, the ones we get when we map the domain, capital Omega, to the image values, right? So this is simply a set of values, and in this case we just specify a specific domain, which has connections between vertices. Simple as that. Anyhow, we are going to check the code right now so that you can understand everything that is going on, okay? Don't get too scared. I don't think there is any more craziness going on. The only crazy part is going to be the type of graph convolutional network we are going to implement. We're going to implement something cool, because otherwise it would be boring: the residual gated graph convolutional network. What a mouthful. And of course it's from Bresson and Laurent; you can see the reference below. So here, again, we can think about our own vertex v, the red guy, which has the input feature x and the hidden representation h. And then you have the vj's, representing all the others, right? In this specific case, we are also going to name the edges. So my edge has a feature on it as well, okay? In this residual gated graph convolutional network, edges also have a representation living on them, and it is called ej, okay? And so all these guys that were white before now have a colour; we're going to have an edge representation for the input layer, e^x, and for the hidden layer, e^h. So what are the update equations for this residual gated graph convolutional network? Since it's residual, we start with our residual connection: we have an input x, the pink one, and then we have plus something, right? Something that is always positive. So actually this could diverge, and an easy fix for this first equation would be to have an additional weight multiplying the parentheses, right? Anyhow, let's go with this version. So we have x plus something of which we take the positive part.
And inside we're going to have the rotation of the input, which is exactly the same as we were seeing before, right? Before we had h equals a rotation of the input x; the same here: there is the residual and then the rotation of myself. And then we have plus a rotation of xj, the incoming j. This rotation is also scaled by eta, and eta is going to be our gate. So now you know why it's called the residual gated graph convolutional network: we have a gate, eta, which is based on the representation living on the incoming edge, ej, and which modulates the amplitude of the rotated incoming vertex xj, right? And finally, we sum over all the edges that come towards my own vertex. So for all the incoming edges, I rotate the vertex representation of the incoming vertex, and then I scale, modulate, the amplitude of this incoming rotated vertex with this gate, right? Again, this gate is a function of ej. So what is ej? Let's figure out the equation. ej is going to be a rotation of my initial edge representation, the one populated with the input data: e^x is the input data living on the edge, and I rotate that with C. I add the rotated representation of my incoming feature, xj, through D. And then I also add a rotation, through the matrix E, of my own vertex feature x. Sweet. So this is my ej representation. And then eta is going to be the following. It's sort of a variant of our soft argmax: at the numerator we have the sigmoid of my ej, which is the sum of those three components on the bottom, and this is divided by the sum of the sigmoids of all the incoming edges, right? Usually, with the soft argmax, you have the exponential of a specific value divided by the sum of the exponentials. In this case, the gate is given by the sigmoid of the given edge divided by the sum over all incoming edges, all incoming connections. Finally, for the next layer's edge representation we also have a residual connection: it's going to be my initial value, e^x, plus the positive part of this ej. Again, this may blow up, because you keep summing positive terms; therefore I would suggest an additional weight multiplying this positive part, so that it can also take negative values. Cool, cool. So that's pretty much it, right? If we compare to what we were seeing before: before, my hidden representation was some nonlinear function, and in this case we chose the ReLU, the positive part, of my rotated representation of myself plus this term over here, X a d⁻¹, which means: take the average of the incoming x's, right? Because a is equal to one for the vertices whose arrows come towards my own vertex, and then I divide by d, the degree, which is the number of incoming edges. So I sum all the incoming values and I divide by the number of incoming values; I compute the mean, and then I rotate the mean, right? And similarly here, we do exactly the same thing: we have the rotations over all the incoming edges, right? These are all the incoming edges, and then I sum them.
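Just to collect in one place the update equations we read off the slide (my transcription of them, writing (·)⁺ for the positive part and ⊙ for element-wise multiplication; the sums run over the incoming edges vj → v):

```latex
\begin{aligned}
h &= x + \Big(Ax + \sum_{v_j \to v} \eta_j \odot Bx_j\Big)^{+} \\[4pt]
e_j &= C\,e^{x}_j + D\,x_j + E\,x \\[4pt]
\eta_j &= \frac{\sigma(e_j)}{\sum_{v_{j'} \to v} \sigma(e_{j'})} \\[4pt]
h^{e}_j &= e^{x}_j + \big(e_j\big)^{+}
\end{aligned}
```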
But in this case, my eta is not just a constant equal to one over the number of incoming connections; it's going to be a number from zero to one which weights my incoming vertex representation based on the representation living on the edge. So there are many colours and numbers and symbols, but I don't think it's that different from what we have seen before. The main differences are this gate, which is no longer a constant factor but a function of the representation, and this residual connection. Again, I would say that an additional parameter is missing right here and here; I would suggest an additional matrix multiplying here and here, so that we can allow for positive and negative values, otherwise this representation may blow up. Now, how do we compute the representation for the second hidden layer? We can call it h^l, my layer-l representation, and therefore xj becomes h^l_j. And so all we have to do now is say that my h at layer l plus one is going to be this current h, right? But I prefer to use h and x in order to drop this additional index, which may create confusion. I had a question: in terms of maybe a potential example, I'm not clear what it means to have this sort of gated, recurrent-type model in the context of graphs. What is an example? Yeah, yeah, sure. So this gating part here: the point is that all these different vertices don't have an ordering, right? I don't know which one is v1, v2. I mean, I know the order, but this red vertex here doesn't know how many neurons, how many vertices, are connected to it, and it doesn't know how to treat them differently unless there is some information coming from the edge. And so the edge allows me to change, to modulate, this incoming message. This guy here transmits its x down this line, but then it gets modulated by this gate, which, again, is based on the representation living on that edge. So the edge has a representation, and this eta gives me a multiplier, basically a factor, a scalar I can use to multiply each component of this vector here. And so it allows me to tune which part of the vector I may be interested in. This is going to be trained with backprop anyway, so the network will figure out what the heck is interesting and what is not. But yeah, the rationale is basically that all the vertices look the same to me. If you remove this part here, you just get the sum of all these h's, right? And that would just say: let's average everything, like the thing I told you here, right? This is exactly what I'm telling you here: in this case you just have, hey, average all the representations on the incoming vertices, right? And that's like, hey, let's blur everything out, let's throw away all the information. In this case instead, we are not just going to average all these incoming values, but we are going to weight them, to modulate them, based on what we think might be relevant or not. So that would be nice. And is that superscript l and l plus 1 for the h's? Does that mean that this is a graph structure over time? Layer, layer, layer. We have several layers in this network.
So h^l with l equal to 0 is going to be my x. It's talking about layers and not time. Yeah, there are several layers, right? You have multiple layers, and all of these layers are still sets, right? So as you have a set of inputs — these are my set of inputs, right? — then you're going to have a set of hidden representations at the first hidden layer, a set at the second layer, and so on, right? So here we just have sets. The only difference is that in this case they are sets, but there are also connections between the elements in the set. Okay, that's the only difference. So the only difference between attention and this stuff here is that these connections are given to you by this adjacency matrix, instead of being computed with attention, by attending and computing the soft argmax and so on. So this is my perspective coming from last week's lesson. The only step that is different from last week is that these connections are given to you. Finished. Everything else is basically the same. All right, so time to go to the notebook, because it's going to take forever otherwise, unless there are imminent questions. All right, so this was heavily inspired by the notebook from Xavier, but I changed everything. So yeah, I didn't like what he wrote; now it's in our format, or in my format, so everything is going to be familiar, at least for you. When I first read it I was like, what's going on here? Okay, okay. So, import crap. The only difference here is that we have this import os, which allows me to set an environment variable: setting the DGL backend to PyTorch tells DGL to use PyTorch. What is DGL? So you actually have to install DGL; it's in the environment description. DGL is the library I use to run convolutional nets on graphs very easily. So we import this stuff, and we also import NetworkX, which allows me to print very pretty charts. Okay, I set some defaults. Oh, you can see now we are using PyTorch. So first of all, we're going to look at this mini graph classification dataset, the MiniGCDataset. I specify the number of graphs, the minimum number of, not vectors, vertices, and the maximum number of vertices, right? Here I just call it with the different names they have, and then I show you these different guys, okay? So the first type is the cycle type, the cycle graph, where each of these is connected to the next one, and you can see again there is a double arrow, right? Then we have the star graph, which is basically everyone connected to one central vertex. Then we have the wheel graph, okay? You can figure out what that means. Then we have the lollipop. Lollipop, lollipop. No, this is it. Okay, never mind. Anyhow, it's a cluster of points connected by a string; it looks like a kite to me, but okay. There is the hypercube, which is super cute, this crazy guy here. And then there is the classic grid, right? This can be thought of as an image or whatever. There is the clique, which is a fully connected graph. And then we have these circular ladder graphs: a ladder which closes onto itself, right? And so what is going to be our task? Given a graph structure, try to classify it as being one type or the other, right?
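For reference, a minimal sketch of how that dataset gets created with DGL (the counts here are example values, not necessarily the ones used in class):

```python
from dgl.data import MiniGCDataset

# 8 graph types: cycle, star, wheel, lollipop, hypercube, grid, clique, circular ladder
trainset = MiniGCDataset(350, 10, 20)   # 350 graphs, each with between 10 and 20 vertices
testset  = MiniGCDataset(100, 10, 20)

graph, label = trainset[0]              # each sample is a (DGLGraph, class index) pair
```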
So each of these graphs is basically defined by its adjacency matrix, and given this adjacency matrix we are going to try to figure out whether a graph is of one type or another. The point is that this adjacency matrix is going to be of variable size, right? Because, as you have seen here before — where is it? — you can give a minimum and a maximum number of nodes, and so you can't really do straightforward classification, right? Okay, cool, cool, cool. I didn't say Google. Okay, my Google is protesting here. All right, so let's add some signal to the domain, right? Those are the domain; this is where the information lives, right? So if you have this guy here — where is it? — if you have this one, no? This is the domain, and then on top of this you're going to have the values, the colours if you have a colour image, right? So these are the domains; now we're going to put some signal on top. In this case, let's actually read together: we can assign features to nodes and edges of DGL graphs. The features are represented as a dictionary of names (strings) and tensors, called fields. ndata and edata are syntax sugar for accessing the feature data of all nodes and edges. So in this case, each of my nodes' features, my x's, is going to be the in-degree, which is basically the number of incoming edges I have, okay? So each node, each x, has as its value the number of connected guys. Each edge instead just has the number one. So each edge has the number one, and the nodes have the number of connected guys. Cool. So here I just generate my training set and my testing set, and then I plot these to show you that these guys have a feature. Both of them are called 'feat', right? Feat, as in feature; there is a 'feat' field for the nodes and a 'feat' field for the edges, right? And so here we go with the equations for the gated graph convolutional network. And again, they look terrible because it's a notebook, so we're going to use these ones, right? Which are a little bit prettier. All right. So before actually reading these instructions, let's read the initialization part of this module, okay? Here we can see that we have a few matrices: A, B, C, D, and E, right? We need those matrices, and therefore, whenever I instantiate my module, which is just a neural network module from torch, from PyTorch, we initialize five different matrices. So A, B, C, D and E are nn.Linear modules; in this case there is also a bias, so it's not just a rotation, these are affine transformations, right? Moreover, we have a batch normalization for the hidden representation and a batch normalization for the edges, right? Whenever we do the forward pass, we pass in g, the graph; capital X, which is the collection of all the vertex features, right? Like we have seen in the attention lesson: we can represent the set of all the small x's with the big X, right? It's not a sequence; it's just a way of representing a set. So in this case, graphs are made of sets of vertices, but where I can also specify the relationship, which vertex is connected to which. So I have capital X, and then capital E^x, right? Which is all the edge features.
So we can have a set of edges, and then we can consider the matrix where I have all these columns, right? And so here I'm going to populate my graph with this representation. On g.ndata, I define the field h, to which I just assign my initial representation, and then I'm going to have Ax, Bx, Dx and Ex, which are the matrices multiplying all these columns, right? You get the rotation of all the columns, simply obtained by passing capital X, the collection of all the x's, through my matrices A, B, D and E, okay? Whereas C was rotating just the edge representation, right? So C multiplies the edges. And then we have this function here, which is a new function, right? We don't know about this stuff, so let's figure out what it is. So maybe now we have to read what's going on here. In DGL, the message functions are expressed as edge UDFs, user-defined functions. An edge UDF takes a single argument, edges, which has three members — source, destination and data — for accessing the source node features, the destination node features and the edge features, right? So whenever we have this edge here, we have a representation living on the edge; then there is a representation living on the source vertex — the incoming vertex, as I used to call it — and then there is ourself, the destination vertex, right? So we have the source vertex, our edge connecting the source to the destination, and then our own destination vertex. And those are x's if they are associated with the first layer of my network, and they are called h if they are associated with the second layer of the network and so on, right? So h is my first hidden layer, which is the second layer of our network, right? All right, so back here: again, this edge user-defined function has a source, the vj, a destination, just v, and the data living on the edge, right? All right, cool. Then, the reduce functions are node UDFs, user-defined functions. A node UDF has a single argument, nodes — before, you had edges, right? — and it acts on a given node, which has two members, data and mailbox. data contains the node features and mailbox contains all the incoming message features, stacked along the second dimension, okay? Finally, we have update_all, which was the function we just saw up here, the new function. update_all has two parameters, a message function and a reduce function: it sends messages through all the edges and updates all the nodes. Optionally, you can apply a function to update the node features after receiving. This is a convenient combination for performing a send from all the edges with the message function and then a receive on all the nodes with the reduce function, right? It's like a condensed version. So let's figure out what my message function and reduce function are. Message function: we first extract Bxj. The edge connects my vj to my v, and so I'm extracting here the representation that lives on vj, right? My Bxj is going to be the Bx associated with my vertex j, this guy here, Bxj. Cool. Then my edge ej is going to be the sum of the rotated edge representation of this edge, the rotated source, and then the rotated destination vertex, right? So here you have...
So you have the edge representation, the C rotation of e^x, right? Then we have D times the source, so Dxj, and then finally E times the destination, so Ex, right? Cool. Then I actually store this ej in this capital E so that we end up with all the representations for later usage, because later we're going to use this ej over here on the bottom right. Okay. So now we have computed the message, and after the message is computed we call the reduce function. The reduce function finishes computing the update formulas, right? So Ax is going to be the Ax for my own data: capital X is all the vertices and lowercase x is this one, right? Then I check my mailbox: the message function sent a message through the edge, and now, at the receiving end, we get that message, right? So we check the mailbox and we receive this Bxj, right? So here we get Bxj. Then we also get the representation ej; that's coming too. Then I compute the sigmoid of the incoming edge, right? And then all we have to do is say that my h is going to be my rotated x — so Ax, my own rotated vertex representation — and then I sum, over all the incoming edges, my gate multiplying my incoming rotated representation, and then we divide by the sum of all those sigmoids, right? So we multiply the sigmoid by this Bxj, and then we divide by the sum of all of them, and we sum all these guys, right? So we have the sum of these scaled Bxj's, which is then normalized by the sum of all the sigmoids. And that's it. So now we have the lowercase h, which gets written into the big container of h's. And that's how we write down these three equations — four equations; well, three, right? We haven't seen this last one yet. So what else? Now we can retrieve h, right? Because we have just updated all the representations, which were computed here and returned there. Then we can also get the new edges, because we wrote the edge information here, right? So here we were writing the new edge information, and here we've been writing the new h information. So we retrieve the new h and the new e. We divide by the square root of the size so that things don't change with the size of the hidden representation. This is just a technicality, but it gives you a consistent scaling factor, like we saw last week with the set-to-set attention, where we were dividing by the square root of the dimension so that the soft argmax behaved similarly regardless of the dimension; we don't change the temperature. Then we apply batch normalization so that we get nice gradients and we don't overfit, and, you know, all the nice things that batch norm gives us. Finally, we apply the nonlinearity, right? The positive part; we compute this nonlinearity for this one and this one. Then we can write that my new h — the representation for the first hidden layer, so my second layer — is going to be my input x plus h, right? The input x plus this guy here, this positive part. And the same for the e representation: it's going to be my initial representation plus this new e. Finished, and we return H and E.
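Pulling the pieces we just walked through into one place, here is a stripped-down sketch of such a layer, not the notebook verbatim: it assumes equal input and output dimensions so the residual works, and it drops the 1/√d scaling mentioned above to stay short.

```python
import torch
import torch.nn as nn


class GatedGCNLayer(nn.Module):
    """One residual gated graph conv layer, sketched after the walkthrough above."""

    def __init__(self, dim):
        super().__init__()
        # A..E are affine maps (rotation plus bias), as in the notebook
        self.A, self.B, self.C, self.D, self.E = (nn.Linear(dim, dim) for _ in range(5))
        self.bn_h = nn.BatchNorm1d(dim)   # batch norm for the vertex representation
        self.bn_e = nn.BatchNorm1d(dim)   # batch norm for the edge representation

    def message_func(self, edges):
        Bx_j = edges.src['Bx']                                        # rotated source vertex
        e_j = edges.data['Ce'] + edges.src['Dx'] + edges.dst['Ex']    # edge representation
        edges.data['E'] = e_j                                         # store it for the edge update
        return {'Bx_j': Bx_j, 'e_j': e_j}

    def reduce_func(self, nodes):
        Ax = nodes.data['Ax']                 # my own rotated vertex
        Bx_j = nodes.mailbox['Bx_j']          # incoming messages, stacked along dim 1
        e_j = nodes.mailbox['e_j']
        sigma = torch.sigmoid(e_j)            # one gate per incoming edge
        h = Ax + torch.sum(sigma * Bx_j, dim=1) / torch.sum(sigma, dim=1)
        return {'h': h}

    def forward(self, g, X, E_X):
        g.ndata['h'] = X
        g.ndata['Ax'], g.ndata['Bx'] = self.A(X), self.B(X)
        g.ndata['Dx'], g.ndata['Ex'] = self.D(X), self.E(X)
        g.edata['Ce'] = self.C(E_X)
        g.update_all(self.message_func, self.reduce_func)   # send on all edges, reduce on all nodes
        H, E = g.ndata['h'], g.edata['E']
        H, E = self.bn_h(H), self.bn_e(E)                    # nice gradients, less overfitting
        H = X + torch.relu(H)                                # residual plus positive part
        E = E_X + torch.relu(E)
        return H, E
```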
I have a multilayer perceptron, and then here I have this stack of layers. So here I just call my gated GCN, and you can see all these matrices, but again, we don't care. Some stuff for bookkeeping and accuracy computation. Okay, let's test the forward pass. How do we test the forward pass? Here I just define my data, and then I have my batch of x's; that's the data that lives on the vertices. So my X is the data that lives on the vertices and my E is the data that lives on the edges. These are all ones, and those are just the degrees. So I show you a few of these values. Yes, sir. Again? Your E's will be, quote unquote, all ones just for the first layer, or the first stack — sorry, what's the word? — and in the rest of them you pass the outputs of the previous stack. Absolutely, yes. These are the input values, right? So my graph, which is the domain: I put some signal on it at the beginning, which is kind of arbitrary right now. For the nodes I put how many incoming connections they have, and for the edges I just put one. Then I have several layers of this graph convolutional net, like here. This gated graph convolutional net has a few layers, right? So if L is the number of layers, you're going to have as many of the graph convolutional network layers I showed you before as L, right? You stack several of these layers. At the beginning you have these degrees and all ones, and then, as you go through the stack, you start having a representation that evolves, right? Makes sense, right? Yes, no. Would the value of E kind of be like the weight of the edge? Is that wrong? No, the value on E is the representation, right? E here — okay, right now it's just a one, but later on, in the following layers, this is going to be a vector, and this vector basically allows you to tune this gate for the incoming message. Let's finish the notebook; otherwise we never finish the notebook, and then I can answer every question you have. All right, so I show you here this DGL graph, and it has these features; these features are going to be my input. In this case I have 133 nodes and 739 edges. What is the maximum number of edges I could have? Are you following? 133 squared, right? Divided by 2. Yeah, on the order of 133 squared, right? Cool. So let's execute this one. We see that at the beginning the network cannot really classify correctly; it just outputs something silly. So let's actually figure out how to train this network. I have my J, my objective function, which is going to be the cross entropy of the batch scores and the batch labels, right? The batch scores, the logits, are what my network tells me, and then I have the labels, the original labels for the graphs. And then — this actually runs fine, so everything was working — we have the forward pass, loss computation, zero grad, backward, optimizer step. And so we define here a training function, which is exactly the same as we have seen every time. Let me run this line as well. Training function: we know exactly everything here, right? The x's are the data, the features on the nodes, the vertices; E is the features on the edges; the batch scores, the logits, are the output of my model.
The J, the objective function, is going to be the cross entropy between the logits and the targets. Then you clear the gradients, you compute the backward pass, and then you step, right? These are the five steps: one, two, three, four, five. Finished. Evaluation is the same, without the update of the parameters. So here we just have the training dataset and the testing dataset, and we can check the progress so far. Here I just show you the training and the testing accuracy. Oh, sorry, let's put 40 epochs maybe. And so let's see whether it works or not. It's getting better, and, yep, the accuracy starts to grow. Test, still low. Okay, it's going up as well. Yeah, there we go. Convergence, yes, yes. So again, what do we want to think about? Think about it from the perspective of attention. We have a set of values, right? And in attention, we didn't have any kind of connection between these values; it's just a set. Everything looks at everyone, right? In attention you have to check everything that is going on, because you have no idea who should be looking at what. In this case — okay, that's the main point, right, the one Xavier made yesterday — the main point is the sparsity of the adjacency matrix, because the sparsity gives you structure, and the structure is given by the ones, which tell you who's connected with whom. If everyone is connected with everyone, you get ones everywhere, right? See, it's converging here. So if you have everyone looking at everyone, your adjacency matrix is just a matrix of all ones. If you have just a few vertices that are connected to each other, then you get some sporadic ones, a sparse matrix. Okay, this stuff was going to 100% accuracy before. I guess I should set a seed so that I can show you better trials. That was pretty much everything. So really, there is no big deal, I think, at least from this first perspective and from what I learned this past week about these networks. Are there questions? I can take questions right now. I didn't want to take forever to finish the class, otherwise, if people have to leave, they can't. Also, I am nine minutes over, so I'm not on time either, but, you know, better than worse. So what exactly did we predict here? I know we had the classes, the eight classes of graphs, in the beginning. So these are the classes, and then I'm generating, down here... where is the training dataset? I can't see. Hold on. Train, train, train. Here. So, training dataset: it creates this dataset of 350 graphs that have anything between 10 and 20 vertices each, and each can be any of those eight classes we have seen before. Let me unzoom here. So it can be class seven, six, five, four, three, two, one, okay? And zero. So you can have any of these, and this stuff has a variable number of vertices, right? And now you're asking your network: which type of graph did I give you? And so our graph convolutional network tells you which type of graph you're looking at, right? So it's basically doing classification of your adjacency matrix, which specifies the connectivity of these vertices. So the train set is a set of small graphs, right? And the batch size there is 50, so that's 50 graphs of varying sizes, with a varying number of nodes, from 10 to 20. Yes.
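Concretely, one training step on such a batch of variable-size graphs looks roughly like this; a minimal sketch, assuming a model with the (graph, node features, edge features) signature from the notebook and the 'feat' fields we set up earlier:

```python
import dgl
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()          # expects logits and integer class labels

def train_step(model, optimizer, graphs, labels):
    g = dgl.batch(graphs)                  # merge the variable-size graphs into one big graph
    X = g.ndata['feat']                    # node features (the in-degrees at the input layer)
    E = g.edata['feat']                    # edge features (all ones at the input layer)
    batch_scores = model(g, X, E)          # 1. forward pass: one logit vector per graph
    J = criterion(batch_scores, labels)    # 2. objective: cross entropy of logits vs labels
    optimizer.zero_grad()                  # 3. clear the old gradients
    J.backward()                           # 4. back-propagate
    optimizer.step()                       # 5. update the parameters
    return J.item()
```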
It's not necessary that each batch has graphs with the same number of nodes; that's handled inside, that's done by DGL to give you, you know, speed in training, right? So that's done behind your back. It's the same as when you train a language model: you want to batch sentences with similar lengths so that you don't waste computation, and in a similar way you can do that here as well, right? What do the dimensions of the output look like? Oh, here, right? So here I have an MLP that goes from the hidden dimension of whatever it is to my output dimension. What is the output dimension? Let's go figure it out. Output dimension: eight, right? Eight are the possible classes, therefore I give you a logit vector of dimension eight. Finally, whenever you have this logit of dimension eight — it's just a classifier, right? — you plug it into the cross entropy, here: the loss of my logits against my labels. And this loss is defined down here; it's my cross entropy, which expects logits and then computes the final score. And then you just run back-propagation. So yeah. Does every node have a logit? And the label: does each node have the same label, which corresponds to the class of the overall graph, or is that not how that works? Each graph has a vector of logits, right? You want to classify graphs, different graphs. You provide these graphs to the network, and these graphs have arbitrary structure; they don't have a fixed number of vertices. So let's say you have ten graphs: a graph of size five, size ten, size fifteen, size twenty. You have sets with different numbers of vertices and a specific connectivity between those vertices. Given these variable-length sets, where you specify the connections between the vertices, you ask your network: give me a logit vector telling me which graph this belongs to, which family it belongs to. So eight are the possibilities you have: cycle, star, blah, blah, blah. Each input graph will be mapped to one specific one of these, right? So you just do classification of the type of graph. But the point is that these graphs have a variable number of vertices, so you have to somehow query the structure, right? The graph convolutional net has to intrinsically extract what topology of connectivity you provide. Oh, okay, cool. Thank you. The example that was shown was classification of graphs, but there would also be a use case in the real world where you have one single graph, where each node represents a particular entity, and you need to classify those entities. So then how do we do that? Take some part of the graph, train the model on that, and then do the rest of the graph? So, okay. In this one here you have different graphs, and at the end you gather them all together, right? So in the gated GCN here, at the end of the forward you have this part here, right? You have mean_nodes, right, from DGL: you take the mean of all the nodes and then you apply the MLP on this mean representation. This is for the classification of the whole graph, right? But if you'd like to do the other thing, then you don't have this function, right?
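That graph-level readout he is pointing at boils down to something like this; a minimal sketch, assuming a hidden dimension name and an eight-class head, not the notebook's exact module:

```python
import dgl
import torch.nn as nn

class GraphReadout(nn.Module):
    """Average the vertex representations of each graph, then classify the whole graph."""

    def __init__(self, hidden_dim, n_classes=8):
        super().__init__()
        self.mlp = nn.Linear(hidden_dim, n_classes)

    def forward(self, g, H):
        g.ndata['h'] = H               # final vertex representations from the conv stack
        hg = dgl.mean_nodes(g, 'h')    # one averaged vector per graph in the batch
        return self.mlp(hg)            # one logit vector of dimension 8 per graph
```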
You just don't have this line, and you apply the classifier to each vertex, so you get a logit vector out of each vertex, right? For each vertex you're going to have a logit vector. Yes. Yeah, okay. Cool. So you have a logit per vertex and you do the training on each of these guys, right? For example, if you go to DGL, dgl.ai — this class took really a lot of effort, guys — the tutorial, this one... no, what is it? There is a... come on. Oh, okay, here. Yeah, cool. So in this case they are doing classification of the nodes, okay? This is the second type of thing. In this case you have a karate club; I'm not sure if you're familiar with it. There is the instructor, number zero, and there is the manager, number 33. And these edges represent, basically, the interactions outside the club, in real life, right? So number four, you know, interacts a lot with the instructor outside the club, and number 26 interacts a lot with the manager outside the club. You only have two labels: instructor and manager. And you'd like to get a label for all these other nodes, all these other vertices. So when you do the training part, this is called semi-supervised learning, because you only have a few labels, right? So when you train this stuff, you have your conv net — no, your graph conv net — which outputs a number of classes for each vertex, right? Each vertex gets a full logit vector. But then, during training, you get the logits — these are logits for every vertex — but when you compute the final loss, which is the negative log likelihood, you only select the labels that you have, which are the labels for node 33 and for... what's our other little guy, right? So you have zero and 33. These are the only two labels you have. And so there is a vector containing 0 and 33 — somewhere, yeah, here — these are my labelled nodes; they are just these two guys. And so, in my training loss here, you only select the two nodes that have a label, you enforce those labels to be that one, and then you train with the classical stuff: zero grad, backprop, optimizer step. And this allows you to propagate throughout the whole network structure the information that has to come out of the logits of two specific vertices, right? So you have a stack of several graph convolutional layers, you enforce those two vertices to output their specific label, then you backprop, and all this information propagates through the network, which propagates a kind of representation across the domain. And it shows you, if you do the plotting — this is at the beginning, so this is the representation of the vertices without training — and then, after you train for a few epochs, you can see how this representation gets pulled around, right? Zero and 33 get pulled apart so that they are linearly classifiable — well, easy to tell apart — and then they basically drag vertices close to them based, sort of, on the number of connections they have. And that's how you do classification of vertices rather than classification of graphs, right? Maybe, yeah, yesterday we didn't quite mention how you apply these things.
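A minimal sketch of that semi-supervised loss, along the lines of the DGL karate-club tutorial he is showing; `net`, `G`, `inputs` and `optimizer` are stand-ins for the tutorial's network, graph, input features and optimizer:

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(net, G, inputs, optimizer):
    """One training step where only two vertices of the karate-club graph carry a label."""
    labeled_nodes = torch.tensor([0, 33])   # instructor (node 0) and manager (node 33)
    labels = torch.tensor([0, 1])           # their two class labels
    logits = net(G, inputs)                 # one logit vector per vertex of the single graph
    logp = F.log_softmax(logits, dim=1)
    loss = F.nll_loss(logp[labeled_nodes], labels)  # loss only over the labelled vertices
    optimizer.zero_grad()                   # then the classical stuff: zero grad, backprop, step
    loss.backward()
    optimizer.step()
    return loss.item()
```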
But again, Xavier and myself are maybe not too interested in the application part, but perhaps more in the algorithmic part. Did I answer your question? That makes sense, yeah. Okay, awesome. At least I make sense sometimes. More questions? No, you're done, you're fed up. But it's late. There's a team of people... Yeah, dinner. I'm hungry. My roommate just ate my dinner, what the hell. These two weeks were crazy; I really worked a lot. All right. Peace. Bye-bye. All right. So this was, I think, one of my most intensive lectures so far. I prepared it in under one week, and I think I may have covered a lot of material, so it's totally reasonable to feel somewhat overwhelmed at the moment. But then, how can you actually squeeze everything out of this video, right? There are a few steps I would highly recommend you follow, starting with comprehension issues: again, I might have been confusing, and I have also re-recorded a few new chunks because I messed up in class. So if you have any question I have not yet addressed, just type it in the comment section below this video. Moreover, if you'd like to follow up with me on the latest news about teaching and machine learning and very cute and pretty things, just follow me on Twitter; I'll talk about the latest news over there. Moreover, if you'd like to always be up to date with my latest content, I recommend you subscribe to my channel and click on the notification bell so that you don't miss any new video. If you like this video, don't forget to press the like button; I really appreciate that. Each of these videos has an English transcription, and we have Japanese, Spanish, Italian, Turkish, Mandarin and Korean translations available as well, if English is not your primary language. And again, if you'd like to help out with the translation process, please feel free to contact me by email or on Twitter. Moreover, you should really, really, really take the time to go over the notebook we covered today and check every line of code, because there are many things I may not have spent enough time on today, just for the sake of keeping this video within a specific time limit. You should really go through every line and deeply understand what's going on. And if you find a typo or an error or a bug, as many of you already have, please do report it on GitHub, and, if you feel inclined, you can also send a pull request fixing the error. That's great, because we can all benefit from your contribution, and you also get some value out of getting your hands dirty with the code. Finally, you're going to be helping me and the whole machine learning and deep learning community that is using this material. And that was pretty much it. Thanks again for sticking with me. And, as I've been told to say: like, share and subscribe. See you next time. Bye-bye.