All right, let's welcome our next speaker, Augusto Stoffel. Augusto is a mathematician and also a machine learning scientist and leader. Today Augusto is going to talk about using graph neural networks for information extraction with Python. As part of his postdoc, he did research in the field of algebraic topology and its applications to quantum field theory. Well, that sounds really interesting, and I'm really looking forward to this talk. Over to you, Augusto. Good luck. All right, thanks for having me, and thanks to the audience for the enthusiasm at the last talk on a Friday evening. So I'm going to talk about graph neural networks for information extraction. Let me start by situating this topic in the deep learning landscape; along the way I'll also get the chance to say a bit about my work at DIDA. There are basically two big camps in deep learning: natural language processing and computer vision. A typical NLP application could be to try to understand, say, some legal documents, some contracts, and classify the different sections of a contract, perhaps identifying those which may not have legal standing. Some examples of computer vision applications could be to look for manufacturing defects in technical components, or remote sensing, that is, the analysis of satellite imagery, say to look for small-scale mines. Those, by the way, were all projects we developed for clients at DIDA. Now, those two camps have their very specialized sets of tools and models, and they work very well. But of course, those are far from the only interesting problems we could tackle with data science. So it would be very interesting to have more generic, more flexible deep learning models, and this is exactly where graph neural networks come in. To guide our discussion, I'll pick a sample use case, also inspired by one of our projects, which is to understand documents in tabular format.
So what we see here is a picture of an invoice, or a bill. This document has some headers, some addresses, some dates. And then the interesting part is this table, right? The table lists some products or services, and these entries have a description, some prices, some quantities. What we would like our machine learning model to do is to classify the words in this document appropriately. So let's say this is the date when the document was issued; this would be part of the metadata. And for each entry in the table, we identify the description part as such. Maybe there's some more fine-grained information here: this description has some units, then a price, then a quantity, and so it goes. Now, a standard language model only looks at the text sequentially. For such a model, this quantity "2" here would come completely out of the blue, right? Because it follows some information that is not really related to quantities. So we would like to have a model which is able to get a broader view of the data. All right, so here is a typical information extraction pipeline. We have some input data, some input documents, and we apply some preprocessing to them; in our case the input was PDF files, and we preprocess them in a suitable way so the data can be fed to a machine learning algorithm, some number-crunching algorithm. And out of this comes some useful prediction from the model. Okay, so where could graphs potentially enter this picture? There are two places. In the preprocessing step, we would like to take this document, this data, and encode it using a graph in a way that meaningfully captures the structure of the input. And in the second step, the number-crunching step, we would like to use an algorithm that can consume and make good use of this graph data. All right, so here is a quick agenda for the remainder of the talk.
I'm going to start in this part of the pipeline, discussing graphs and graph neural networks. Then, as an interlude, I'll make a quick comparison with convolutional nets, which are the bread and butter of computer vision. The hope here is that most people have some familiarity with convnets, so you have a reference point to compare against, and maybe you'll even understand convnets a little better after you see this. And at the end, I'll go back to our use case, to illustrate a bit how the preprocessing would look. All right, very good. So what is a graph? A graph is just a picture like this one, which you make out of dots, which we call nodes or vertices, and arrows connecting the dots, which we call edges or links or relationships. In applications, the nodes and edges will typically come with additional data, some labels. Say this graph represented a social network where the vertices represent people and the edges represent payments people made among themselves. Then our nodes would be labeled by the name or other attributes of the person, and our edges would be labeled by a number, the amount of the money transfer. And then, just as a quick heads-up: this is one version, one definition of a graph. There are many variations. If you pick up a random paper, something slightly different might be meant; maybe parallel edges like this one are not allowed, or maybe loops are not allowed, or maybe the edges are considered not to be directed. But for our example here, this will be our notion of graph. And I want to give you a mathematical definition, not to be precious about it, but because there is a slight difference between what a graph is and what a computer representation of a graph is, and it's good to have clarity about this when working with graph algorithms. And also, it's a fun thing to look at.
So what is a graph? To be very precise we could call this a directed pseudograph, but I'll just say graph. A graph is given by a set of vertices, denoted curly V, a set of edges, denoted curly E, and two functions from E to V that I call source and target. In my example, the set of vertices is A, B, C, D, and E, and the set of edges is the small letters p, q, up to v. If I take my edge p here, its source is the node A and its target is the node B. And if I look at my edge v, its source is the node E, and its target is also the node E. That's it; that's what a graph is. I already made the comment that in applications, vertices and edges come with labels. The way this is modeled is as functions from V, the set of vertices, to some set of vertex labels; so maybe each vertex has a string attached to it, or some more complex data structure. And the same for the edges. All right, so how do you represent a graph in Python, or any other language? The first step is to pick an enumeration of the vertices and edges, and that's why I changed my notation a little bit: before, my nodes were A, B, C, D, and E, and now I'm calling them v0, v1, v2, v3, and v4. I just numbered them, put them in order. And a similar thing for the edges. Then there are a couple of alternative ways to represent the graph. One of them is as an adjacency list, which is a list of pairs of numbers, with one entry for each edge in my graph. Say the zeroth entry of my list describes the edge e0; that entry is a pair of numbers, in this case (0, 1), indicating the fact that the source is v0 and the target is v1. And so it goes, right? The last entry in the list describes my last edge, e6, which has source and target both v4, so the pair describing it is (4, 4). All right, so that's it.
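As a concrete sketch, the adjacency list of the example graph could be written in plain Python like this. Only the first and last edges (and the three edges into v4 mentioned later) are fixed by the talk; the remaining pairs are illustrative guesses:

```python
# Adjacency list for the example graph: vertices v0..v4, edges e0..e6.
# Entry k is the pair (source of edge e_k, target of edge e_k),
# given as vertex indices.
adjacency = [
    (0, 1),  # e0: v0 -> v1 (from the talk)
    (0, 2),  # e1 (illustrative)
    (1, 2),  # e2 (illustrative)
    (1, 3),  # e3 (illustrative)
    (2, 4),  # e4: v2 -> v4 (from the talk)
    (3, 4),  # e5: into v4 (source is a guess)
    (4, 4),  # e6: a loop at v4 (from the talk)
]

def source(k: int) -> int:
    """Index of the source vertex of the k-th edge."""
    return adjacency[k][0]

def target(k: int) -> int:
    """Index of the target vertex of the k-th edge."""
    return adjacency[k][1]
```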
Now let's go to the next step, where we want to attach some data to vertices and edges. Here's how this could be implemented. We have to pick two types: suppose we want to label the vertices with a type NodeT and the edges with a type EdgeT. Then my class Graph would be a little data class depending on these two types. We have a list called nodes, which is just a list of objects of type NodeT; a list of edges, which is a list of objects of type EdgeT; and then an adjacency list like I just described, which is a list of pairs of integers. The two methods, source and target, are just what I described: if I want to know what the source of the k-th edge is, I look at the k-th entry of the adjacency list, which is a tuple, and take the first item in there. Good. All right, so now we can get to graph neural networks. To discuss that, I want to fix on a fairly simple setup. First of all, fix one kind of problem out of the many we could pick: I'm going to discuss the node classification problem. What this means is that we are given as input a graph, and out of it we want a labeling of the vertices. Let's say it's a binary classification problem, so it's just a function from the vertices to {0, 1}. Or alternatively, you give me this graph with an enumeration of vertices, and then you just have to output a list of zeros and ones, where the i-th entry corresponds to the i-th vertex. And I want to make two more simplifying assumptions. One is that my nodes are labeled with tensors, that is, arrays of numbers of a fixed size. The other is that my edges are labeled by an enumeration type, which just means I have a finite list of possible edge types, not too big.
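A minimal sketch of the data class just described, together with the node-classification labeling conventions (the names Graph, NodeT, EdgeT, and Relation are mine, not from any particular library):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Generic, List, Tuple, TypeVar

NodeT = TypeVar("NodeT")
EdgeT = TypeVar("EdgeT")

@dataclass
class Graph(Generic[NodeT, EdgeT]):
    """A directed (pseudo)graph with labeled vertices and edges."""
    nodes: List[NodeT]                # label of each vertex
    edges: List[EdgeT]                # label of each edge
    adjacency: List[Tuple[int, int]]  # (source, target) vertex indices per edge

    def source(self, k: int) -> int:
        """Index of the source vertex of the k-th edge."""
        return self.adjacency[k][0]

    def target(self, k: int) -> int:
        """Index of the target vertex of the k-th edge."""
        return self.adjacency[k][1]

# For the node classification setup: nodes are labeled by feature
# tensors (here plain lists of floats) and edges by an enumeration
# type, like the social network example.
class Relation(Enum):
    FRIEND = 0
    COWORKER = 1
    RELATIVE = 2

g = Graph(nodes=[[1.0, 0.0], [0.5, 0.5]],
          edges=[Relation.FRIEND],
          adjacency=[(0, 1)])
```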
Say if this was a social network, then maybe each of these tensors would be some numerical encoding of various attributes of a person, and maybe we could have a few different types of relationships between them: say two people can be friends, or they can be coworkers or relatives. Those would be the three entries of my enumeration type. All right. So now I can describe for you a possible graph neural network layer, and what I'm doing here is pretty much taken out of a paper which is a nice read. Here is how this works. My GNN layer is just some layer in a model which takes in such a graph and outputs a new graph. This new graph has the same topology, so the same nodes and edges, and I'm not changing the edge types either. But for each node, I'm computing a new feature tensor, and here is a formula you can use for that. Don't worry if you can't follow this 100%, especially if it's the first time you see anything similar, but I want to read this formula step by step so we see what's going on. So what is y_v? It is the output feature tensor of the node v. Let's focus on this guy here, the node v4; so I would be describing this tensor here, the output feature tensor of v4. And this is computed as follows. It's a summation, and the summation has one term for each edge incident to v, that is, coming into v; I call w its source and e the edge itself. In the case of v4, the sum has three terms, one corresponding to e4, one to e5, and one to e6. Now let's see how each of these terms is computed. What I'm supposed to do is take the input feature tensor of the node which is the source of that edge. So let's focus on this edge here: its source is v2, so I look at this tensor here, the input feature tensor of v2.
Then I multiply this tensor by a weight matrix, which is part of the model; it's a learned weight matrix. But, little detail, there is actually one different weight matrix for each edge type. So what I should do is, looking at edge e4, see what its type is. It's one of the possible enumeration elements, and there is one little weight matrix corresponding to each edge type, each enumeration element, so I pick the appropriate matrix. Then I multiply those things, I multiply by a normalization factor, and then I add all these terms up. Let me discuss the normalization factor. It's basically the degree of the node, so the number of incident edges, except that I only count the incoming edges: I look at how many edges have v4, in this case, as target, and moreover I restrict to those that have the same type as the edge e. In this small example the normalization factor won't make a huge difference, but you can imagine, say, a social network, where it's very typical that some nodes have just a few connections and some nodes have potentially thousands of connections. So a normalization of this kind is important for things not to blow up. And then, last step, I apply a nonlinear activation function; it could be a ReLU, a sigmoid, whatever we like. So that's it. That's a formula for a perfectly fine GNN layer. Maybe an alternative way to think of this is to imagine that each node has its incoming input feature tensor and has to compute a new tensor, and the way it does that is by receiving a message along each of the arrows incident to it. So the message would be this guy here, and then the node receives all these messages and totalizes them, incorporating those messages into a new value.
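The layer just described can be sketched in a few lines of NumPy. This is my reconstruction of the formula from the talk, computing y_v = activation(sum over incoming edges e: w -> v of W_type(e) x_w / c_{v,type(e)}), where c counts the incoming edges of that type; it is not taken from any particular library:

```python
import numpy as np

def gnn_layer(x, edge_index, edge_type, weights, activation=np.tanh):
    """One message-passing layer of the kind described in the talk.

    x:          (num_nodes, d_in) array, one input feature tensor per node
    edge_index: list of (source, target) vertex index pairs (adjacency list)
    edge_type:  list with the enumeration index of each edge's type
    weights:    (num_types, d_out, d_in) array, one learned matrix per type
    """
    num_nodes, _ = x.shape
    num_types, d_out, _ = weights.shape
    y = np.zeros((num_nodes, d_out))

    # Normalization factor: for each node, count the incoming
    # edges of each type.
    count = np.zeros((num_nodes, num_types))
    for (s, t), r in zip(edge_index, edge_type):
        count[t, r] += 1

    # Each edge sends a message from its source to its target:
    # the source's feature tensor, multiplied by the weight matrix
    # of the edge's type, divided by the normalization factor.
    for (s, t), r in zip(edge_index, edge_type):
        y[t] += (weights[r] @ x[s]) / count[t, r]

    return activation(y)
```

For example, with a single edge of type 0 from node 0 to node 1 and an identity weight matrix, node 1 simply receives node 0's features passed through the activation, while node 0 receives nothing.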
And we could have many variations on this scheme. Here we have this matrix multiplication; instead we could have maybe some attention-like mechanism, where this value depends not only on the input feature tensor of w, the origin of the arrow, but also on x_v, the input tensor of the node in question itself. We could totalize these tensors using a different formula, maybe a max instead of a summation. So many variations are possible here; it's a very flexible scheme. All right, so now we can go to our comparison with convolutional networks, and let me start with a quick recap on convolutional networks. This is an algorithm for images, right? The way it works is we have an image, and for each pixel in the image we have an input feature tensor; in the first layer it would be just the intensities of the red, green, and blue channels. And we want to compute for each pixel a new feature vector. The output feature vector of the pixel at position (i, j) is computed as follows. We look at a little patch of, say, three by three pixels around it. For each pixel in this patch, we look at its input feature tensor, then we multiply it by a matrix which is learned as part of the model, and there is one different matrix for each position in the patch. Then we add these things all up, all nine vectors, and apply some nonlinear activation function. So the summation is over the indices k and l, between minus one and one, so each pair (k, l) locates one position in the three-by-three patch. All right, so that was the recap. Now, what on earth does this have to do with graphs? Here's a little graph representing this three-by-three patch. In this piece of a graph, each node represents a pixel, and the arrows represent the relationship of being a neighbor. And moreover, two pixels can be neighbors in one of eight different directions, right?
These are the cardinal directions, and those are the labels I attached to my edges. Okay, and now, on the next slide, I'm going to take the formula for the convnet and just rewrite it using the graphy syntax from the previous slide. So here it is; let me read it again for you. This is the formula for the output feature tensor of the pixel p, which is the center here, and the way I compute it is the following. I look at all the neighboring pixels; each neighboring pixel is a node in my graph with a link to p, and this neighboring relation is of one of the eight types, one of the eight cardinal directions. For each of these pixels, I look at its original feature tensor, its x_q, and I multiply by a matrix. The matrix depends on which position this is, so it depends on which cardinal direction this neighbor is coming from. And then I add these things all up. Moreover, the input feature tensor of p itself also plays a role, and this is also accounted for: I could include a loop, a self-edge around p with the label o, and then this term would be included in the summation, would also be summed. I just didn't do this to keep the picture a bit clearer. All right, good. So this would be a picture representing the entire image. So that's it. Of course, a graph neural network can take in graphs of arbitrary shapes, while a convnet is specialized to graphs of this particular shape. But that's fine: convnets are a particular case of graph neural network layers. All right, so now we can go back to our use case, which was about information extraction from tables. Here it is. The idea is we receive a table; it can be like this table, or it can be a completely different layout like this one. From this page, we make a graph. In this graph, each word is a node, and then we have a bunch of edges, which are added in the following way.
For each word, we divide the page into four sectors: the stuff that is above, below, and to each side. Then we look at the nearest neighbor of the word in each of these directions. Say, above, the nearest neighbor is this zip code, 12545; it would be represented by this node, and we add an edge linking them. And of course this edge has a label which says it's a neighbor above. And we keep going like this. Maybe in the next step, 12545 has as its neighbor above the word "South", so that would be this guy, and its nearest neighbor to the left would be the word "States", so that would be this guy, and so on. And that's it; that's how we make a graph out of the page. In a table, this will link words that belong to the same line or the same column, and this is where our model can learn interesting information. All right, so here is a possible end-to-end model architecture, which is pretty much what we used in this case. We take in these pages, we collect all the words, we link the words to make a graph, and now we need to associate useful features to each word. There are basically two categories of things you can do. First, you can look at geometric features, so the coordinates of the bounding box of the word; this makes some features. And then there are features from the text of the word. Here you could come up by hand with some stuff like: is this word purely numerical, or partly numerical? Is it in a list of common words, like the name of a month or some common header word? Or you could try to be fancier and design a character CNN that learns useful features on the fly. Anyway, for each word, you concatenate these things, you get some reasonably big feature tensor, maybe pass it through a linear layer to normalize the size, and then you feed it to a couple of GNN blocks, just like the one I described.
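The nearest-neighbor linking described above could be sketched as follows. This is my own reconstruction under simple assumptions (words reduced to the centers of their bounding boxes, y growing downward as in image coordinates, and sectors split along the diagonals); the names Word, Direction, and page_to_graph are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

class Direction(Enum):
    LEFT = 0
    RIGHT = 1
    ABOVE = 2
    BELOW = 3

@dataclass
class Word:
    text: str
    x: float  # center of the bounding box
    y: float  # y grows downward, as in image coordinates

def nearest_neighbor(words: List[Word], i: int,
                     direction: Direction) -> Optional[int]:
    """Index of the word nearest to words[i] in the given sector, or None."""
    w = words[i]
    best, best_dist = None, float("inf")
    for j, v in enumerate(words):
        if j == i:
            continue
        dx, dy = v.x - w.x, v.y - w.y
        in_sector = {
            Direction.LEFT:  dx < 0 and abs(dy) <= abs(dx),
            Direction.RIGHT: dx > 0 and abs(dy) <= abs(dx),
            Direction.ABOVE: dy < 0 and abs(dx) <= abs(dy),
            Direction.BELOW: dy > 0 and abs(dx) <= abs(dy),
        }[direction]
        dist = dx * dx + dy * dy
        if in_sector and dist < best_dist:
            best, best_dist = j, dist
    return best

def page_to_graph(words: List[Word]):
    """Adjacency list and edge labels linking each word to its nearest
    neighbor in each of the four directions."""
    adjacency: List[Tuple[int, int]] = []
    edge_types: List[Direction] = []
    for i in range(len(words)):
        for direction in Direction:
            j = nearest_neighbor(words, i, direction)
            if j is not None:
                adjacency.append((j, i))  # message flows neighbor -> word
                edge_types.append(direction)
    return adjacency, edge_types
```

Words on the same line end up connected by left/right edges and words in the same column by above/below edges, which is exactly the table structure the GNN can then exploit.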
To that you maybe add some dropout, some batch normalization, some skip connections, stuff that is meant to help with convergence and training time; and yeah, maybe a skip connection to help the character CNN converge. Then at the end you add a feedforward block and a softmax, and out of this come probabilities for each word to belong to each of the categories, each of the classes into which you want to classify. So that's it; this is pretty much the model we used in this project. Out of about a thousand training documents, we got a model with an 88% F1 score, which is very good. It's better than the previous system our client had in place, where they had to manually handle each new table layout that appeared. It also worked better than a prototype we had developed earlier, which was based on random forests. So that was a pretty big success for us. And that's it. Just to finish up, a couple of words about implementations. There are a few different libraries available out there, which you will easily find by googling if you want to read up on this topic. Maybe some of the more popular places will be a bit disappointing; the Wikipedia page for graph neural networks has only existed since early this month. But one suggestion I can give you is to just go to the documentation of any of these packages. I particularly like the PyTorch one: it lists the papers for which they provide implementations, so going to those papers is a good way to delve into the topic. So with that, I'd like to thank you. All right, thank you; that was one of a kind, and I can't think of a better talk to wrap up the day. We have a question, Augusto, and I'll read it out for you: how would you deal with tables in documents with multiple columns, like scientific papers? For example, the table is in the first column while there is regular text in the second column.
Yeah, that's a very good question. In some way, the model can learn this, right? Maybe if it sees enough documents with two columns, it's going to learn that yes, this can happen. But since we know that this is a common case, we could try to help it, try to add some features to help the model learn that at some point it's going to see a division, a two-column division. Maybe we find a way to not link words from one column to the other column, because we know that those words are not related, right? Mm-hmm. Yeah. So there's always this interplay between just letting the model learn, or helping it with a little extra effort to feed the data in a more suitable way. Makes sense. Right, thanks for answering the question. I don't see any other questions in the chat, so if anybody wants to ask questions or interact with Augusto, please feel free to hop into the Parrot breakout room, and Augusto will be there to answer all your questions. Thanks once again, Augusto, for the really interesting talk.