about recent work she's done on deep learning for genomics and graph structured data. The floor is yours.

So thanks for the introduction. As you mentioned, I'm going to cover some work that we did in the last year and a half or so on deep learning for genomics and graph structured data. Before I introduce the outline of the presentation, I would like to recall that, as you know, deep learning has proven to be very effective in many application domains, such as computer vision, speech, or natural language processing. For many of those domains, there are models that have proven to be especially successful and that have become well established. For example, in computer vision, convolutional neural networks are the de facto standard, with all of their extensions to tackle specific problems, and the same happens for sequential data, where recurrent neural networks have also proven to be successful and have been widely used. However, when it comes to the biomedical domain, we find a very rich variety of different kinds of data, including text-based data, but also image-based data, genomics, transcriptomics, or even graph structured data. And for these kinds of data, we still do not have such well established models.

So in this presentation, I'm going to cover two of these kinds of data. The first part of the presentation will be devoted to genomic data. I'll start with the motivation: what the data that we use looks like and what challenges it poses. I'll then present the framework that we propose to address those challenges and show some results on the 1000 Genomes dataset. The other half of the presentation will be on a completely different topic. I will switch gears and cover graph structured data. For this particular case, again, I'll start with the motivation and the problem formulation, give a very brief overview of some prior work, and introduce the graph attention network, which is the framework that we proposed. And I will share with you some results on two different application domains, namely protein-protein interaction networks and brain connectomes.

So let's start with the genomics part. This work was done while I was a postdoc at the Montreal Institute for Learning Algorithms, and it was a collaboration with many researchers there and also researchers at the Institut de Cardiologie de Montréal. This work was presented at ICLR last year. To give you an overview of the kind of data that we were dealing with: we're dealing with SNP data. Modern genotyping techniques usually target millions of variants across the genome, and as a consequence we end up having, for each one of the participants, a very high-dimensional feature vector including all of the SNPs. However, even for very large studies, the number of participants is still limited. So we end up with a dataset where the number of samples can be orders of magnitude lower than the number of features, in this case the number of SNPs per participant. And given this stark imbalance between the number of samples and the number of input features, several challenges appear.
So first of all, if we think of a very simple neural network where we connect all the inputs to one single output neuron, and we have very high-dimensional input vectors, as would be the case for SNP data, where the feature vector could be on the order of hundreds of thousands of dimensions or even larger, then even in this very simple setting we have as many parameters as input features. The same happens for deep neural networks: in the first layer of the network, the layer that is connected to the input features, the number of parameters grows linearly with the number of input features. So we have a parameter explosion. And when this happens, if we do not have enough samples, we are at a higher risk of falling into the overfitting regime. Moreover, because we are in a very high-dimensional space and we have very few points, it's going to be very hard to cover the whole space, and we also run into the curse of dimensionality.

So given that, we decided to propose a new framework that reparameterizes a neural network in order to reduce the number of free parameters that we learn. This framework is specifically designed for cases where the number of samples can be orders of magnitude lower than the number of features. Let me give you a bit more detail on what this framework looks like. Again, remember that we are in a setting where N, the number of samples, is much lower than the number of features F. Given one sample, which is very high-dimensional with F features, we can feed it to a neural network to obtain some prediction, and all of these layers would be, for example, fully connected layers. Optionally, we can also have another branch of the network that tries to reconstruct the input. In this kind of setting, we have the parameter explosion that I was mentioning before in two layers: one is the first layer of the network, which is connected to the input, and if we have the reconstruction branch, we also have a parameter explosion at the last layer, which tries to reconstruct the input.

So what we do in order to avoid this parameter explosion is tie these parameters to an auxiliary network. We define an auxiliary network that, instead of taking as input one sample, that is, one participant with all of the features corresponding to that participant, takes one feature: all the values of one particular SNP across the whole training population. If you remember the dataset matrix that I showed you in the beginning, with the number of samples times the number of features, it would be like transposing this matrix and using the transposed matrix as input to the auxiliary network, so we have many more examples. Once we take one feature and feed it to this auxiliary network, we can either handcraft an embedding or learn one. The motivation is that two features, or two SNPs, that are similar in some sense should end up with similar embeddings, and these embeddings are then used to produce the weights. So what we do, basically, is have this auxiliary network predict the weights of the layers containing most of the parameters in the predictor network. The parameters of the predictor network are thus not free anymore; they are tied to the auxiliary network.
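To make the weight-tying concrete, here is a minimal sketch of the idea in PyTorch. The class and variable names are hypothetical and the sizes follow the example given in the talk, so this is an illustration of the reparameterization rather than the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (following the example in the talk).
N_FEATURES = 300_000   # number of SNPs (F)
EMBED_DIM  = 500       # per-feature embedding fed to the auxiliary network
HIDDEN_DIM = 100       # size of the predictor's first hidden layer
N_CLASSES  = 26        # e.g. populations in 1000 Genomes

class AuxiliaryNet(nn.Module):
    """Maps one embedding per feature (F x EMBED_DIM) to the F x HIDDEN_DIM
    weight matrix of the predictor's first layer."""
    def __init__(self):
        super().__init__()
        # Only 500 * 100 + 100 free parameters instead of 300K * 100.
        self.fc = nn.Linear(EMBED_DIM, HIDDEN_DIM)

    def forward(self, feature_embeddings):      # (F, EMBED_DIM)
        return self.fc(feature_embeddings)       # (F, HIDDEN_DIM)

class DietStylePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.aux = AuxiliaryNet()
        self.rest = nn.Sequential(nn.ReLU(), nn.Linear(HIDDEN_DIM, N_CLASSES))

    def forward(self, x, feature_embeddings):    # x: (batch, F)
        W1 = self.aux(feature_embeddings)         # tied weights, not free parameters
        h = x @ W1                                # replaces a free F x HIDDEN_DIM matrix
        return self.rest(h)
```

A second auxiliary network predicting a HIDDEN_DIM x F matrix could tie the reconstruction layer in the same way, as described next.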
And how does this affect the number of free parameters of the network? To give you an example, if we have 300K SNPs and we project them into a lower-dimensional representation of 100 dimensions, the first layer of the predictor network has 30 million parameters. But if, instead of having these free parameters, we tie them to the auxiliary network, the number of free parameters in the auxiliary network is much smaller: if we use an embedding of 500 dimensions that we project into the 100-dimensional space, we reduce the number of free parameters to 50K. And just to show you that the connection between the two networks matches in terms of matrix sizes: what we have in the first layer of the predictor network is a weight matrix of size F, the number of features, times 100, following the same example as before, and what we have at the output of the auxiliary network is also F, the number of features that we pass through the auxiliary network, times 100. And as we did for the first layer of the network, we can have a second auxiliary network that predicts and ties the parameters of the reconstruction layer, if we have one.

So now that you have a brief overview of the Diet Networks framework, let me go over some of the results that we had. We tested the model on the 1000 Genomes project, a large-scale study of DNA sequences based on the presence of genetic variations. The dataset represents 26 populations, which are then grouped into five larger geographical areas. In total, we have about 3.4K individuals, from which we extracted about 300K SNPs. SNPs were encoded as having 0, 1 or 2 copies of the genetic variation. So here are some of the results. In this plot, on the X axis I show the number of free parameters of the model on a log scale, and on the Y axis I show the misclassification error in percentage. The MLP is the predictor network without the auxiliary networks, and Diet Networks is the whole framework. We can see that as we significantly reduce the number of free parameters of the network, we can also improve the results. We also compared these frameworks to a more traditional approach used in the genomics community, that is, extracting features via PCA and then learning some kind of classifier on top of them. We tried different numbers of PCs; I'm just showing here the results for 50 PCs and 200 PCs, since 50 is the standard number used for this kind of dataset and 200 gave the best result that we achieved. So we can see that, compared to more traditional approaches, deep learning models can do better. To show you the results in a bit more detail, these are the confusion matrices on the dataset: the confusion matrices over the 26 populations, but also over the five larger geographical regions, for a network that was trained to classify the 26 populations but was not given any information on the larger geographical areas. As we can see here, the network does a pretty good job. It still shows some confusion between populations belonging to the same larger geographical area or between populations that might share some ancestry, and when it comes to the larger geographical areas, it does a much better job. So now that we have seen the results, one might ask: can we try to assess what the network is learning?
And to do that, we ran some experiments and plotted the features learned by the network at different points. I will show you the raw input features and compare them to some features closer to the input of the network and some features closer to the last layer of the network. So those are some t-SNE plots. This would be the t-SNE embedding over the 26 populations for the raw data points, and as we can see here, everything pretty much overlaps. If we go a bit higher in the network, we can see that small clusters start appearing, and these small clusters roughly correspond to the 26 populations that we have in the dataset. As we go up in the hierarchy, we see that these clusters start gathering together. If, instead of showing you the clusters with the 26 populations, I show you the same plots with the five larger geographical regions, what we see closer to the last layer of the network is that these bigger clusters gather together populations that belong to the same larger geographical areas.

So to wrap up on this genomics part: we introduced the Diet Networks framework, which is a neural network reparameterization that allows us to reduce the number of free parameters of the predictor network and is specifically designed to handle problems where we have very few samples compared to the high dimensionality of the data. And with the experiments that we did, we showed the potential of deep learning models to tackle genomics-specific tasks.

So now that I've finished with the genomics part, let me completely switch gears and go to graph structured data. In this second part, I'll cover two works that were published this year. One is Graph Attention Networks, the framework, which was published at ICLR this year, and the other is the extension and application of Graph Attention Networks to the parcellation of cortical meshes, which was presented at MIDL about a month ago. This work is also in collaboration with many researchers at the Montreal Institute for Learning Algorithms, the University of Cambridge, the Computer Vision Center, and the MNI here.

So, some motivation about why we work on graph structured data. Graph structured data can be found everywhere, including in the biomedical domain. We can think of molecules represented as graphs, compounds, brain connectomes, but we can also think of protein-protein interaction networks, gene interaction networks, or, more broadly, networks used in epidemiology, for example. So hopefully this is motivation enough to work on graph structured data and see what kinds of deep learning models are well suited for this kind of data.

The problem that we are tackling here is the node classification problem: we're trying to assign one label to each node of our graph. The data is organized as follows. As input, we have a matrix of node features, and we also have an adjacency matrix, which tells us the connectivity of the graph. As output, what we expect is a matrix of node class probabilities. For the rest of the presentation, I will assume for simplicity that edges are undirected and unweighted; however, most of the work that I will cover is readily applicable to a broader set of graphs. And I will focus on the inductive learning problem, which is the setting where training does not have access to all the nodes up front.
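To make the input and output of the task concrete, here is a small sketch of the shapes involved; the sizes are made up for illustration.

```python
import numpy as np

# Node classification setup (illustrative sizes, not from the talk).
N, F, C = 1000, 50, 5                  # nodes, input features per node, classes

X = np.random.rand(N, F)               # input: node feature matrix
A = np.random.rand(N, N) < 0.01        # random connectivity pattern
A = np.triu(A, 1); A = A | A.T         # make the adjacency undirected and unweighted

# A node classification model maps (X, A) to class probabilities of shape (N, C),
# i.e. one probability distribution per node.
```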
So this is the case where we have several disjoint graphs, and we might want to use some of these graphs for training, some for validation, and some for testing. There has been a huge amount of literature on designing models for graph structured data from the deep learning point of view. The simplest model one can think of is a per-node classifier: we would just feed the network with the node features, but we would completely drop the structure of the graph. Given this natural baseline, some work has tried to complement the node features with structural features of the graph, and there have also been attempts to use recurrent neural networks to tackle graph structured data. In this particular case, we are interested in leveraging convolutions for graph structured data.

So a natural question that arises is: why convolutions? Convolutions have a number of desirable properties. First of all, they are independent of the input size. Second, they have a fixed number of parameters, because we share the weights across all input locations. They are localized, so they act on local neighborhoods. They are able to assign different importance to different nodes in a neighborhood. And they are readily applicable to the inductive problems that I was mentioning and that we are interested in. The question now is: how do we apply a convolutional filter to a graph? There are a number of challenges that make this an interesting problem. For example, it is not straightforward to define how to slide the convolutional kernel across all the input positions, that is, across all the positions of the graph. And second, what can happen in graphs is that each node may have a different number of neighbors. So how do we deal with that?

There have been two main lines of research trying to address the problem of applying convolutions to graph structured data, and they are categorized into spectral approaches and non-spectral approaches. To give you a very quick overview of spectral approaches: spectral graph networks operate directly in the spectral domain. This means that they exploit the Laplacian matrix of the graph, which is defined as the degree matrix minus the adjacency matrix, or some normalized version of it. Once they have the Laplacian matrix, they compute its eigendecomposition. With the eigendecomposition, they can transform the graph signal, that is, the features that we have at each node of the graph, into the spectral domain. And once we are in the spectral domain, a convolution becomes a simple multiplication. So the transformation that we do to compute the features at the output of a convolution in a spectral network is: first transform the features into the spectral domain using the eigendecomposition of the Laplacian matrix, then multiply by some convolutional kernel, and then transform the signal back into the spatial domain, the graph domain. Some potential shortcomings of these kinds of approaches are that, first, computing the eigendecomposition of the graph Laplacian can be very expensive, especially for graphs with a lot of nodes, and second, in these approaches the filters are not localized. In order to overcome these shortcomings, ChebNets and graph convolutional networks were introduced.
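To make that sequence of operations concrete, here is a minimal numerical sketch of a spectral graph convolution on a single-channel signal, assuming an unnormalized Laplacian; it is illustrative and not any particular published implementation.

```python
import numpy as np

def spectral_conv(A, x, g_theta):
    """A: (N, N) adjacency, x: (N,) graph signal (one scalar feature per node),
    g_theta: (N,) filter defined directly in the spectral domain."""
    D = np.diag(A.sum(axis=1))      # degree matrix
    L = D - A                       # (unnormalized) graph Laplacian
    lam, U = np.linalg.eigh(L)      # eigendecomposition: L = U diag(lam) U^T
    x_hat = U.T @ x                 # graph Fourier transform of the signal
    y_hat = g_theta * x_hat         # convolution = multiplication in the spectral domain
    return U @ y_hat                # transform back to the graph (spatial) domain
```

The eigendecomposition is the expensive step this sketch makes visible: it costs roughly O(N^3), which is what ChebNets and graph convolutional networks avoid.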
And in this case, rather than computing the Fourier transform of the graph, what they do is use an approximation computed as a Chebyshev polynomial of order k. So this is what the computation of the convolution looks like, where the T_k(L) term corresponds to the Chebyshev polynomial. Some features of these Chebyshev polynomials: they have a recursive formulation, which I'm not showing here but which you can check, and this greatly simplifies the computation. Moreover, they are built in such a way that the filter is a weighted sum of powers of the Laplacian matrix up to some degree k. That's interesting because, as you take powers of the Laplacian, you are basically increasing how far you see: if you take the power two, you will see two hops away, then three hops away, and so on as you increase the power. So that makes the operator trivially localized. And the number of parameters in this case is fixed: we have one parameter W_k for each one of the polynomials. That fixes the number of parameters, but it has another potential problem, which is that the same parameter is applied to all the neighbors at a given distance. We apply the same W_k to all the nodes that are one hop away, the same weight to all the nodes that are two hops away, and so on and so forth. So we are unable to assign different weights to different nodes in the same neighborhood; we are essentially learning isotropic filters.

So that was for the spectral approaches. There have also been attempts to tackle the graph convolution problem in the spatial domain, that is, directly in the graph domain. Those are interesting approaches because they do not assume any particular graph Laplacian up front. But the problem here, as I mentioned before, is how to handle the fact that each node may have a different number of neighbors. To give you a couple of examples of things that have been introduced: there are the molecular fingerprint networks, which address the problem of having different degrees per node by building a weight matrix for each one of the node degrees. That's a good strategy in their particular problem because their node degrees only range from two to five, but clearly it does not scale well if we have a very wide degree distribution. There have also been other attempts, like GraphSAGE, but in that case they keep the computational footprint of the model fixed by sampling a fixed-size neighborhood. Although that keeps the neighborhood size constant, it has the problem that we are inherently dropping some data that might be relevant.

So what we propose, the graph attention networks, are networks operating in the spatial domain, and what we do to be able to handle a different number of neighbors per node is leverage the self-attention operator. This self-attention operator allows us to emulate convolutions on graphs. Here is what self-attention looks like in the particular case of a graph. If we have two nodes, which we can call i and j, each of those nodes has its own features, say h_i and h_j. We will compute some scores: we feed these features to a neural network that outputs a similarity score between the two nodes. And once we have the scores for all the pairs of nodes in the graph, we can apply a softmax nonlinearity to get...
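Since the transcript cuts off here, the following is a minimal sketch of the attention mechanism being described, following the single-head formulation of the graph attention networks paper (a shared linear transform, a single-layer attention network with a LeakyReLU, and a softmax over each node's neighborhood). The dense, loop-based code and the names are illustrative, not the authors' implementation.

```python
import numpy as np

def gat_attention(h, A, W, a, negative_slope=0.2):
    """Attention coefficients for a single head, GAT-style.
    h: (N, F) node features, A: (N, N) adjacency, W: (F, F_out) shared linear
    transform, a: (2 * F_out,) attention vector."""
    z = h @ W                                     # transform every node's features
    N = z.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            s = a @ np.concatenate([z[i], z[j]])          # score from the attention net
            e[i, j] = s if s > 0 else negative_slope * s  # LeakyReLU
    mask = (A + np.eye(N)) > 0                    # keep neighbors plus a self-loop
    e = np.where(mask, e, -np.inf)                # ignore non-neighbors
    e = e - e.max(axis=1, keepdims=True)          # numerical stability before softmax
    exp_e = np.exp(e)
    return exp_e / exp_e.sum(axis=1, keepdims=True)   # normalized coefficients per node
```

The returned coefficients can then be used as the weights of a convolution-like aggregation over each node's neighborhood, which is what lets the operator handle a different number of neighbors per node.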