Now it is exactly 1:30, and time to start the second half of our symposium. So welcome back everyone from the lunch break, at least if you're in Europe, or from another type of break if you're on a different continent. It's an honor to have Michael Bronstein present here today as the first speaker of the afternoon. He is very famous for his work on geometric deep learning and on graph-based learning more generally. Michael holds positions at the University of Lugano and at Imperial College London, and he is also the head of graph learning at Twitter. And he has won an enormously long list of awards; I highlight a few: five ERC grants, two Google Faculty Research Awards, but also in particular the Royal Society Wolfson Research Merit Award, and recently an award by the Royal Academy of Engineering in the form of the Silver Medal in 2020. So he's a highly decorated scientist. We are happy to have you here, Michael, and to learn more about your view of geometric deep learning. Welcome. The floor is yours.

Thank you, Carson, and thank you, everyone, for joining; it's really a privilege to be here. And I feel a little bit of an impostor, because the previous speakers have had some background in biology. I am a computer scientist, so I apologize for that in advance. And I should also apologize for the somewhat arrogant title, maybe, but let me try to disentangle what is actually meant by geometric deep learning. And to do this, let me start by taking you back in history to around the year 300 BC. Basically, since that time and for nearly 2,000 years, the word geometry was used synonymously with Euclidean geometry, simply because no other types of geometry existed. And this Euclidean monopoly came to an end in the 19th century with examples of non-Euclidean geometries constructed by Lobachevsky, Bolyai, Gauss, Riemann, and others. And together with the development of what is called projective geometry, an entire zoo of different geometries emerged. And it happened that towards the end of the century these studies diverged into disparate fields, and mathematicians were even debating which geometry is the geometry and what actually defines a geometry. And interestingly, a way out of this pickle was shown by a young German mathematician called Felix Klein. He was only 23 when he was appointed a full professor at the University of Erlangen in 1872. And he wrote a research prospectus that entered the history of mathematics as the Erlangen program, where he proposed approaching geometry as the study of invariants, or symmetries: basically, the properties that remain unchanged under some class of transformations. And this approach immediately created clarity by showing that different geometries could be defined by an appropriate choice of the symmetry transformations. And he used the language of group theory, a shiny new mathematical instrument also born in the 19th century, to formalize this approach. And the impact of the Erlangen program on geometry, and on mathematics broadly, was very profound. It also spilled into other fields, especially physics. In physics, symmetry considerations allowed deriving conservation laws from first principles; there is this astonishing result known as Noether's theorem.
And after several decades, this fundamental principle, through the notion of gauge invariance (we will discuss it later) in the generalized form developed by Yang and Mills in the 1950s, proved successful in providing a unified framework that describes all the fundamental forces of nature. And this is what we call the Standard Model; it describes all the physics we currently know today. Gravity doesn't yet fit into this picture, but it's nevertheless remarkable. So I can only repeat the words of another Nobel laureate, Philip Anderson, that it is only slightly overstating the case to say that physics is the study of symmetry. Now, at this point, you may wonder what all this has to do with deep learning. Well, at least in my opinion, the current state of affairs in the field of deep learning reminds me a lot of the situation of geometry in the 19th century. On the one hand, in the past decade, deep learning has brought a true revolution in data science, and apologies for this somewhat irreverent picture. It made possible many tasks that previously, maybe a decade ago, were thought to be beyond reach, and in some cases even complete science fiction, such as computer vision, speech recognition, unsupervised language translation, playing intelligent games such as Go, or, more recently, protein folding. But on the other hand, we now have a zoo of different neural network architectures for different kinds of data and very few unifying principles. And as a result, it is difficult to understand how these methods are related, which inevitably, as it happens, leads to the reinvention and rebranding of the same concepts and sometimes even bitter fights over priority. So what we need, I think, is some form of geometric unification in the spirit of Klein's Erlangen program, which I call geometric deep learning. And it has two purposes: first, to provide a common mathematical framework and language to study and derive the most successful neural network architectures; and second, to give a constructive way, a procedure, to incorporate prior knowledge and inductive bias that will potentially allow building future architectures in a principled way. So the term geometric deep learning I actually made up for my ERC grant in 2015, and it became popular after a paper that we published in the IEEE Signal Processing Magazine in 2017. Nowadays it is used almost synonymously with deep learning on graphs. And one of the frequent questions that I get when I talk about graph neural networks is why we call it geometric rather than, for example, topological. So I hope that this talk will clarify why we call it geometric deep learning. So if we look at machine learning, at least in a simple setting, it is essentially a function estimation problem, right? We are given an unknown function, let's say a dog-or-cat classifier, that we observe on a training set, examples of, let's say, dog and cat images. And we try to find a function that fits the training data well and allows predicting on previously unseen test inputs. And what happened over the past decade is that the availability of large, high-quality annotated data sets such as ImageNet coincided with the emergence of the computational power of graphics hardware, GPUs. And this allowed designing a rich class of functions that have the capacity to interpolate such large data sets.
And neural networks appear to be a suitable choice to represent such a broad class of functions, because, as we know, even the simplest choice of architecture, such as the perceptron that I show here, probably the earliest and simplest neural network, gives a dense class of functions if we combine just two layers: what is called universal approximation. So we can approximate any continuous function to any desired accuracy. Now, the setting of this problem in low dimensions is a classical problem in approximation theory that has really been studied to death over the past century or so; we have very precise control of estimation errors and the number of samples. But the situation is entirely different when we go to high dimensions. And we can quickly see that in order to approximate even a simple class of, let's say, Lipschitz-continuous functions, like the example that I show here, where we have a superposition of Gaussian blobs placed in the quadrants of a unit hypercube, the number of samples grows very fast with the dimension; in fact, it grows exponentially fast. And this is the phenomenon that is colloquially known as the curse of dimensionality. And in modern machine learning methods, we need to operate with data that lives in thousands or even millions of dimensions, so the curse of dimensionality is always there, lurking around the corner and making such a naive approach to learning impossible. And perhaps the best way to see it is in computer vision problems like image classification, where even tiny images are very high-dimensional. But if you look at an image, intuitively it has a lot of structure that is completely broken and thrown away if we vectorize the image and provide it as an input to a perceptron. And now, if the image is shifted by just one pixel, we see that the vectorized input will be very different, and the neural network will need to be shown a lot of examples to learn that shifted inputs must be classified as the same thing. Right. And the remedy for this problem in computer vision came actually from works in neuroscience, such as the classical work of Hubel and Wiesel on the visual cortex, which brought them the Nobel Prize in Medicine in 1981. They showed that brain neurons are organized into local receptive fields, which served as an inspiration for a new class of architectures with local shared weights, starting from the Neocognitron of Fukushima and then probably the most famous architecture, convolutional neural networks, the seminal work of Yann LeCun from the 80s. And this concept of weight sharing across the image effectively solves the curse of dimensionality and maintains approximate invariance to object translation. Now, let me show you another example. What you see here is a molecular graph, and you've probably seen it already before today; this is a molecule of caffeine, if you're interested. So the nodes here are atoms and the edges represent chemical bonds. And if we want to apply a neural network to this input, for example to predict some chemical property such as the binding energy to some receptor protein, we can again arrange this input into a vector, but you can see now that any arrangement of the node features will do, because in graphs, unlike images, we don't have any canonical or preferential way of ordering the nodes. And molecules appear to be just one example of data with an irregular, non-Euclidean structure to which we would like to apply deep learning techniques.
So other examples are social networks, which are gigantic graphs, or different interaction networks, or interactomes, in the biological sciences; manifolds and meshes in computer graphics; and actually some models of proteins, for example, that I will show later also rely on this kind of model. So all these are examples of data that wait to be dealt with in a principled way. So let me return to this example of image classification that at first glance appeared to be hopeless because of the curse of dimensionality. Fortunately, we do have additional structure that comes from the geometry of the input signal. And this is something that we call a geometric prior. And as we'll see, it is a very general and powerful principle that gives us hope and optimism in curse-of-dimensionality problems. So in our example, in particular, of image classification, the input image is not just a d-dimensional vector. It's a signal that is defined on some domain omega, which in this case is a two-dimensional grid. So I will denote the signals by x(omega), and I will use the red color to represent signals. Now, the structure of the domain can be described by what is called the symmetry group; in this case, it's the group of two-dimensional translations that act on the points of omega. So I will denote the points by u and the group elements by this lowercase Fraktur g. And now, in the space of signals, the group actions on the underlying domain are manifested through what is called the group representation, which I denote here by rho. So in our example, again, it's a matrix that acts on the d-dimensional vector, and you can think of it as what is called the shift or translation operator. Now, this geometric structure of the domain that underlies our signal affects the function f that we are trying to learn. And we can have functions that are unaffected by the action of the group, what we call invariant functions. A good example, again, is image classification: no matter where the cat is located in the image, we still want to say it's a cat. So this is an example of shift invariance. Now, as another example, we can have a case where the output of the function must transform in the same way as the input. For example, in image segmentation, the output is also an image, a pixel-wise label mask, so we want the output to be transformed in exactly the same way as the input, what we call a group-equivariant function. In this case, again, it's shift equivariance. Another type of geometric prior is what is typically called scale separation. This is the underlying principle of, for example, wavelet decomposition. In some cases, we can construct a multi-scale hierarchy of domains, for example by coarsening the grid in images; let's say I denote it by omega prime. So coarsening requires some extra structure: it needs to combine nearby points on the domain, producing also a hierarchy of signal spaces that are related by an operator that I denote here by P, sometimes called coarse-graining. And on this coarse domain, we can define a coarse-scale function that I denote here by f prime. And we say that our function is locally stable if it can be approximated as the composition of the operator P and the coarse-scale function. So basically, I can downsample everything and then apply my classifier.
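As a minimal numerical sketch of these invariance and equivariance notions (a toy one-dimensional periodic signal and an arbitrary three-tap filter, chosen purely for illustration): a global sum readout is shift invariant, while a circular convolution is shift equivariant.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)                      # a toy signal on a periodic 1-D grid (the domain omega)
shift = lambda v, g: np.roll(v, g)           # group action rho(g): cyclic translation by g

f_inv = lambda v: v.sum()                    # global pooling: an invariant function
def f_eq(v, theta=(1.0, 2.0, 1.0)):
    # circular convolution with a 3-tap filter: a local, weight-sharing operation
    return theta[0] * np.roll(v, 1) + theta[1] * v + theta[2] * np.roll(v, -1)

g = 5
print(np.isclose(f_inv(shift(x, g)), f_inv(x)))           # f(rho(g)x) == f(x): invariance
print(np.allclose(f_eq(shift(x, g)), shift(f_eq(x), g)))  # f(rho(g)x) == rho(g) f(x): equivariance
```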
And you can see that while the original function f might depend on long-range interactions on the domain, in locally stable functions it is possible to separate the interactions across scales, by basically first focusing on localized interactions that are then propagated towards the coarser scales. And this is, again, a fundamental principle that you encounter everywhere in physics. One of the classical algorithms is what is called fast multipole methods, which are used to model n-particle interaction systems. And these two principles, these geometric priors, give us a very general blueprint of geometric deep learning that you can probably recognize in the majority of popular deep neural network architectures. We can apply a sequence of equivariant layers, such as the convolutional layers in CNNs, and possibly an invariant global pooling layer that aggregates everything into a single vector, features of the entire image, for example. In some situations, if we also have the possibility to create a hierarchy of domains by using some coarsening procedure, we can, for example, do what is typically max pooling in CNNs. And I hope that you can recognize all these components in the different architectures that are your favorites in your applications. And this is a very general design. It can be applied to different types of geometric structures, for example grids, global transformation groups, what are called homogeneous spaces, graphs, and manifolds. This is what we call the 4G of geometric deep learning. And the implementation of these principles in the form of inductive biases leads to some of the most popular architectures that exist today in deep learning, such as convolutional neural networks, which, as I will show, emerge from translational symmetry; graph neural networks; Deep Sets; Transformers; and different versions of intrinsic CNNs. So let me start with graphs. And probably each of us has a different mental picture when we hear the word graph. For me, maybe because of my work at Twitter, I first think of a social network, modeling relations and interactions between different people. So mathematically, the users of a social network are modeled as nodes of the graph, and the relations between users are pairs of nodes, or what is called edges. And we can also assume that nodes have some features attached to them, d-dimensional vectors. Now, a key structural characteristic of a graph is that we don't have a canonical way to order the nodes. So when I put some numbers on the nodes, I have already defined some arbitrary ordering of the nodes. So if we arrange the node feature vectors into a matrix of size n by d, where n is the number of nodes and d is the feature dimension, we automatically prescribe some arbitrary ordering. And the same holds also for the adjacency matrix of the graph, right? So if we number the nodes differently, the rows of the feature matrix and the rows and columns of the adjacency matrix will be permuted by some permutation matrix P, which is an element of the permutation group of a set of size n. So we have n factorial such elements; it's actually a very large group. Now, if we want to implement a function on the graph that provides a single output for the entire graph, like in our example of a molecule for which we want to predict its binding energy, we need to make sure that its output is unaffected by the ordering of the input nodes, what we call permutation invariance.
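A small sketch of how this permutation action looks in practice (a toy random graph and an arbitrary sum readout, just for illustration): relabelling the nodes permutes the rows of the feature matrix X and both the rows and columns of the adjacency matrix A, and a readout that sums over nodes is unaffected by it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
X = rng.normal(size=(n, d))                          # node feature matrix, n x d
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                       # a toy undirected adjacency matrix, no self-loops

perm = rng.permutation(n)
P = np.eye(n)[perm]                                  # permutation matrix

X_p, A_p = P @ X, P @ A @ P.T                        # the same graph with relabelled nodes

readout = lambda X, A: (A @ X).sum(axis=0)           # sum over all nodes: a permutation-invariant readout
print(np.allclose(readout(X_p, A_p), readout(X, A))) # True
```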
Now, if we want to make node-wise predictions, for example if I want to classify some of the nodes in the graph, let's say detect some bad users in the social network, I want in this case the output of the function to change in the same way as the input with the reordering of the nodes, what we call permutation equivariance. Now, as we'll see later, the most general form of such functions will be impossible to implement on graphs, simply because the permutation group is too large. But a tractable way of constructing a pretty broad class of such functions is to use the local neighborhood. Basically, we look at the nodes that are connected by an edge to a node i, we take their feature vectors (technically, they form a multiset, because different nodes might have the same feature vector), and we apply some aggregation function to this multiset together with the feature vector of the node itself; I denote it by phi. So, importantly, again, we don't have a canonical way to order the neighbors, so this function by construction must be permutation invariant. And if I apply it now to every node of the graph and stack the results into a feature matrix, I get a function that is permutation equivariant. And it appears that the way this local function phi is constructed is crucial, and its choice determines the expressive power of the resulting architecture. When phi is injective, it can be shown that a neural network designed in this way is equivalent to the Weisfeiler-Lehman graph isomorphism test. It's a classical algorithm in graph theory that tries to determine whether two graphs are isomorphic. Here I should say that some recent works showed this for graph neural networks, but actually these results are much older; I can mention the classical paper of Shervashidze and Borgwardt from at least a decade preceding the modern works on graph neural networks, and probably some of the designs of graph neural networks that we know today, maybe in a different form and a slightly different formulation, go back to works in computational chemistry, probably at least to the 90s or maybe even before that. So let me just remind you what graph isomorphism is. We say that two graphs, represented here by the adjacency matrices A and A prime, are isomorphic if there exists an edge-preserving bijection between them; in other words, we can permute one matrix into the other. And basically what the Weisfeiler-Lehman test tries to do is a kind of iterative color refinement. It starts with all the nodes of the graph having the same color, basically some formal discrete label, and then it applies a local injective function to refine the color. And by virtue of injectivity, this means that neighborhoods with different structure will be mapped to different colors. In this example, we have two types of nodes: nodes with two neighbors and nodes with three neighbors. So they will become green and yellow in this illustration. If I repeat this refinement again, we now have three types of neighborhoods: yellow-yellow, yellow-green-green, and yellow-green. And they will be mapped into violet, red and gray. But if I apply it further, the colors will not change anymore. So at this point, I can produce a histogram of colors. And if I apply the same procedure to another graph and get a different histogram, then I can say for sure that the graphs are not isomorphic. But if the histograms are the same, we actually don't know.
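Below is a minimal sketch of this color-refinement procedure (my own toy implementation on adjacency lists, with the injective hashing replaced by a simple relabelling), together with a pair of non-isomorphic graphs, a 6-cycle and two disjoint triangles, that end up with identical color histograms; this is exactly the caveat discussed next.

```python
# Weisfeiler-Lehman colour refinement: each node's colour is repeatedly refined from
# its own colour and the multiset of its neighbours' colours; the final colour
# histogram is the graph "fingerprint" compared between two graphs.
def wl_colours(adj, iters=3):
    n = len(adj)
    colours = [0] * n                                   # all nodes start with the same colour
    for _ in range(iters):
        signatures = [
            (colours[i], tuple(sorted(colours[j] for j in adj[i])))
            for i in range(n)
        ]
        palette = {sig: c for c, sig in enumerate(sorted(set(signatures)))}
        colours = [palette[sig] for sig in signatures]  # injective relabelling of the signatures
    hist = {}
    for c in colours:
        hist[c] = hist.get(c, 0) + 1
    return hist

# adjacency lists of a 6-cycle vs. two triangles: same histograms, yet not isomorphic
cycle6    = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_colours(cycle6) == wl_colours(triangles))      # True: WL cannot tell them apart
```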
So it's a necessary but not sufficient condition. And in fact, there are examples of graphs that are deemed equivalent by the Weisfeiler-Lehman test but are not isomorphic, like in the example shown here; we actually know that this test cannot count triangles in graphs. And there are many works that try to extend the message-passing schemes that we'll discuss to such higher-order structures, which basically come from computational topology. So let me go back to the way this local aggregation function looks. We have some permutation-invariant aggregation, which I denote by the square operator, such as sum or maximum; a learnable function psi that transforms the neighbor features; and another function phi that updates the features of a node by using the aggregate of the neighbor features. And I'm omitting a lot of nuances on how to design each of these components; this is actually a very active research topic in deep learning on graphs, but fortunately most architectures fall into one of the three following flavors. The first one is the convolutional flavor, and this is how some of the early works on graph neural networks looked; they originated from spectral analysis on graphs. In this setting, we aggregate the neighbor features weighted by some fixed coefficients c_ij that depend only on the structure of the graph. And we'll see why the name convolution is used here, because on grids this scheme boils down to the classical convolution. The second flavor is based on attention, and the aggregation coefficients here depend on the features themselves. There are multiple architectures that fall into this category; probably the most prominent is the graph attention network paper by Petar Veličković. And in the most general flavor, we have a nonlinear function that depends on the feature vectors of both nodes i and j; we can regard it as a message that node j sends to node i. Graph neural networks of this type are called message passing. In chemistry applications, they were first introduced by Justin Gilmer and colleagues; in computer graphics, our paper with Yue Wang and Justin Solomon from MIT proposed essentially the same thing for computer vision and graphics applications. And if you look at a typical graph neural network architecture, you will immediately recognize an instance of our geometric deep learning blueprint with the permutation group as the geometric prior. So we have a sequence of permutation-equivariant layers (typically they are called propagation or diffusion layers in the literature) and possibly a global pooling layer that produces a graph-wise readout. We can also include local pooling layers; some architectures do, and they are obtained by some form of graph coarsening that can also be learnable. So let me now say a few words about some interesting special cases of graph neural networks. The first case is a graph with no edges: basically, this is a set. And like the set of nodes in a graph, a set is unordered. So in this case, we can do two things. We can, most straightforwardly, consider each element of the set entirely independently and apply some shared function phi to the feature vectors. This is a permutation-equivariant function over the set, a special setting of a graph neural network. This is what is called Deep Sets in machine learning, or the PointNet architecture in computer graphics, developed in the group of Leonidas Guibas at Stanford.
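A few lines of NumPy sketch this setting (the shared function below is an arbitrary placeholder standing in for a small learnable network): applying it element-wise is permutation equivariant, and following it with a sum over the set gives a permutation-invariant, Deep Sets-style readout.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 4))                       # a set of 7 elements with 4-dim features

psi = lambda X: np.tanh(X * 2.0 + 0.1)            # placeholder for a shared element-wise "network"
equivariant = lambda X: psi(X)                    # applied to every element independently
invariant   = lambda X: psi(X).sum(axis=0)        # followed by a sum over the set: a set-level readout

perm = rng.permutation(len(X))
print(np.allclose(equivariant(X[perm]), equivariant(X)[perm]))  # output is permuted the same way
print(np.allclose(invariant(X[perm]), invariant(X)))            # readout ignores the ordering
```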
Now, as another extreme, instead of assuming that each element of the set acts on its own, we can assume that any two elements can interact. So now we have a complete graph. And in this case, the convolutional flavor obviously makes no sense, because the aggregation is over the entire set of nodes and the second argument of this function becomes the same for all nodes. So we need to upgrade to the attentional mechanism, and in this case we can interpret the attention as some form of learnable soft adjacency matrix. So I hope you can recognize the famous Transformer architecture that is now very popular in NLP applications; it is also a particular case of a graph neural network. And I should say that Transformers are commonly used on sequential data, where we do have an order of the nodes. So this node order information is typically provided in the form of what is called positional encoding; it's an extra feature that uniquely identifies the node. And similar approaches exist also for general graphs; there are many ways you can encode the position or the structure of the nodes. The example I show here is from a recent paper with my students where we show that we can count small graph substructures, such as triangles and cliques, and provide them as a kind of structural encoding that allows the message-passing mechanism to adapt to different neighborhood structures. This is an architecture we call the Graph Substructure Network. We can show that it is strictly more powerful than the WL test with a proper choice of the substructures. And interestingly, you can this way incorporate problem-specific inductive bias. If we go back to this example of molecular graphs: in organic molecules, for example, cycles are prominent structures; you have things like aromatic rings. And again, if you look at the caffeine molecule, it has rings of length six and rings of length five. And what we observe in experiments is that our ability to predict chemical properties of molecules improves dramatically if we provide counts of rings of size five or more. And again, this is because it is a very meaningful inductive bias in these applications. So you can see that even in cases when the graph is not given as input, graph neural networks still make sense. And even when the graph is given, you don't necessarily need to stick to it or consider it some sacrosanct structure on which to do the message passing. In fact, a lot of recent approaches decouple the computational graph from the input graph. And there are multiple ways you can do it: either in the form of sampling, usually to address issues like scalability, such as the famous GraphSAGE paper; rewiring the graph to remove noise, like the recent paper from Stephan Günnemann; or using larger multi-hop filters, where the aggregation is performed on neighbors that are multiple hops away. Now, you can also learn the graph on which to run the graph neural network, so that it is optimized for the downstream task. I call this setting latent graph learning; I think Matthias talked about it earlier today, and there are multiple works in this direction. And we can make this construction of the graph differentiable and backpropagate through it. And this graph can also be updated between different layers of the neural network. And our first work that implemented this architecture, this latent graph learning, was what we call dynamic graph CNNs. And we first applied it to computer vision and graphics applications, working with 3D point clouds.
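The core mechanic can be sketched in a few lines (toy NumPy on a random point cloud; the edge function and max aggregation below are simplified stand-ins for the learnable layers of the actual architecture): a k-nearest-neighbor graph is rebuilt from the current features before each layer, so deeper layers connect points that are close in feature space rather than in the input space.

```python
import numpy as np

def knn_graph(X, k):
    # pairwise squared distances in the current feature space
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]                  # indices of the k nearest neighbours

def edge_conv(X, idx):
    # aggregate (here: max over neighbours) of a simple edge feature x_j - x_i
    return np.max(X[idx] - X[:, None, :], axis=1)

X = np.random.default_rng(3).normal(size=(100, 3))        # a toy 3-D point cloud
for _ in range(3):                                         # a few "layers"
    idx = knn_graph(X, k=8)                                # graph rebuilt from the current features
    X = X + edge_conv(X, idx)                              # residual feature update
```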
And I think point cloud segmentation is actually a good example; it shows nicely why it makes sense to have a dynamic graph. At first, we use the graph to represent the local structure of the object, so we have a flow of information between nearby points in the point cloud, a kind of crude representation of the geometry; it's better than applying the same function at every point independently, because we have some local structure here. But as we go deeper into the network, our features capture increasingly more semantic information, such as telling apart, for example, the two engines or the two wings of the airplane. So the graph has to be adapted to allow connecting semantically similar points. And maybe, in historical perspective, I think this latent graph learning idea can be related to methods that were called manifold learning or nonlinear dimensionality reduction. And the key premise of manifold learning was that even though our data is high-dimensional, it has low intrinsic dimensionality. And usually the metaphor that was used for this concept is this Swiss roll surface. We can think of our data points as sampled from some low-dimensional manifold with very high codimension. And the structure of this manifold was typically captured by a nearest-neighbor graph, which was then embedded, by preserving some graph structure such as geodesic distances, into a low-dimensional space in which it is easier to do machine learning, such as clustering. And the reason why these methods never really worked beyond data visualization is that all these three steps are separate. And it is obvious that, for example, the way you construct the graph, or even how you design the feature space, has a huge influence on the downstream task. So now, with latent graph learning, you can bring new life to these algorithms; maybe I arrogantly call it manifold learning 2.0. We now have a way to build an end-to-end pipeline in which we build both the graph and the filters that operate on this graph, basically a graph neural network with a latent graph structure. And we recently used a new version of this latent graph learning, which we call the differentiable graph module, or DGM, for an automated diagnosis application; that was our MICCAI paper last year with the group of Nassir Navab from TUM. And we show that this method consistently outperforms GNNs with handcrafted graphs. So here the graph is built in an optimal way for the downstream task, which was automated diagnosis. So let me now move to another type of geometric structure that we are all familiar with, and these are grids. Grids are particular cases of graphs. What I show here, for example, is what is called a ring graph; it's a grid with periodic boundary conditions. And compared to general graphs, the first thing that you notice is that a grid has a fixed neighborhood structure: here we have exactly two neighbors, the green and the blue. Okay. And not only that, the order of the neighbors is fixed. I remind you that before, on a general graph, we had to resort to a permutation-invariant aggregation function phi because we had a multiset of neighbors that was unordered. Now we have a sequential order of the nodes: we can always put, for example, first the green and then the blue. So if, for example, we choose a linear aggregation with a sum, we no longer need a permutation-invariant aggregation; we get a convolution, right?
And in fact, if we write it as a matrix-vector multiplication, we get a matrix with a very special structure that is called a circulant matrix. So it is the mathematical model of circular convolution. And you see that the circulant matrix is formed by shifted copies of a single vector of parameters that I denote here by theta. So these are exactly the learnable shared parameters in the CNN layer that I show here. Okay. So one thing that you need to know about circulant matrices is that, unlike general matrices, they commute under multiplication: AB equals BA. And in particular, they commute with a special circulant matrix that cyclically shifts the elements of a vector by one position, the shift operator. Okay. So circulant matrices commute with the shift. And this is just another way of saying that convolution is a shift-equivariant operation. I should say that many signal processing references call this shift invariance, but the correct mathematical term is shift equivariance. Okay. Now, this statement also works in the other direction: not only does every circulant matrix commute with the shift, but also every matrix that commutes with the shift is circulant. So what we get is that convolution is the only linear operation that is shift-equivariant. And I hope you can see here the power of this geometric approach: basically, convolution automatically emerges from translational symmetry. Now, I don't know about you, but when I studied signal processing, nobody explained where the convolution comes from; it was given as a formula that basically somehow comes out of the blue. And unfortunately, this is also the case with many deep learning practitioners: sometimes you tend to apply these methods as a black box without really understanding their origins. Now, let me show you another nice thing. We also know from linear algebra that commuting matrices are jointly diagonalizable: basically, there exists a common basis in which all convolutions amount to point-wise products; they become diagonal matrices. And because all circulant matrices commute, we can pick one of them for the convenience of analysis; it is convenient, actually, to look at the eigenvectors of the shift. And, surprise, surprise, the eigenvectors of the shift are nothing else but the discrete Fourier basis, the DFT. So all convolutions are diagonalized by the Fourier transform. And you see that even such a basic, fundamental construction as the Fourier transform also comes out of the fundamental principle of translational symmetry. So if you wondered what is so special about the Fourier transform, here you see it. It is actually part of a bigger picture that is called representation theory, but allow me to skip it. And this relation between the convolution and the Fourier transform, what is called the convolution theorem in signal processing, gives us two ways to perform convolution: either by multiplying by a circulant matrix, which corresponds to a sliding window along our signal, or in the Fourier domain, as an element-wise product of the Fourier transforms of the signal and the filter. And on grids, you can do this efficiently because we have redundancy in the Fourier matrix, which gives rise to what are called fast Fourier transform algorithms. I should say that in the graph learning literature, some of the first works, including my own, used the second way to generalize convolutions, using the notion of what is called the graph Fourier transform. So I hope I haven't said anything new so far.
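All of these statements are easy to check numerically; here is a small sketch (an arbitrary toy filter and signal, using SciPy's circulant helper) verifying that circulant matrices commute with the cyclic shift and that circular convolution equals an element-wise product in the Fourier domain.

```python
import numpy as np
from scipy.linalg import circulant

n = 8
rng = np.random.default_rng(4)
theta = rng.normal(size=n)                  # filter coefficients
x = rng.normal(size=n)                      # a signal on a ring graph

C = circulant(theta)                        # circulant matrix built from theta
S = circulant(np.eye(n)[1])                 # the cyclic shift operator (also circulant)

print(np.allclose(C @ S, S @ C))            # circulant matrices commute with the shift
print(np.allclose(C @ (S @ x), S @ (C @ x)))  # i.e. convolution is shift-equivariant

# Convolution theorem: C x equals IDFT( DFT(theta) * DFT(x) )
lhs = C @ x
rhs = np.fft.ifft(np.fft.fft(theta) * np.fft.fft(x)).real
print(np.allclose(lhs, rhs))
```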
So let me now move to a more general case, where our group formalism will be more prominent. As we've seen, we can think of convolution as a kind of pattern-matching, sliding-window operation that slides a filter over the image and every time multiplies it with a different patch of pixels in that image. Now, let me write it a bit more formally. We define a shift operator, which I denote here by T, that shifts the filter, which I denote here by psi, and an inner product that matches the filter to the image x. So if we do it for every shift, we get the convolution, or correlation; the difference here is academic. Actually, what is called convolution in machine learning is usually correlation. Now, notice one special thing here: the translation group is actually identified with the domain itself, because each element of the group, a translation, can be represented by the point on the domain to which we translate our filter. Now, this is not the general case. In general, we will have the filter transformed by some representation of our group, which I denote by rho. So the convolution, or the analogy of convolution, now will have values for every element of the group, this lowercase Fraktur g. And we can easily show that this operation is actually equivariant under the group action; allow me to spare you the technical details. It comes from the fact that the representation of the inverse of a group element is the adjoint, so we can move it under the inner product. And you can also see the reason why we could not do this construction on graphs: basically, the permutation group is too large, it has a super-exponential number of elements, so we would have to compute the output of this filter for every possible permutation, which is intractable. So there is an implicit assumption here that the group is either discrete and small or, if it's continuous, low-dimensional. Now, here's an example of how to do convolution on the sphere. This is an example of a low-dimensional manifold, and it's not some exotic construction; actually, spherical signals are pretty important. They're important, let's say, in astrophysics, where a lot of observational data is naturally represented on the sphere; here I'm showing the cosmic microwave background radiation of the primordial universe, from the period of the Big Bang. Also, in the representation of molecules, rotations play an important role. And our group here is what is called the special orthogonal group, SO(3): rotations preserving orientation. So if I represent every point on the sphere as a unit three-dimensional vector, then the action of the group on these points can be represented as an orthogonal matrix with determinant one, which I denote by R. So the convolution here is defined on SO(3): we get a value of the inner product for every rotation of the filter. Now, this is a case where the group is different from the structure of the domain. The sphere is a two-dimensional manifold; SO(3) is actually a three-dimensional manifold. We can rotate in three ways on the sphere: along the meridians, along the parallels, and around the point itself. So if we were to apply another layer of convolution, we would need to apply it on SO(3). So now the output of the first layer, and the input to the second layer, is defined on a three-dimensional manifold, SO(3); the points on this manifold are rotations themselves, which I denote here by Q.
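As a toy discrete stand-in for this construction (my own sketch, using the four 90-degree rotations C4 in the plane instead of a continuous group, and SciPy's 2-D correlation for the sliding-window inner products), the lifted output has one response map per group element, exactly as described above.

```python
import numpy as np
from scipy.signal import correlate2d

def group_correlation_c4(image, psi):
    # match the filter against the image for every element of the group: here C4,
    # the four rotations by multiples of 90 degrees, combined with all translations
    # via the sliding-window correlation
    return np.stack([
        correlate2d(image, np.rot90(psi, g), mode="same", boundary="wrap")
        for g in range(4)
    ])

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))           # a toy image
psi = rng.normal(size=(3, 3))               # an arbitrary filter

out = group_correlation_c4(image, psi)
print(out.shape)   # (4, 16, 16): one response map per group element (rotation), over all translations
```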
Now, you see that the sphere in this example is a non-Euclidean space, a manifold, but it is still quite special: every point on the sphere can be transformed into another point by an element of the symmetry group of rotations. In geometry, we call such spaces homogeneous. Basically, homogeneous spaces are democratic: every point is equal to any other point; I can map it to any other point. So the key feature of such spaces is a global symmetry structure. And this global symmetry structure obviously doesn't hold for general manifolds. So let me introduce the last concept for today, what physicists call gauge symmetry. Okay, so one thing that we need to note when we apply a sliding window to an image is that it doesn't matter which way we go from one point to another; we'll always arrive at the same result. Now, the situation is dramatically different on a manifold. If I go along the green path or the blue path, you will see that the result will be completely different. In differential geometry, this is called parallel transport, and the result of moving a vector on the manifold is path-dependent. Right? Now, usually it takes me quite a lot of time to explain the basic concepts of differential geometry, so let me try to recap them in two minutes, because I'm limited in time. The crucial thing to start with, the difference between manifolds and Euclidean spaces, is that manifolds are only locally Euclidean. A small neighborhood around a point u is homeomorphic, or topologically equivalent, to a Euclidean space; in this case, it's a two-dimensional space that is called the tangent plane or the tangent space. Tangent vectors, basically vectors living in this space, we can manipulate: we can define an inner product on them, which is called a Riemannian metric. And everything that is defined in terms of the Riemannian metric is what is called intrinsic. So if I deform my surface in a way that doesn't change the metric, it's called an isometry, a metric-preserving transformation; we'll see why it is important in a second. So I can do different things locally in this tangent space. If I want to go back to the manifold, I need to apply what is called an exponential map. Okay, so it's a map that takes a unit step along a geodesic in the direction v at the point u. Okay, so one key thing that you need to understand about tangent vectors, or vectors in general, is that these are abstract geometric entities that exist in their own right. Probably one of the worst crimes against humanity committed in teaching linear algebra is that you are told that vectors are arrows or arrays of coordinates. They are neither arrows nor arrays of coordinates, because in a general vector space you don't have any notion of direction; that comes from an extra structure called an inner product. You don't have any notion of length either; that comes from an extra construction called the norm. In the same way, vectors are abstract entities: they can only be thought of as arrays of numbers if you provide some reference frame. So if I want to represent a vector on a computer, I need to provide some local reference frame, which I denote here by w. With respect to this frame, I can represent my tangent vectors, in this case, as a pair of coordinates. Okay, now, this frame is defined in an arbitrary way. And what we call a gauge is simply a way of choosing this local reference frame at every point on the manifold, in a way that depends smoothly on the position.
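A two-line numerical illustration of this frame ambiguity (an arbitrary vector and rotation angle, nothing more): the same abstract tangent vector gets different coordinate pairs in two local frames related by a planar rotation, which is exactly the gauge transformation discussed next.

```python
import numpy as np

theta = np.deg2rad(40)                              # an arbitrary change of local frame
g = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # an element of SO(2) relating the two frames

v_frame_1 = np.array([1.0, 0.5])                    # coordinates of a tangent vector in the first frame
v_frame_2 = g.T @ v_frame_1                         # coordinates of the same vector in the rotated frame
print(v_frame_1, v_frame_2)                         # different numbers, same abstract vector
```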
So if I were to change the reference frame, let's say from yellow to red, I can do it by applying what is called a gauge transformation. It's a group-valued function on the manifold. In general, the group can be what is called the general linear group, any invertible matrix, but it is convenient to work with orientation-preserving rotations. So again, we have the special orthogonal group, in this case SO(2), rotations in the plane. In differential geometry, it is sometimes called the structure group of the tangent bundle of the manifold. So basically, if we look at two gauges, there always exists an element of this structure group, a two-dimensional rotation, that translates one gauge into the other; it might be different at each point. And here is the main difference from homogeneous spaces: this transformation is not global, it is local. Okay. So if I now want to do convolution on manifolds, I have multiple options. Let's say that we have a scalar function on the manifold, and at each point we can represent it in this local two-dimensional system of coordinates. Basically, through the exponential map I go to the manifold to fetch the value of the function, and I multiply it by a filter psi that is also defined in the plane. So this is a direct analogy of the sliding window that we had before. Now, if I fix the gauge, the story essentially ends here. And this was in fact one of the first works we did for deep learning on manifolds, which we called anisotropic CNNs. Now, the main difficulty here, at least theoretically, is that this approach requires some mechanism to compute canonical gauges on manifolds. In theory, this is not possible; in practice, it is. And we actually had a paper at CVPR last year, called GFrames, for computing stable local frames on point clouds and meshes. There is some ugliness; for example, there will be a few points where this frame is not defined. Actually, there are theoretical results, what is called the Poincaré-Hopf theorem, or the hairy ball theorem, that tell you that, for example, on the sphere you cannot have a non-vanishing smooth vector field. Basically, if you think of a hairy head, you will have a vortex of hairs at some point. Okay, but in practice you can just ignore these things and construct a stable reference frame, and it works reasonably well. Another alternative is to make the filter equivariant to rotations, which in this setting, because we are dealing with scalar functions, means it is unaffected by rotations: basically, it will be invariant to rotations. And as a result, you get radially symmetric, isotropic filters. By isotropic I mean that they are agnostic to direction. And in a sense, this is the situation we had on graphs, because we didn't have any way to canonically order the neighbors. But on manifolds, we see that we do have more structure: instead of an arbitrary reordering of the neighbors, we only have an orientation ambiguity, basically how to rotate our local frame. So isotropic filters do throw away a lot of important information; actually, all the spectral approaches result in isotropic filters. So another way is to use an anisotropic filter: I have a direction-sensitive filter, but I apply it at all possible rotations and then aggregate the results with, for example, angular max pooling. It's a kind of rotating template matching. That was the very first architecture for deep learning on manifolds, which we called geodesic CNNs. And, well, in retrospect, it was probably the simplest, but also the ugliest, thing to do.
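A flat-plane analogy of this rotating-filter idea (my own sketch on a toy image patch standing in for a geodesic patch, with an arbitrary oriented filter): the anisotropic filter is correlated with the patch at several rotations of the frame, and the responses are max-pooled over the angle, removing the orientation ambiguity.

```python
import numpy as np
from scipy.ndimage import rotate, correlate

rng = np.random.default_rng(6)
patch = rng.normal(size=(32, 32))                  # a toy patch standing in for a local geodesic patch
psi = np.zeros((7, 7)); psi[3, :] = 1.0            # an oriented (anisotropic) filter: a horizontal bar

responses = [
    correlate(patch, rotate(psi, angle, reshape=False), mode="wrap")
    for angle in range(0, 360, 30)                 # apply the filter at 12 rotations of the frame
]
out = np.max(np.stack(responses), axis=0)          # angular max pooling over the rotations
print(out.shape)                                   # (32, 32)
```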
And if we want to work with vector-valued functions, we need to take into account that we cannot simply transport a vector from one point to another; I remind you that we are fetching the values from points different from u. So we need some form of parallel transport. And to make a long story short, we can design what are called gauge-equivariant convolution operations; that was a later work from Taco Cohen and Max Welling. And you can also see here, maybe in a more nuanced way, again the comeback of our geometric deep learning blueprint. We have deformation invariance, basically by considering an intrinsic definition of the filters, which is invariant under isometries, metric-preserving deformations of the surface, and also equivariance with respect to the structure group of the tangent bundle of the manifold, basically the change of gauge. And if you wonder why we care about manifolds at all, one of the reasons is that in computer graphics and vision, manifolds, or their discretizations, usually in the form of meshes, are a standard way of modeling 3D objects. And what we gain from our geometric perspective are filters that are defined intrinsically, and as a result they become invariant to inelastic deformations. You can see in the example on the right that the filter that is, so to say, drawn on the surface is unaffected by the deformation. So this brings me to probably the first example of an application that has to do with the topic of today's workshop, dealing with drugs and biomolecules, where we are using manifolds as a way of representing protein molecules. Proteins, as you know, are big biopolymers; you can represent them either as atomic point clouds, or as graphs, or as secondary structures. The molecular surface, we argue, is a good representation when we want to model protein interactions. And the reason is that it abstracts out the internal fold of the protein, which might be irrelevant for dealing with or predicting the way it interacts with some ligands. You can see an example here: when a small molecule binds to this protein, it doesn't see what happens inside; it sees only this transparent molecular surface. So a surface-based representation is a good way of capturing only the relevant structure of the protein. And there are many other reasons why working intrinsically with the manifold is actually a good idea. Typically, you see some pocket-like structures that bind small molecules; also, proteins are non-rigid surfaces: when they bind to something, their conformation might deform, and as a result, working intrinsically at least affords you some level of deformation invariance. So with my collaborators from EPFL, Bruno Correia's protein engineering lab, we developed a geometric deep learning architecture called MaSIF that essentially implements these ideas. It's an intrinsic convolutional neural network that decomposes the protein molecular surface into patches and uses both geometric and chemical features to compute some problem-specific local features, or local descriptors, or local fingerprints. And we used this architecture for predicting possible interface sites for protein interactions, for classifying pockets, basically what kind of ligand the protein binds, and also for doing fast protein-protein interaction search. And this is an example of how it works. In this case, we have PD-L1, a cancer immunotherapy target. We can predict where this protein will bind another protein, and we can try to build, de novo, another protein that will bind to this target.
And, though others can probably explain this better than I can, this is an interesting and promising direction for the design of what are called biological drugs, or biologics; here these are large proteins or peptides that can address otherwise nearly undruggable PPI interfaces, such as, for example, PD-L1, which is often targeted in cancer immunotherapy. So this appeared on the cover of Nature Methods last year, probably one of the first geometric deep learning papers ever to appear in such a journal. And we now already have examples of designs that actually work in practice. What I show here are three designs that were experimentally confirmed to bind to PD-L1. You can see that the proteins we designed are actually completely different: in the structures here, for example, one has a single helix, another has two helices. And here you can also see the crystal structure. We now actually have experimental confirmation showing that the predicted structure of the designed binder and the actual protein that we observe with X-ray crystallography coincide very accurately, with less than one angstrom root-mean-square error. Now, if we look at graphs, they are really ubiquitous: we can describe practically any system of relations and interactions as a graph, from the nano scale, modeling individual molecules, to the micro scale, modeling interactions of different biomolecules or metabolites, and so on to the macro scale, where we can model, for example, patient networks. And geometric deep learning is probably most promising in these applications in the biological sciences, in drug design or drug repositioning. And you know better than me that bringing a new drug to market is a very expensive and very long business: it takes more than a decade and costs more than a billion dollars. And one of the reasons is that the cost of testing at the different screening stages is very high, and most drug candidates fail at some of these stages. So another interesting application where graph neural networks are being used is what is called virtual screening. If you look at the space of possible drug-like molecules that can in principle be chemically synthesized, it is very large: a combinatorially complex space of something like 10 to the power of 16 possible compounds. On the other hand, we can test maybe a few hundred or a few thousand compounds in the clinic. So this huge gap has to be bridged computationally. And in the past couple of years, graph neural networks have really excelled at providing accuracy similar to traditional methods such as DFT while being orders of magnitude faster. And in fact, these approaches were used by the group of Jim Collins at MIT; they had a paper in Cell last year where they used GNNs to predict the antibiotic activity of different molecules, and they found a new powerful antibiotic compound they called halicin, which actually originated, I think, as a candidate anti-diabetic drug. Now, going to maybe a higher level of abstraction, another promising direction is what is called drug repositioning, or combination therapy, where you use existing safe drugs either for different targets or in combination, in the hope of finding some synergistic effect. And graph neural networks are also promising here. What I show here is from the work of Marinka Zitnik, which tries to predict side effects of drug combinations, pairwise combinations, using protein-protein interaction (PPI) graphs.
And I'm involved in a big collaboration with Mila and the Gates Foundation where we try to find synergistic drug combinations against COVID-19. So I think I'm out of time; let me conclude. We started with this somewhat irreverent desire to imitate the Erlangen program in machine learning, trying to derive different deep learning architectures from basic principles of invariance and symmetry. And this took us all the way from image classification to drug design. And the approaches we've seen today are actually instances of this common blueprint of geometric deep learning, where the architecture, or the inductive bias, emerges from assumptions on the domain that underlies the data and on its symmetry group, whether it's grids with translations, graphs with permutations, or manifolds with isometries or gauge transformations. And I hope I convinced you that geometric deep learning is really a unifying framework for deep architectures that allows relating different methods through common principles. And these methods have exploded in the past few years, especially as regards graph neural networks. There are already multiple success stories, especially in industry, and some state-of-the-art results on many tough problems, particularly in biology. And as to the promise of these methods, I think it's quite indicative, in my opinion, that last year two major biological journals featured geometric deep learning papers on their covers: one was the MIT paper on antibiotic discovery, and another was our paper on proteins. So I hope that in this and the next few years, these methods become mainstream, first-class citizens in the ML community, and possibly lead to some new exciting results in fundamental science. So last but not least, let me acknowledge all my amazing collaborators in these and other projects. And thank you very much for listening.

Thank you, Michael, for this excellent talk. It was a pleasure to watch this talk and to listen to you on this exciting field of geometric deep learning. Are there any questions? We still have some time for questions. There are two questions in the Slido channel, in fact, which I will read out. Can you recommend a review or textbook chapter that gives an introduction to geometric deep learning, from definitions over theorems and proofs up to applications?

Well, excellent question. So there are several books on deep learning on graphs; Will Hamilton recently wrote a book. I should confess that basically this talk is a kind of trailer for a book that I'm currently writing with Joan Bruna, Taco Cohen and Petar Veličković, so that's a collaboration with these colleagues. I don't think that there is anything really new here; I think it's just a nice perspective. So if we manage to present it pedagogically enough, it will probably also survive the particular fashion of the day in deep learning, and it can probably transcend the particular methods. I think principles, as Helvétius put it: the knowledge of certain principles compensates for the lack of knowledge of certain facts. So that's our aspiration; I hope that we'll be able to live up to that. I read a joke yesterday that there are so many reviews on graph convolutional networks now that you need a review of the reviews.

Next, let's open the discussion. Fokker, please.

Okay, thanks. Michael, great talk. Very inspiring. So I want to ask you about human perception.
There seem to be a lot of invariances, but it's not completely invariant, I would think. Is it because humans are not good at group theory, or are there some biological or otherwise motivated reasons behind that?

Right, so this is a fantastic question, and this is very important. Another geometric prior, or geometric construction, that I didn't mention is what we call geometric stability. So indeed, as you're saying, for example even in images, you very seldom have really a translation operation, right? If you think of a video, let's say two objects are moving in different directions, then only locally can they be described by a translation group. But you can still quantify it: you can say that you have some deformation, some transformation of the domain, that is close enough to a translation. So you need some metric to define it, and it can be defined in many different ways; you can think of maybe a smooth deformation field for which you can measure, let's say, the Dirichlet energy. And what we can show is that if it is close enough to an element of the group, then we get approximate invariance or equivariance. And actually, if you think of wavelets, that's exactly how wavelets were born. They address where the Fourier transform falls short: it is invariant to shifts, but not to approximate shifts or deformations; wavelets are. And Joan Bruna had in 2012 a paper with Stéphane Mallat on scattering networks, or scattering transforms, where they actually showed this property and related it to convolutional neural networks. So one of the things, for example: why do we use max pooling in convolutional neural networks? It has to do exactly with these local deformations. Convolutional neural networks are actually not only shift-equivariant; the use of pooling has much more powerful implications, making them at least approximately invariant to these local deformations.

Yeah, I would agree. I mean, I think invariances are quite, or maybe, I'm not sure, extremely important, but they are also not perfect. I think in signal processing, if you design the optimal rotation- and shift-invariant filter, whatever filter, I think it becomes pretty trivial and not necessarily very useful, at least that's how I remember it. So I think there is a tension, that would be my statement, between invariances and exceptions from them as well.

So here we try to design it in an equivariant way, so we don't lose this information, but it is accounted for properly in the subsequent layers. Of course, one of the requirements, in order for it to be tractable, is that the dimensionality of this group is small if it's a continuous transformation, a Lie group. I should also say that, of course, this geometric stability also exists for graphs. There are several works, Joan's that I mentioned, and we have a paper as well, showing basically, for example, that spectral filters are insensitive to perturbations of the underlying graph. So there are many flavors of these results. And this is important because, of course, ideal invariance doesn't exist. If you can show stability, then you can build into your architecture some invariance that is meaningful, and everything that deviates from this invariance, which is what real life looks like, only that should be learned. And this gives you already an architecture that has certain built-in properties that are meaningful.
Of course, it depends on the problem: whether, for example, you want to have rotation invariance in images. But if you do, you had better incorporate this inductive bias.

Okay, thanks a lot. Michael, there's a question again in Slido. Is there a difference between your definition of the Graph Substructure Network, in the form of triangles and cliques, and the graphlets of the previous talk by Nataša Pržulj, the statistics of subgraph counts?

It is very much related. Indeed, we actually cite some of Nataša's papers as an inspiration. Graphlets have been known in biological domains forever, I think from the paper of Milo et al. in Science, where they showed that the distribution of these graph motifs, or graphlets, in real-world networks is very different from random graphs. So they do have prominence in certain data sets: for example, in social networks cliques are important, in molecules cycles are important. So we just show a way to incorporate them into message-passing algorithms; again, nothing particularly interesting there. I think the more interesting results are in the theoretical analysis: we show that it creates an expressiveness hierarchy that is outside the traditional Weisfeiler-Lehman hierarchy, the k-WL tests. So we can be strictly more powerful than k-WL. There are, of course, higher-order graph neural networks that are equivalent to k-WL, but they are non-local and have higher computational complexity. With this approach, we have a pre-processing stage that counts the substructures; it can, in the worst case, be of high complexity, on the order of n to the k, but in practice it is actually of low polynomial complexity. Then the message passing itself is linear and local. So basically, you get all the nice properties of message-passing graph neural networks with strictly stronger expressive power.

Thank you. Yeah, so there are no further questions, but I have one final question for you from a bioinformatics point of view. When I look at these papers, these applications of geometric deep learning and graph convolutional networks in the sciences, and the successes that you showed, some of these are really impressive. I notice that in the list of authors, there's always a specialized graph machine learning author. So the technology has not advanced to a point where the bioinformaticians, let alone the biologists, could use it directly, it seems. The applications always seem to require specialized knowledge in the field from a dedicated graph machine learner. So first of all, do you share this impression? And second, how far are we from this no longer being the case?

Well, my hope is obvious; so, I do share it, at least to some extent. What I see is that graph learning experts, people who work on graph learning, of whom there are actually quite a lot now in the machine learning community, are attracted by these applications because they are important and the impact can be tangible and big. It might also be the other way around, that the bioinformaticians are attracted by this shiny new class of architectures; probably it's a kind of confluence. I think we're probably a couple of years away; these tools are already out there in standard packages like PyTorch Geometric or DGL, something that didn't exist a couple of years ago.
There are standard implementations of these methods, so practically anyone can take an off-the-shelf implementation of a graph neural network, adapt it to his or her problem, and get something that serves at least as a reasonable baseline. That is probably a little bit of a rosy view of the field, and there are many subtleties that are important, but one of the reasons, I guess, why I'm advocating these methods so much is so that they become more mainstream. I think they are becoming so; I see this surge of interest, especially in the bioinformatics community, so I think it's a matter of a very short time, which is a bad thing for me, because then I will be useless.

I don't think so, but I thank you very much, on behalf of myself, the entire network, and the YouTube audience, for this exciting talk that you have given and for the question-and-answer session, which was another great highlight of this day.