So it's quarter past one and I welcome you to the afternoon session of the first day of our summer school. It's my pleasure to introduce the first speaker of the afternoon, Chloé-Agathe Azencott, an expert at the interface of machine learning and bioinformatics. Chloé did her PhD in computer science at UC Irvine, then was a postdoctoral research scientist at the Max Planck Institute in Tübingen, and then moved on to Paris, to Mines ParisTech, the Institut Curie and Inserm, where she became an assistant professor in 2018 and is now an associate professor. Chloé has worked a lot at this interface between machine learning and computational biology, and she has received a number of honors for her work. She became an Alexander von Humboldt research fellow during her postdoctoral studies, and she is now even co-president of the Community of Special Interest for Machine Learning in Computational and Systems Biology, one of the big COSIs within the International Society for Computational Biology. She's also very active in Women in Machine Learning and Data Science, where she co-founded the Paris chapter of this society. And she received an ANR young researcher grant in 2019. I'm very happy to have her here; I was also very happy to have her in my lab during her postdoctoral studies. When I read the title of Chloé's talk, I remembered the conversation we had 10 years ago about multiple modalities and how we can do machine learning when we have multiple views on the same data. I must admit, 10 years back, this was a very hypothetical discussion, because we hardly ever had more than one view. But now, 10 years later, this is absolutely a reality. It also tells you a bit about how you can develop algorithms now that might become useful a decade from now, which is maybe an aspect of our field. So I'm very interested to hear your current perspective on this topic, Chloé, and we are very happy to have you here. So thank you, Karsten.
Thank you for the introduction. It's a pleasure to be here, and it's a pleasure to be talking at one of these machine learning and precision medicine summer schools, because I was one of the organizers of the first such summer school we had in 2014. And now, finally, I'm a speaker. So my goal here — this is one of the lectures, of course, not one of the invited talks — is to give you an overview of some machine learning techniques for data integration. Giving you an overview of everything would be completely impossible; there are so many things happening. Actually, bouncing back on Karsten's remark in his introduction, some of the things I'm going to present are a decade old or more, because what I want to do here is not so much talk about the very latest research in this area, but lay the foundations of some major ideas that are still very relevant today. OK, so the first thing is maybe to clarify what we mean — my title had data integration — so what do we mean by data integration, or data fusion, from a machine learning perspective? The idea here is that you have multiple views of the data. Nowadays we often talk about multiple modalities; in bioinformatics you'll sometimes hear multi-omics. The idea is that you have several representations of the same data set. Your learning data set is composed of n samples, and instead of having just one set of n vectors representing your n samples, you have several such sets. Here I'm representing two of those. They don't have to come from the same space. And the idea is that you want to learn from both representations at the same time. That's what used to be called multi-view machine learning; the deep learning community seems to insist on multimodal machine learning, so I'll tend to use both interchangeably. And the assumption here is that those views are complementary.
So they're both bringing information, and there's a benefit to be gained from learning from these two representations at the same time; you're not encoding the same information in two different views. OK, so first a few examples of multi-view learning problems. I've only picked two examples, but I could have given an entire lecture on that. The first one comes from cancer research and multi-omics data. I've picked one paper — you'll see the references at the end — but there are many others based on TCGA data and other cancer data. The question is: can you combine data coming from sequencing, so whole-exome sequencing, gene expression data, this type of thing; data from SNPs, so single-nucleotide polymorphisms; methylation data; protein levels — can you combine all those different omics views of the same samples to identify disease subtypes? In machine learning terms, this means clustering: identifying, among all your samples, which are the samples that share common characteristics. The other example I've picked is a much more recent paper on multimodal prognosis prediction, also in cancer. What I liked about this example is the idea of combining omics data, clinical data, and imaging data. Forget about the details here; I just wanted to put a picture, so I picked it from the paper. You'll find a bunch of work on multi-omics, as on the previous slide, but there's also now more and more work where you incorporate clinical data and whole-slide histopathology imaging data, which I think is very interesting because the nature of these different objects is very different. OK, so those were my two examples to help you figure out what it is I'm talking about with multi-view machine learning. And now, a bit of classification of multi-view machine learning techniques, based on the stage of integration.
I'm not the person who came up with this; it's been around for a while. The idea is that you can do integration of multiple views at different stages. The first stage is early integration. The idea of early integration is that you're merely going to concatenate the different features to obtain a classical single-view problem. So if you have two views, one with P1 features and the second with P2 features, then you're just concatenating your input vectors, and now you have a single view with P1 plus P2 features. The nice thing is that it's really easy to do: if you have software that runs machine learning algorithms, you just input the concatenation of your different views, and here you are. The limitations are numerous. One is that you have to think hard about how you're normalizing your data if you have measurements that are on different scales. What does it mean to give as input to your algorithm numbers that, for some of them, are gene expression levels and, for others, presence or absence of methylation? It's a bit like combining apples and oranges, which makes both learning and interpretability difficult. Another limitation is that, especially if you're talking multi-omics, each of your views is already typically a very high-dimensional vector: gene expression for, I don't know, 20,000 transcripts; presence or absence for a million SNPs; methylation data for however many CpG islands. So if you're using one of those early integration approaches, you're making your curse-of-dimensionality issue even stronger than what we already have in single-view omics data. OK, so the second family of approaches would be late integration, where you learn a different model on each of your views, and then you combine the outputs of those models. That's maybe not very clear in my notation, but here I have two views.
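To make the early-integration recipe concrete, here is a minimal NumPy sketch; the views, feature counts, and the ridge model on top are made-up illustrations, not from the talk. Concatenate the views feature-wise, then run any single-view method on the result:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X1 = rng.normal(size=(n, 5))                        # view 1, P1 = 5 continuous features
X2 = rng.integers(0, 2, size=(n, 8)).astype(float)  # view 2, P2 = 8 binary features

# Early integration: just concatenate the two views feature-wise.
X = np.hstack([X1, X2])                             # shape (n, P1 + P2)

# Any single-view method now applies, e.g. ridge regression in closed form.
y = X1[:, 0] + 0.5 * X2[:, 0] + 0.1 * rng.normal(size=n)
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(13), X.T @ y)
print(X.shape, w.shape)
```

Note how the binary and continuous columns end up in the same design matrix with no normalization — exactly the apples-and-oranges issue mentioned above.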
So I have vectors of dimension P1 and P2. I'm learning one model on the first view, a second model on the second view, and now I have a third function, a third model G, that combines what I've learned with F1 with what I've learned with F2. You could either use a preset function for G — if you're doing a regression problem, you could average your outputs; if you're doing classification, you can use a majority vote — but of course, very often, what you're trying to do if you're employing one of those approaches is to learn this function. So you're going to learn all your models first and then learn how to combine them. This addresses some of the limitations I listed before for early integration. Namely, because you're dealing with each of your views separately to start with, you don't have these problems that I was calling mixing apples and oranges. Also, your curse of dimensionality is limited to the curse of dimensionality on each of your views instead of on the sum of all your views. And again, it's fairly easy to set up if you have a classical machine learning library. One of the limitations of late integration is that if you want to do something a bit smart here in G, you're going back to what would be ensemble learning: you learn different models, then you want to combine them in a smart way. And we know that ensemble learning works better if the models are uncorrelated. Here, of course, I've said the assumption was that your different views were complementary and not encoding the same information, but at the same time, you would expect that they are correlated. And this makes it difficult to benefit from more than one view. My limited experience with this type of approach is that if one of your models is performing really well, it's difficult to add more information and make the whole thing perform better.
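The late-integration pipeline just described — one model per view, then a learned combiner G — can be sketched as a toy stacking example in NumPy; the ridge models and the data are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X1 = rng.normal(size=(n, 4))   # view 1
X2 = rng.normal(size=(n, 6))   # view 2
y = X1[:, 0] + X2[:, 1] + 0.1 * rng.normal(size=n)

def ridge(X, y, lam=1.0):
    # Closed-form ridge regression weights.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Step 1: one model per view (F1 and F2).
w1 = ridge(X1, y)
w2 = ridge(X2, y)
f1, f2 = X1 @ w1, X2 @ w2

# Step 2: learn the combiner G on the per-view predictions (linear stacking).
Z = np.column_stack([f1, f2])
g = ridge(Z, y)
y_hat = Z @ g
print(g.shape)  # two combination weights, one per view-specific model
```

A preset G (e.g. `0.5 * (f1 + f2)`) would replace step 2 entirely.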
OK, so now I'm coming to what I find the most interesting type of approaches, which is what I'm going to be talking about for the remainder of my talk: intermediate integration. Here the idea is that you're really going to jointly learn from the two views at the same time, with a dedicated algorithm that is specific to multi-view problems. So I'm really going to learn, in one go, a function over the union of the different views. From a modeling point of view, it's much more satisfying: you're explicitly modeling the fact that you have multiple data sources, and if there are relationships between your different sources, you can model them — we'll talk about this. But from a practical point of view, this requires coming up with new algorithms, and so this makes it more difficult. OK, so that's what we're going to be talking about today. All right, so there are several families of ideas on how to do intermediate data integration in multi-view machine learning that I want to talk about. The first one is the idea of embedding the data in a common feature space. The idea is that you want to take your different views of the data, map them to a single space in which you have a common representation of the data, and then apply a classical machine learning algorithm to this common representation, which is expected to be something more meaningful than just a concatenation of the two views. OK, so when I'm talking about embedding the data in a common feature space, maybe the first thing that comes to mind is that you want to find a low-dimensional representation of the data that comes from all your different views. Depending on the slides, I'm either writing two views or a capital V views.
But most of what I'm writing with two views can be generalized to an arbitrary number of views. OK, so what I want to do here is learn a representation of my data. I have a certain number of features in each of my input spaces, in each of my views, and I want to learn a representation of my data in a space whose dimensionality is much smaller than the sum of P1 and P2. The first example of this is joint non-negative matrix factorization. You might already know non-negative matrix factorization; it's a classical technique for dimensionality reduction in machine learning. At the bottom of my slide, I'm presenting NMF, non-negative matrix factorization, for a single view. Here I have a data set — we're talking about unsupervised machine learning — so I have a single view, n samples, P dimensions. The idea of NMF is that you're going to find a new representation of the matrix; here this new representation is called W. You still have n rows, but you have only D dimensions, and D is much smaller than P. That's this matrix here. And you're going to find it by decomposing the data matrix as the product of such a matrix W and a matrix H that gives you the correspondence between each of these D dimensions and the P dimensions that you originally had. It's called non-negative because you're imposing the constraint that all the entries in W and H are non-negative. So the formulation of this problem is: if D is set, find W and H so as to minimize the Frobenius norm of the difference between X and WH. What this means is merely that you're going to take the sum over the squares of the element-wise differences between your two matrices: you want WH to be as close as possible to X, element per element. This is a big family of approaches, and NMF has been widely studied, so several algorithms exist that are fairly efficient.
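As a concrete illustration of the NMF objective above, here is a minimal NumPy sketch using the classical Lee–Seung multiplicative updates (one of the standard NMF algorithms); the data and dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 50, 20, 3
X = rng.random((n, p))       # non-negative data matrix: n samples, p features

# Multiplicative updates for: minimize ||X - W H||_F^2 with W >= 0, H >= 0.
W = rng.random((n, d))       # low-dimensional representation, n x d
H = rng.random((d, p))       # correspondence between the d and p dimensions
eps = 1e-9                   # avoids division by zero
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(X - W @ H, "fro")
print(W.shape, H.shape)
```

The element-wise multiplications preserve non-negativity, which is why this simple scheme respects the constraints by construction.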
There are several variants where you can add constraints to this problem, and so it comes in many flavors. One of these flavors is a multi-view approach, where you still want to learn a D-dimensional representation W of your data, but such that W H approximates each of your views. That's what I have here. You could do non-negative matrix factorization on each of your views independently, and then, if you have V views, you would have V representations. But what you're doing here is that you enforce that the low-dimensional representation is the same for all the views. I'm not going to talk more about this because I think Anaïs Baudot may talk a bit about it tomorrow, and if you're curious about examples of applying this technique to multi-omics data, you can check out this review paper here by Laura Cantini and Anaïs Baudot. So that was my first approach to learn a lower-dimensional representation of data from several different views. This is the old-school type of thing to do, matrix factorization. Nowadays, what you do when you're cool is deep learning. You probably know deep learning; you can always see it as learning a representation of the data. You put whatever number of hidden layers between your input and your output, and the last hidden layer can always be interpreted as a new representation of your data. And so there's a bunch of approaches that have been developed for deep multi-view learning where, in essence, you're going to put your different views as inputs to the neural network. But unlike what you would do if you were just doing early integration, for the first layer or the first few layers, you connect each view to its own hidden layers, rather than connecting the entire input layer to the entire next hidden layer.
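Going back to the joint NMF for a moment: a minimal sketch of the shared-W variant, adapting the same multiplicative updates to two views. This is my own toy adaptation for illustration, not the exact algorithm from the review:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, d = 50, 20, 15, 3
X1 = rng.random((n, p1))   # view 1
X2 = rng.random((n, p2))   # view 2

# Joint NMF: minimize ||X1 - W H1||_F^2 + ||X2 - W H2||_F^2
# with a single shared representation W and view-specific H1, H2.
W = rng.random((n, d))
H1 = rng.random((d, p1))
H2 = rng.random((d, p2))
eps = 1e-9
for _ in range(200):
    H1 *= (W.T @ X1) / (W.T @ W @ H1 + eps)
    H2 *= (W.T @ X2) / (W.T @ W @ H2 + eps)
    # W is updated against both views at once — this is what couples them.
    num = X1 @ H1.T + X2 @ H2.T
    den = W @ (H1 @ H1.T + H2 @ H2.T) + eps
    W *= num / den

print(W.shape)  # one shared low-dimensional representation for both views
```

Clustering W's rows then gives sample subtypes informed by both views at once.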
So at this stage, you learn an intermediate representation of each of your views, and here in blue, you have a global intermediate representation. Again, on my drawing, I only have one hidden layer specific to each view and one hidden layer that is common to both views; you could imagine having many more intermediate layers. And you could do that in either a supervised or an unsupervised way: supervised would be a feed-forward architecture, and unsupervised would be an autoencoder, where you want the output to match the input. The idea is that this architecture will start by transforming each of your views so as to make them as amenable as possible to being combined into a last representation common to both views. Typically, when you do this, your intermediate layers have a smaller dimension than your input layers, which is why, similar to NMF, it's also a dimensionality reduction technique. OK. So that was the first point of my first idea, embedding the data in a common feature space. On the other hand, if you like kernels, you might want to map the data to a higher-dimensional space — typically a Hilbert space that's higher-dimensional than your input data — and that is, again, a space in which a common representation of the different views lives. So here, since I'm talking about Hilbert spaces, I'm talking about kernel methods. What I'm talking about here is actually learning a kernel that is going to work on pairs of elements represented by my different views. And I thought I would insert here a little primer on kernel methods; I don't know if everybody listening today is very familiar with kernels.
The idea of kernel methods is to build nonlinear models by mapping your data into a new space that is typically higher-dimensional; in this new space, you're going to learn a linear function that is nonlinear on your input space. I have a small example here, which I like because I don't have to draw a higher-dimensional space, which is difficult when I only have two dimensions on my slides. Here, I have two features, and I have a model — you can imagine that it's a decision boundary separating two classes. This model is nonlinear, but it is actually linear in a different feature space, the one over x1 squared and x2 squared. Kernels allow you to generalize this, in the sense that you can use a mapping much more complicated than what I've used here to go from your input space, which has p dimensions, to a new space that has many more dimensions. And a dot product in the new space is simply a kernel over the initial space: a kernel over the original inputs just means the dot product between the images of x and x prime in the new space. And why do people like working with kernels? They use something called the kernel trick. It tells you that if you have an algorithm in which your inputs x appear only in dot products, then you can replace those dot products by a kernel. It's equivalent to mapping the data to a new feature space through this function phi and replacing the dot products between the images of the points x and x prime through phi with a kernel. So this sounds stupid; it doesn't sound like a trick at all. It means I've replaced writing this with this: the dot product between phi of x and phi of x prime is the same thing as k of x and x prime. It doesn't look much like a trick, but it actually is one when computing k is easier than computing phi. And you have cases in which you can find a k in situations where you don't have an explicit phi at all.
And even if you're not in such a situation — take the example here of the quadratic kernel. The quadratic mapping simply maps your p features to themselves plus the pairwise products between all features: x1 squared, x1 x2, x1 x3, and so on and so forth, until you reach xp squared. So it's quite a large number of features. You have some coefficients to take into account if you want these two expressions to match, but taking the images of x and x prime through this mapping and then applying the dot product is equivalent to taking the dot product in the initial space, adding a constant, and squaring everything. This is what we call the quadratic kernel. And this is a situation where you can imagine that it's easier to keep my p features, compute the dot product, add a constant, and multiply the result by itself, rather than first computing all these new features. So it's not very interesting for quadratic kernels, but it becomes interesting for, you can imagine, a higher-degree polynomial. Kernels are also interesting in computational biology, because there's a bunch of kernels that have been developed for biological data, and I'm just going to mention a few examples — I've put pointers to these if you're interested. You have kernels based not on a vector representation of the data, but directly on sequences, and you can use such kernels directly over sequences, in particular coding sequences. You can build kernels based on networks: whether graph kernels for objects that can themselves be represented as graphs, such as graph kernels for molecules, or, if you have a network in which the nodes are the objects on which you want to build a kernel, things like diffusion kernels that can also give you a kernel directly without first finding a mapping. We also have kernels for SNPs in genome-wide association studies.
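The quadratic-kernel identity just described can be checked numerically: the explicit mapping (a constant, scaled features, squares, and scaled pairwise products — the scaling factors are the "coefficients to take into account" mentioned above) gives exactly the same value as computing the input-space dot product, adding one, and squaring:

```python
import numpy as np

def phi(x):
    # Explicit feature map whose dot product equals the quadratic kernel
    # (x . x' + 1)^2: constant, sqrt(2)*x_i, x_i^2, and sqrt(2)*x_i*x_j for i < j.
    feats = [1.0]
    feats += list(np.sqrt(2.0) * x)
    feats += list(x ** 2)
    p = len(x)
    for i in range(p):
        for j in range(i + 1, p):
            feats.append(np.sqrt(2.0) * x[i] * x[j])
    return np.array(feats)

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)

k_trick = (x @ xp + 1.0) ** 2     # kernel: cheap, stays in the input space
k_explicit = phi(x) @ phi(xp)     # same value via the explicit high-dim mapping
print(np.isclose(k_trick, k_explicit))  # True
```

For p features the explicit space already has 1 + 2p + p(p-1)/2 dimensions, which is why avoiding phi pays off as the degree grows.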
And all those things are possible because kernels can be interpreted as measures of similarity. In order to build a kernel, what you need to do is find some notion of similarity: what does it mean for two objects to be similar — for instance, for two protein sequences to be similar? That's sometimes difficult, but if you have something that, in terms of meaning, can be interpreted as a similarity and, in addition, verifies a number of mathematical properties, then you can use it as a kernel. The reason why a kernel can be interpreted as a measure of similarity is that it's a dot product, and if you think of dot products in Euclidean space, the dot product between two vectors x and x prime is proportional to the cosine of their angle. If the vectors are collinear, the dot product is large — the cosine of the angle is 1 — and if the two vectors are orthogonal, that is, very different, the cosine of the angle is 0. OK, so kernels have a long history of being used in bioinformatics, and I think it makes sense, if you're interested in integrating data in a multi-omics context, to wonder about the kernel approaches that can be used. And the idea here is going to be multiple kernel learning. You have your different omics data, and you know from the literature how to build one kernel for each of your data types. Each of those K's here is actually a matrix of size n by n — n is, again, your number of samples — and each entry of this matrix is the kernel function applied to sample i and sample l: a dot product between sample i and sample l, but in this high-dimensional space. Now, a sum of kernels is a kernel, so you can build a multi-view kernel that is a linear combination of those different kernels.
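The construction at the end — one kernel per view, then a non-negative linear combination — looks like this in code (Gaussian kernels on toy views; the weights mu are fixed by hand here, and the data is illustrative):

```python
import numpy as np

def rbf_kernel(X, gamma=0.1):
    # Gaussian (RBF) kernel matrix: K[i, l] = exp(-gamma * ||x_i - x_l||^2).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
n = 30
K1 = rbf_kernel(rng.normal(size=(n, 5)))   # kernel on view 1
K2 = rbf_kernel(rng.normal(size=(n, 8)))   # kernel on view 2

# A non-negative linear combination of kernels is again a valid kernel.
mu1, mu2 = 0.7, 0.3
K = mu1 * K1 + mu2 * K2

# Sanity check: symmetric and positive semidefinite (eigenvalues >= 0).
eig = np.linalg.eigvalsh(K)
print(K.shape, eig.min() >= -1e-10)
```

The RBF kernel has ones on its diagonal, so this combination is already normalized in the sense discussed next.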
And because you know it's a kernel, you know that there exists a Hilbert space and a mapping from your input data to this Hilbert space such that this kernel is a dot product over this Hilbert space, which means that you can use any kernel-based algorithm with this linear combination of kernels. Now, I haven't said much about these coefficients in the linear combination. One of the things you can do is normalize your kernel matrices so that they all have ones on the diagonal, and just take the sum. But there's something interesting in the case of support vector machines: you can also learn optimal values for these coefficients. Here I've written the dual formulation of support vector machines. If you're not very familiar with SVMs, this is one of the ways of formulating the optimization problem that, when you solve it, gives you the SVM decision function. The decision function is going to be a linear combination of products between the labels of the samples from the data set and the kernels between a new sample — the one you're trying to label — and the samples from the data set. One thing that's important to note is that we know how to solve this problem efficiently and exactly. And one thing we also know from the theory of SVMs — here what you're trying to do is find those coefficients alpha of the linear combination — is that you get the better performance when the value of this function here that you're trying to maximize is smaller. If I'm solving this problem, I have a certain performance; if I have other data points and I'm solving this problem and obtaining a model that has a smaller optimum, then my second model is going to perform better than the first one. Which means that we know how to optimize the kernel while learning, by looking at this objective.
So here, all I've done is replace the kernel here by my multiple kernel, the linear combination of the kernels of the different views. And I know that I want this whole thing to be minimal, so I'm going to look for the values of the coefficients mu of the linear combination that minimize the overall optimum. These are ideas that were proposed quite a while ago now — almost 20 years ago. So maybe to summarize what we're doing here: we start from a kernel on each of our different views, and then we learn a linear combination of those kernels that is optimal at the same time as we're learning an SVM that uses this kernel to make its decisions. Okay, so so far I've presented approaches for intermediate integration in multi-view machine learning where the idea was to map all the different views onto a common subspace and then learn a function on that subspace. A second family of ideas would be to learn a different model on each view, but at the same time, force them to agree. What I'm talking about here is that if I have my two views — one on a space of dimension P1 and a second on a space of dimension P2 — I'm going to learn two models, F1 and F2. But unlike what I was doing in late integration, I want to learn them jointly, and as I learn them, I want to enforce that those two models agree, that they make similar predictions. Then, in the end, I can build a final model that is a simple combination — an average, weighted average, or majority vote — of those view-specific models. One of the nice advantages of this type of integration is that, because you obtain one model per view, you're able to make predictions if you're in a supervised setting, or cluster points if you're in a clustering application, even if one of your views is missing.
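Before moving on, a caricature of the multiple kernel learning idea in a few lines. A real MKL solver optimizes the SVM objective jointly over the dual variables and the kernel weights; the sketch below just grid-searches the weight of a two-kernel combination on a validation split, with kernel ridge regression standing in for the SVM. All of this is my simplification for illustration, not the actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X1, X2 = rng.normal(size=(n, 5)), rng.normal(size=(n, 3))
y = np.sin(X1[:, 0]) + X2[:, 1] + 0.1 * rng.normal(size=n)

def rbf(X, gamma=0.5):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K1, K2 = rbf(X1), rbf(X2)          # one kernel per view
tr, va = np.arange(0, 30), np.arange(30, 40)

best = None
for mu in np.linspace(0.0, 1.0, 11):       # candidate combination weights
    K = mu * K1 + (1.0 - mu) * K2
    # Kernel ridge regression on the training split (stand-in for the SVM):
    # alpha = (K_tr + lambda I)^{-1} y_tr, prediction = K_va,tr alpha.
    alpha = np.linalg.solve(K[np.ix_(tr, tr)] + 1.0 * np.eye(len(tr)), y[tr])
    err = np.mean((K[np.ix_(va, tr)] @ alpha - y[va]) ** 2)
    if best is None or err < best[0]:
        best = (err, mu)
print(best[1])  # selected kernel weight
```

The point of proper MKL is precisely to avoid this outer grid search by exploiting the structure of the SVM objective.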
So those approaches are particularly suited to problems where you might not have all your different types of data available at prediction time, which I think makes a lot of sense in multi-omics settings. You might have a bunch of different types of information at the learning stage — gene expression, methylation data, mutation data, and so on and so forth — but you want to be able to make predictions even if one of those modalities is missing, whether because no one ever acquired it or because there's some quality control issue with it. So I think those approaches are particularly interesting for this reason. And here, I want to talk about, again, two major ideas: one is CCA, and the other is to impose this agreement through regularization. Okay, so CCA is a really old technique — Hotelling, 1936, so it's almost 100 years old. The idea is that you want to find basis vectors — vectors of a new space on which you're going to project your data — so as to maximize the correlation between the projections of the two views. Here's the problem you want to solve: I want that if I project view one on vector W1, and I project view two on vector W2, then the dot product between those two projections is maximal, and I impose some constraints just to have a unique solution to the problem. It's a bit similar in spirit to PCA, where you want to find directions on which to project your data so as to maximize its variance, but here you have two views, and you want to project each of them — you have a new representation for each of them — in such a way that they're maximally correlated in the new space. Another way to look at it — there are two equivalent formulations — is that you want to minimize the disagreement between the projections. So here, again, I'm projecting my first view, X1, on W1, I'm projecting my second view on W2, and I want the difference between those two to be minimal.
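The CCA problem just stated can be solved in closed form via an SVD of the whitened cross-covariance. Here is a small NumPy sketch on synthetic views sharing one latent signal; the data and the small regularization term are assumptions for illustration:

```python
import numpy as np

def inv_sqrt(S):
    # Inverse matrix square root via eigendecomposition (S symmetric PD).
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                       # latent signal shared by both views
X1 = np.column_stack([z, rng.normal(size=n)]) + 0.1 * rng.normal(size=(n, 2))
X2 = np.column_stack([rng.normal(size=n), z]) + 0.1 * rng.normal(size=(n, 2))

X1c, X2c = X1 - X1.mean(0), X2 - X2.mean(0)
S11 = X1c.T @ X1c / n + 1e-6 * np.eye(2)     # view-1 covariance (regularized)
S22 = X2c.T @ X2c / n + 1e-6 * np.eye(2)     # view-2 covariance
S12 = X1c.T @ X2c / n                        # cross-covariance

# CCA: singular vectors of the whitened cross-covariance give the directions,
# singular values give the canonical correlations.
U, s, Vt = np.linalg.svd(inv_sqrt(S11) @ S12 @ inv_sqrt(S22))
w1 = inv_sqrt(S11) @ U[:, 0]
w2 = inv_sqrt(S22) @ Vt[0, :]

corr = np.corrcoef(X1c @ w1, X2c @ w2)[0, 1]
print(round(abs(corr), 2))  # close to the top canonical correlation s[0]
```

Note that CCA correctly picks out the first column of view 1 and the second column of view 2, where the shared signal lives.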
This naturally extends to more than two views, and there are many extensions of CCA — it's been around for so long — including kernel versions; for the multi-view extensions you have Kettenring in 1971, and a whole bunch of work on CCA. Here again, with CCA, we're talking about dimensionality reduction, and each of your views is going to be sent to a different representation, so keep that in mind; it's just that those different representations are going to agree among themselves. Okay, so it's exactly this idea of learning different models on different views and making them agree among themselves, but in a supervised setup, you're still missing one step: how do I go from having those different representations — still one representation per view — to having a single unified model on top of the views? And the answer to that is typically through regularization. I hope you're familiar with the idea of regularization; just to explain it in a few words: here I'm in a supervised learning setup, and supervised learning is typically empirical risk minimization, which means that I define a family of models, capital F, and I'm looking for the model in this family that minimizes the empirical risk, where the empirical risk is simply the average error that my model makes on the data set. Here, L is the loss function: L of yi, f of xi is the error I'm making by predicting f of x instead of y, when the true label is y. Almost all the supervised machine learning algorithms you know follow this setup, provided you add a regularization term — sorry, I don't know why it says loss risk here, it should say regularization. A typical regularization function is the L2 norm of the weight vector, of a vector of regression coefficients, for instance.
So regression enters this framework, SVMs enter this framework, and that's what you can do view per view. But regularizers can also be used to tie the different views together. Here, in black, you can see the exact same thing as before for each of your different views: if I was only finding the minimum over the sum, for these different views, of my empirical risk with a view-specific regularizer, it would be equivalent to learning V different models independently from each other. But I can add an additional regularizer enforcing consensus, which is going to encourage my V different solutions to be similar to each other. And there are many examples of that. One of them builds on the SVM: an approach from 2005, 2006, that's called SVM-2K. If I go back to the primal formulation of the SVM, this is the problem you're trying to solve. You're looking for a weight vector — I'm not using kernels here — a weight vector W such that you minimize a regularization term, the L2 norm, plus the sum of slack variables, where the slack variables are the errors you're making on each of your samples. The prediction function is Wx plus b, the true label is y, and the slack tells you how far you are from having y times (Wx plus b) greater than one. What SVM-2K does is repeat this problem: here you have it for the first view — if I'm minimizing this, I'm finding an SVM over my first view; if I'm minimizing this, I'm finding an SVM over my second view. And what we're doing with SVM-2K is adding a constraint here that is the consensus: those eta i are small when the two views agree in their predictions. So you have a number of constraints, which are again those constraints here for the first view and for the second view, and in orange, I've highlighted the constraints that tell you that the two views should agree.
And then in the end, your final prediction is just the average of the two predictions. And of course this extends to more than two views. So that's the first example. You can also do that with lasso-type approaches. So the lasso is an empirical risk minimization on a linear model again; here I wrote it for a regression problem, so I'm using the quadratic error, and I have an L1 norm here regularizing my regression coefficients. And one way you can tie together several lassos on different views is what I've written here. In black, you have the same thing as before, learning on your views separately, and then you're using what's in orange to tie your views together: this matrix M, where M_{i,v} is the contribution of view v to the label of sample i in your training set. And you want this whole matrix M to be sparse and low-rank; this is what's going to tie your different models together. In the end, your final model is a linear combination of the models you've learned on each view, and the coefficients of this linear combination are, roughly speaking, averages of those contributions of the different views to the different labels. You could imagine different regularizers that would tie your views together. Actually, I had one in mind, and then I looked for a reference for it in the literature and I couldn't find it. So I don't know if I'm the only person who has thought about it, which is probably unlikely; maybe I wasn't good at searching the literature, or maybe it actually doesn't work in practice. But that's one of the ways you can tie together several lassos. And you can apply this approach to NMF as well. I presented NMF for a single view earlier on, and then I told you that you can do joint NMF by looking for a single matrix W that would be common to your different NMFs for each of the different views.
But what you can also do is allow yourself different projections, a different W_v for each of the views, and then use a regularizer to tie together those different matrices, so that they're not too different from each other. So again, a regularization approach. Okay, all right. So I've already talked about two of my favorite things in machine learning, which are kernels and regularization. The one thing I haven't talked about so far is graphs. And that's what I want to talk about now: how people have been using graphs to do multi-view machine learning. Okay, so first a small recap on graphs. You can use graphs to model relationships between entities. A graph is a set of vertices and edges, and the edges are pairs of vertices. So here you have a graphical representation of a graph; it has nine vertices and a number of edges I haven't counted. And you can also represent a graph by a matrix, called an adjacency matrix. Well, by several types of matrices actually, but the adjacency matrix is one of the most common ways of doing that. It's a square matrix: you have as many rows and columns as you have vertices in your graph, and each entry of this matrix is non-zero if and only if there is an edge between the two vertices. So here there's an edge between the first and the third vertices, so there is a one at position one, three in my matrix. There's a bunch of variants on this idea. One of them is that instead of having ones and zeros, you can have weights over the edges; so you would write weights, labels, on the edges, and instead of the adjacency matrix being a binary matrix, you would have real-valued numbers. And you can also orient the edges: you would have arrows on the edges, and then the adjacency matrix wouldn't be symmetric anymore, so having an edge from v1 to v3 wouldn't be the same as having an edge from v3 to v1.
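A tiny example of these adjacency-matrix variants, with invented vertices, edges, and weights:

```python
import numpy as np

# Five vertices indexed 0..4; edges are pairs of vertices (made up).
edges = [(0, 2), (0, 1), (1, 3), (2, 3), (3, 4)]
n_vertices = 5

# Binary adjacency matrix of an undirected graph: symmetric, A[u, v] = 1
# if and only if there is an edge between u and v.
A = np.zeros((n_vertices, n_vertices))
for u, v in edges:
    A[u, v] = A[v, u] = 1

# Weighted variant: real-valued entries instead of 0/1.
W = np.zeros((n_vertices, n_vertices))
for (u, v), w in zip(edges, [0.9, 0.2, 0.5, 0.7, 1.0]):
    W[u, v] = W[v, u] = w

# Directed variant: drop the symmetrization, so an edge from u to v is
# not the same as an edge from v to u.
D = np.zeros((n_vertices, n_vertices))
for u, v in edges:
    D[u, v] = 1
```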
And graphs have been used a lot, in particular in bioinformatics, to represent two things. First, prior knowledge that you have about relationships between entities. Maybe one of the most classical examples in biology is biological pathways: your nodes would be genes, and the edges tell you that those genes belong to the same pathway, or work together towards achieving a biological function. And you can also build graphs from any data: you can always build a graph from a matrix of similarities, so your adjacency matrix would be a similarity matrix. You could threshold it if you wanted it binary: you compute similarities, say the dot product in your input space, and then you say, okay, feature one and feature three are similar, their similarity is above a certain threshold, so I put an edge between them; feature one and feature two aren't, so I don't put any edge between them. Okay, so graphs can be used to represent relationships between samples. Each of your different views gives you a different adjacency matrix, which is an n-by-n matrix. So each of your views gives you a different graph, but with the same nodes. And then supervised learning can be seen as a node labeling problem: if I have the labels for some of those nodes, how do I learn the labels for the missing nodes? I'm not going to talk about this because I think Anaïs will talk about it tomorrow. What I want to talk about is the idea of using graphs to tie views together in the context where features from each view can be mapped to a single vertex in the graph. So let me clarify what I mean by that. When you're working with omics data in particular, you can very often match your different omics features to genes.
So if your features are RNA transcripts on the one side and protein levels on the other side, of course you know how to map those to the same gene. You can also do this with SNPs. And I'm mentioning this here because it's MLFPM work; it's something we've been exploring in the context of GWAS with Christelle and Jen. So how do you map SNPs to genes? The main idea is that you map SNPs to genes based on positions on the DNA sequence. But you can also map SNPs to genes based on known regulatory information: you know that this SNP regulates the expression of this gene, so you map it to it. And also proximity, not on the sequence, but in 3D space, which should also capture some regulatory information. So what this means is that you have your graph, which comes, for instance, from prior knowledge like biological networks, let's say protein-protein interaction networks. This is a graph of your features: each node, each vertex, is a feature, and then each of the views gives a label to this node. So you can see this as having one graph per view: the same graph structure for the different views, but different feature values for each view and for each sample. And that's what I've tried to represent here for one view: I have a graph structure that is common to everyone, but sample i has a different label for each node. I mean, sample i here has a different label for this node, which is the second gene, than a different sample, say sample l, would have. And now there are several things you can do with such a representation. One would be to say: okay, each of my samples is now represented by a graph, or by several graphs, one per view. And I know how to build kernels on graphs. So I can use graph kernels to build one kernel per view and apply multiple kernel approaches. And one successful approach for this is called PAMOGK.
It's a pathway multi-omics graph-kernel-based approach. I've summarized it in one sentence, but this is what it does: each of your samples is a different set of labels over this graph, you use graph kernels that don't care so much about the structure of the graph but care about the labels of the nodes, and then you can use those to build kernels over each of the views of your data. I think Christelle will also talk about some examples in this field tomorrow. So what I want to talk about now is how to use graph regularization to guide feature selection with such approaches. I'm cheating a bit here, because I'm actually not in a multi-view framework anymore. We could discuss how to extend those ideas to multi-view learning, but to the best of my knowledge, it hasn't been done yet. So I'm going to cheat and say that you have a graph that is built from one or several of your views, and that you want to use this to select features based on another of those views. So if I come back to my picture here: now I want to use this data representation to learn a model, and I want to learn which of those nodes are important for my output. There are several families of approaches you can take, but they're all based on regularization, which is what I want to talk about. Several proposals for this have been built on the lasso. You want to use the lasso to encourage sparsity (that's the feature selection part), and then you want to use the graph to build a regularizer that encourages connected features to have similar weights. So this is again your classical lasso formulation, here this is the regularization parameter for the sparsity, and now you're adding this additional regularizer.
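As a hedged sketch of this kind of graph regularizer (my own illustration, not the exact formulation from the slides): a lasso with an added quadratic penalty on differences of connected coefficients, written via the graph Laplacian, solved by proximal gradient descent (ISTA). Graph, data, and all hyperparameters are invented.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 norm
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Chain graph over 6 features: 0-1-2-3-4-5
d = 6
A = np.zeros((d, d))
for u in range(d - 1):
    A[u, u + 1] = A[u + 1, u] = 1
L = np.diag(A.sum(1)) - A  # graph Laplacian, so w.T @ L @ w
                           # = sum over edges (u,v) of (w_u - w_v)^2

rng = np.random.default_rng(2)
n = 80
X = rng.normal(size=(n, d))
w_true = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # smooth on the chain
y = X @ w_true + 0.1 * rng.normal(size=n)

# Objective: (1/2n)||Xw - y||^2 + mu * w.T L w + lam_l1 * ||w||_1
lam_l1, mu, step = 0.02, 0.5, 0.01
w = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / n + 2 * mu * (L @ w)  # smooth part
    w = soft_threshold(w - step * grad, step * lam_l1)  # L1 part
```

The Laplacian term pulls connected coefficients toward each other, so the fitted `w` decays smoothly along the chain instead of jumping from 1 to 0.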
So here are two examples of this regularizer. The generalized fused lasso, proposed by Tibshirani and others in 2005, uses a regularizer that is a constraint over the absolute difference between regression coefficients that correspond to features connected by an edge. What using this regularizer does is encourage two features that are connected on your network to get similar weights. The network-constrained lasso also does that, in a slightly different way: it uses a Laplacian regularization. I'm not going to get into the details today, but it's a very similar idea; it uses the square of the difference between the coefficients rather than the absolute difference. Besides that setup, you also have other types of approaches where you really focus on selection, and where you start by computing a relevance for each feature. So if I go back to my old drawing here: I've used my n samples to compute, for each of those nodes, a relevance. This can be a correlation between the feature and the outcome, it can be based on a statistical test, or on a nonlinear measure of independence like HSIC. And then I'll use the graph to select a few features with high relevance that are connected on the graph. These are approaches that were explored in particular in the previous iteration of this ITN, so I thought it was historically relevant as well. The formulation is very similar to that of the lasso, except that instead of minimizing an error term, the risk, you're maximizing a relevance under some constraints. One of the constraints is on the number of features you're selecting: you want to make sure that the size of the set S, which is your set of selected features, is small. And then you have a constraint on the graph. And we have two variants of that; the first is the one we proposed with Karsten and others in 2013.
Here, this graph constraint looks like the graph Laplacian constraint of the network-constrained lasso, whereas with this constraint you penalize disconnected solutions. What I mean by that is that every time you have an edge between a feature that is selected and a feature that is not selected, this decreases the term you want to maximize. And then, a few years later, in collaboration with the lab of Florence Demenais, the lab of Karsten proposed SigMod, which has a slightly different penalty: instead of penalizing disconnected solutions, it encourages connected solutions. Okay, I'm almost done with my overview of these ideas, but I think it's very interesting to think about using prior knowledge to tie together different views in omics data. I've shown you how this can be done with graphs, but this can also be done (you're going to tell me that it's a form of graph as well) in deep learning, by building knowledge-informed network architectures. This is something that is sometimes called visible ML, visible machine learning. I don't like that term so much because it really only applies to deep learning, so I'd rather call it knowledge-informed deep learning architectures. The idea here is that instead of building a classical feed-forward, fully connected architecture for your neural network, you're going to give biological meaning to each of your hidden layers. So for instance, if your input features are genes, then in the next hidden layer you're only going to connect together genes that belong to the same complexes, and then you're only going to build connections between complexes that belong to the same pathways. So this is one example; there are many variations on this idea.
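To make the knowledge-informed architecture idea concrete, here is a minimal sketch of my own (the gene and pathway names are invented, and this is not any particular published architecture): a binary mask derived from prior knowledge zeroes out every connection without a biological interpretation, so each hidden unit corresponds to a known gene set.

```python
import numpy as np

# Invented prior knowledge: which genes belong to which pathway
genes = ["G1", "G2", "G3", "G4", "G5"]
pathways = {"P_A": ["G1", "G2"], "P_B": ["G3", "G4", "G5"]}

# mask[i, j] = 1 iff gene j belongs to pathway i
mask = np.zeros((len(pathways), len(genes)))
for i, members in enumerate(pathways.values()):
    for g in members:
        mask[i, genes.index(g)] = 1

rng = np.random.default_rng(3)
# Weights exist only where the mask allows a connection
W = rng.normal(size=mask.shape) * mask

def pathway_layer(x):
    # Hidden unit i only sees the genes of pathway i (ReLU activation)
    return np.maximum(W @ x, 0.0)

x = rng.normal(size=len(genes))  # one sample's gene-level values
h = pathway_layer(x)             # one activation per pathway
```

During training, the same mask would be applied to the gradient so zeroed connections stay zero; stacking further masked layers (complexes into pathways, pathways into processes) gives the hierarchies described above.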
The idea here is really that instead of having something fully connected, just an out-of-the-box architecture, you only make connections between features that it makes sense to you to connect, based on prior knowledge. And this allows you to reduce your hypothesis space, to reduce the number of possible models you're learning, by tying together features that have meaning together. And I think it's also one of the ways to address this curse of dimensionality that is plaguing us in bioinformatics. I'm also mentioning that Pelin, who is one of the fellows of the current ITN, working with Joaquín Dopazo, has, as far as I know, been working on similar ideas during her PhD. All right, so I'm almost at the end of what I wanted to show you today. The last thing I want to talk about is feature selection and interpretability. A few of the methods I've talked about were focused on feature selection, but one of the limitations of several of the approaches that I've presented is that it's difficult to get anything explainable or interpretable out of them, to know which are the relevant features from your inputs, from your data, what is really driving your algorithm. And we have many reasons in bioinformatics to wish for interpretable models. So that's what I've tried to represent here: we have two views, and there are many features in each of the views, but I'd like to find out, in each of the views, which are the important features. So I've just jotted down a few ideas here. Of course, the first thing you can do is that if you're doing early or late integration, you can apply single-view feature selection techniques.
So anything you can think of, from learning sparse models, to using feature importances in random forests, to using SHAP or LIME and so on and so forth, to using attention if you're using deep learning: all of this applies to your early or late integration technique, but you don't benefit from joint learning. The only answer I have to that is that a number of those multi-view algorithms I've talked about already had feature selection built in, like this fancy lasso, or the graph-based methods where nodes are features; those were graph-guided feature selection. But also, for instance, for NMF or for CCA, you have sparse variants that allow you to build sparse models that are only using part of the features. Of course, for all those approaches, you have the issue that sparsity here is achieved with L1 regularization, and L1 regularization is pretty unstable. If you have many correlated features to start with, your L1 regularizer is not going to be stable, in the sense that if you have small variations in your input data (you remove one sample, you run it on a different day or on a different machine, that type of thing), you get different sets of explanatory features in the end. There's a bunch of approaches that try to address this; in particular, that is also one of the reasons for developing all those graph-based feature selection approaches, to try and stabilize this, but we're still not at something ideal. So those were my concluding words. I will share my screen with all the references, which are the last slides of the talk. But now I've been talking for a very long time, and I'd like you to ask me questions. Thank you very much, Chloé, for this excellent talk on data integration; we've got a very holistic overview of the various approaches that exist. I'm sure there will be questions. Let us start with Dejan.
Thank you very much for this presentation. It's going to be very useful to me, so I'm extremely happy to have the possibility to attend this lecture. My question is: do you have any recommendation when performing data integration with one of the data types being images? And especially, are there some graph-based methods that are particularly relevant or commonly used for this type of data? Okay, so this is something I know very little about, but I know that there are deep learning approaches that have been trying to address this. In particular, I think there's this paper, I can send you the reference later on; it's called PAGE. It's not really easy to look for a paper called PAGE on Google Scholar. What they're doing is this idea of knowledge-informed neural network architectures, combined with the multi-view deep learning I was showing before. So they have that on the one hand, and then they have different inputs that come from whole-slide images, with what now passes for a classical deep learning network on whole-slide images, and in the end it joins the different views. This is particularly about tying together the information between the omics and the images. People in spatial transcriptomics have been trying to do this, but I'm not really aware of a particular reference here. I'm going to look at this paper, thank you very much. Sure thing. Thank you, Dejan and Chloé. Giovanni is next. Hi, thank you for the talk, I really appreciated it. I have a quick question; I hope it's not something that I missed from the talk. Many complicated models, like neural networks, are often trained with gradient descent or some similar optimizer, but a lot of the constraints that you mentioned in some of the models you presented are not differentiable.
Like, I remember there is the rank of a matrix in one of the models that you presented. Is there any general way to overcome this challenge if we want to use these more interesting constraints in a model that can only really be optimized with gradient descent? So, they're typically optimized with gradient descent anyway. For instance, the L1 constraint is not differentiable, but you pretend it is, because it's convex and non-differentiable at only one point. Using rank constraints on lassos is a fairly common thing to do; I would look specifically at how those problems are being solved, and I'm pretty sure you can translate this to a neural network approach. But it was not something I had talked about, and I would also not be offended if you had missed one second of information in all of what I've just delivered. Thank you. Are there further questions for Chloé? I have one question. Oh no, Pelin has her hand up. Yes, Pelin says she has problems with her mic and would like to put her question in writing. Yes, that's no problem at all; I'll read out the question. So Pelin is asking: thank you for the talk, it was very informative for me. My question is related to deep multi-view learning on slide 12. May I ask what the difference is if we first combine the multiple views and obtain the encoded information using an autoencoder, rather than taking the combination of the last hidden layers of each view's model? So are you asking what would be the difference between this and just concatenating the different views? Or did I miss something? Yes, Pelin confirms, so the answer to your question is yes. Okay, so I think the difference is that having those intermediate layers that are specific to each view allows you to address this limitation I was talking about, about combining apples and oranges.
You see, those different views don't live in the same space; they're not on comparable scales, those types of things. I think having first some layers that are specific to each view, and then something global, allows you to first sort of homogenize the data across the different views and then learn something global. I hope this answers the question. Yes, she says yes and thanks. Okay, thank you. Great. So my question is the following. We talked this morning a bit about confounders in Magnus's talk, and this whole field of data integration could, well, on one hand benefit the study; on the other hand, the more sources you integrate, the more risk there is that you integrate some confounding, or that there is some systematic difference between data views or data sources that introduces some signal into your joint data set which is not biological but somehow technical or confounding-related. So do you see or do you know methods that address this problem, or do you think this could be a line of future work in this field? How would you go about this kind of confounding risk in data integration? So this is a very interesting question, and I'm actually not entirely sure that you have more risk of confounding with multiple views than with single views. I would actually tend to hope that some of the technical biases, if you don't have the same technical biases on the different views, might compensate each other. Or maybe it's even worse, because if you've acquired your different data differently across the different views, then you're just going to learn technical bias. I'm not sure. I don't know if there's something specific to multi-view to do, or if it's just the same problem as when you treat each view independently from the others.
Yeah, I don't have an answer to your question. I'm not familiar with any work that would be specific to confounders in multi-view learning, but, you know, it might exist, right? No, I agree. I'm also not aware of work that is specific to the multi-view setting; there's a lot for the single view. Yeah. But I'm not sure there's anything about the multiple views that would allow you to address it in a different way than in each of the single views. I think maybe some of the algorithms that address it on single views would need to be ported to multi-view. Yeah, I think one has to think carefully about how related these confounders would be across the different views, and whether they cancel each other out, or whether they only exist in one view; I think it depends very much on this. Or whether they just accumulate because they exist in all views. Yeah, exactly. Definitely. Okay, thank you. Interesting. Are there further questions for Chloé? Can I just jump into this? Falko, please. So I'm a little bit confused. I mean, confounding happens when you leave something out; that's the typical case. And the case where you add too many inputs, I mean, that can happen, but isn't that extremely rare? It's one of these Pearl conditions, Berkson or something, one of these strange situations where you suddenly introduce dependencies which aren't there if you exclude this variable. But typically, my understanding is that the issue with confounding is that you leave something out and you have a hidden confounder; that's the problem. If you include too many inputs, that might not be nice, and eventually, of course, might not be very helpful for many statistical reasons, but my understanding would be that that is not the big issue. But maybe I'm missing something.
Well, I think it depends. If the confounding is due to missing biological information, then of course, if you have multiple views and you add more information, it helps. But if your confounding is due to technical biases... Maybe I don't understand. So imagine you're trying to cluster different samples from cancer patients, and the different views have been acquired in two different clinics, two different labs. And so actually the major difference between all those samples is whether they've been measured in lab one or in lab two. Yeah, then you should include the identifier for the lab, and then you remove the confounder. Right, so yeah, that's the same thing. In this case, clearly, whether it's single view or multiple views, if you include this information, you should be able to address it. But if you've forgotten to include it... Yeah, that's the thing. If you forget to include it, then you have a problem. That's the problem with confounders, right? It's forgetting to include them. Well, another example of forgetting would be if you have a genetic data set and you now start adding another population to that data set where the phenotypic ratio is a bit different. You have more cases of a particular phenotype, and then this phenotype may start to correlate with membership in this additional cohort of a different geographic origin. This is a very typical problem in genetics: you have this kind of population structure, and then data integration actually harms, because the new data source correlates with the phenotype. But it's the same thing, right? If you include features that capture population structure, or you apply any other of those population structure correction approaches... It should be okay.
And if you don't do it on the single views, it's also bad for the single-view approach. So... Exactly. I think we agree, probably, but it was just against my intuition that you said too many inputs can be harmful. There can be some odd cases, but in most cases... I mean, it's what Ruben sometimes criticizes, because more inputs cannot hurt, which of course is not quite true in a strict sense; but the more dangerous thing, I think, is to leave something out. For sure. My logic would be, if I may, Falko and Chloé: the more you measure, the higher the risk that the device you're using to measure introduces confounding, that the measurements start correlating, or create correlations between the device they were measured on and the phenotype. And the different views are not measured with the same machine. Again, we see this in medical data all the time: if there's an upgrade to a machine in a hospital, then the measurements start changing. So if there's a slight difference in class ratios before and after the upgrade of the machine, then the machine correlates with the phenotype, or the machine version in that case. Yeah, but the trivial solution, which is probably not a good solution, is to include the identifier of your device. Absolutely, but it's often difficult to do this in a holistic and systematic way. So one often misses the crucial thing that has changed; sometimes it may be just a software update of the machine that measured one view, and then you have a confounder you didn't notice. Of course, and as you mentioned, one solution to confounding is to include the inputs; the other one is stratification, or preferably to remove the confounder. Yeah, but all those techniques are not particularly multi-view specific, right? It's just, yeah.
Yeah, maybe the danger becomes... whatever. No, probably not. I guess you could apply multi-view clustering techniques and then see whether your clusters seem to have more to do with biology or with some technical artifact. Good, thank you. Then we send a round of applause to you, Chloé, for this wonderful lecture. We have a 20-minute break now until we continue with Kathryn Roeder's talk at three. Thank you, Chloé. Thank you. See you all after a short break.