It's my great pleasure to welcome Caroline Uhler. Caroline is faculty at ETH Zurich and at MIT, and she does very exciting work in machine learning, genomics, and statistics. She comes from a mathematics background, worked on algebraic statistics, and did a PhD in statistics at UC Berkeley. Over the last few years she has moved into more applied fields like genomics and machine learning and is studying very exciting problems there. Her work addresses problems such as graphical model learning, causal inference, and autoencoders, and in computational biology questions such as the spatial organization of DNA and multimodal data integration, which we'll learn more about in a few minutes. Caroline is a star in the field. She has won numerous awards, including an NSF CAREER award, a Sloan Research Fellowship, and a Simons Investigator award in the Mathematical Modeling of Living Systems. We're very happy to have her here and to learn more about her current work. Thank you, Caroline, we are looking forward to your talk.

Thank you very much, Carsten, for this introduction and for having me here. Let me just get started, with some motivation for this talk. As Carsten said, we have been very excited about and interested in problems related to multi-domain data integration. Here I mainly work on applications to biology, but I also want to make clear that the same kinds of questions arise in many, many different areas.

If we start with the biological applications: I'm excited by questions in single cell biology because I'm interested in how the packing of the DNA is related to gene regulation. What is exciting nowadays is that you can observe the cell with many different kinds of data modalities, which tell you different things about the cell. For example, you can take images of cells and highlight many of the different features that you see here. At the same time, at the single cell level, you can nowadays take RNA-seq profiles at very high throughput, ChIP-seq, ATAC-seq, et cetera, which give you other kinds of insights into cells.

But one of the big challenges in single cell biology is still that acquiring many of these data modalities is highly destructive to the cell. So it is still very, very difficult to get different modalities in the same cell. For example, if you get an RNA-seq profile, the cell is destroyed. If you take an image, you fix the cell, and again the cell is destroyed. So you don't get different modalities in the same cell. What you can do, of course, is take a population of cells, take some of them out for imaging and some of them out for sequencing, but it's not the same cell. So in order to understand how, for example, the packing of the DNA, which you can measure very well in a DAPI-stained image, relates to gene expression, you really have to infer what the image would have looked like, were I able to image that cell, based on its RNA-seq profile, and also the other way around. And very similar questions arise in all kinds of other applications.
For example, you may want to integrate and translate between audio and video in order to get better spatial correspondence for object detection, or you have different kinds of data modalities like video, radar, lidar, et cetera, that you want to integrate and also translate in order to compensate for missing or corrupted data. So those are the kinds of questions, in terms of multi-domain data integration and translation on the observational data side, that I've become very interested in.

But then what is super exciting in biology and genomics is this opportunity to perform really, really large-scale interventional screens. This has become possible through the CRISPR system: nowadays you can intervene on cells, get single cell RNA-seq profiles, and really know which gene actually got knocked out or perturbed in the screen, at the single cell level. That's super exciting. At the same time, there are all these huge drug screens. But also here, the problem that the cell is destroyed by the measurement comes up again, because what I would really like to be able to say is: what would the cell have looked like before the intervention, when I measure it after the intervention, or the other way around? I still cannot do both, because getting the measurement means destroying the cell.

Similarly, you often have a drug screen that looks at many different cell types. And now comes a disease, say COVID-19, which affects particular cell types very much. And now the question is: from the drug screens that have already been performed on some cell types, can you infer what these drugs would do on this diseased other cell type that you have not yet measured, on which you have not yet tested all these drugs? You can of course get some data from this particular cell type, but you probably don't want to go through all these drugs and test all of them again.

And of course similar questions arise in many other areas where you can nowadays perform interventional experiments: think of A/B testing in advertisement and personalized ads, think of online education, which makes performing interventions much, much easier, or manufacturing, et cetera. Everywhere you have observational data and interventional data, you would like to integrate them or translate between them, or translate between before and after an intervention, to really be able to figure out the causal effects.

So this is a high-level overview. Now what I want to do is give you four specific applications that have motivated what I will then be presenting in terms of machine learning methods. These are the four applications, and you'll see they're all in the same flavor as what I presented before. These transport problems, or transfer, or translation problems, have different names in different areas. The first one is the one where I want to transport between different data modalities. So here is the real application that I care about; it's what I said before.
You have a population of cells. I can take out some of them for imaging; these are DAPI-stained images of the cell nucleus, so here you see the packing of the DNA. And I can take out some of them for sequencing; here I have single cell RNA-seq data. There is no paired data for imaging and sequencing at this level. Because I know they come from the same population of cells, I would really like to learn the map that can bring me from RNA-seq to images, or the other way around from images to RNA-seq. I would like to be able to answer the question: you give me a particular RNA-seq profile; what would that cell have looked like, were I also able to image it? Or you give me a particular image; what would this cell have looked like, were I also able to get its RNA-seq profile? Only that will actually tell me something about the biology of how these two things are related to each other. So that's the first problem I want to discuss.

The next one is very, very related. Again it comes up because getting these images is highly destructive to the cell. In particular, since these images need fixing of the cell, I can never get access to standard time-series data: once I get the image, the cell is destroyed and I cannot follow it over long processes, say cancer progression. Inside the body you anyway cannot get that, even if you can get some short progressions. So what do I want to be able to do? Again, I have a population of cells at time point one and a population of cells at time point two. From time point one I take out some cells for imaging, then I let the rest progress to the next time point; at time point two I take out some other cells for imaging. Again, they're not the same cells. So I have a representation of the population at time point one and a representation of the population at time point two, and what I would really like to be able to answer is what a particular cell would have looked like at time point one, or the other way around. I cannot observe a cell across these two time points; I really need to infer this. And of course we'll also have to talk about how we actually validate these kinds of methods, which we will.

So it's a very similar question, and again I hope I can actually infer this map, because I know it is the same population of cells that is progressing over time, so something should in fact remain invariant. But I really want to be able to do this at the single cell level because, for example, here I'm motivated by cancer early detection problems: I have these cells, and I really want to be able to tell how they looked at earlier time points. Would I have been able to detect, already then, that they are on a path to becoming cancerous? In the end I want to be able to do something like inferring cell lineages, et cetera, although I don't have direct access to that. So that's the next question. All of this is at the observational level, and of course you can ask these things also with RNA-seq data or any dataset that you care about.
So then I come to the interventional setting, which I mentioned on the previous slide. Here I'm particularly motivated, as you'll see, by the SARS-CoV-2 application, where I want to infer how drugs would affect a different cell type that I don't get to see. But you can also ask the same question in terms of mice and humans. This is obviously something the pharma industry would really love to be able to do: I'm measuring the expression in mice under all of these different drugs and some controls, and in humans I have access to the expression in controls in different kinds of cell types. And I would like to infer what all these different drugs would do in humans. Again, you can think that something like this should be possible, because I do have controls in both settings, and hopefully I can learn enough about this map, and about what these drugs actually do compared to the control, so that I can fill in this little square here. So this is another transport question, and it's a causal one: I want to transport the effect of a particular drug or intervention to a different environment, which in this case is humans, or, maybe more realistically, a different cell type, which is how we're going to use it.

And then comes the last problem. To me, this last problem is actually very different from the others, and we are solving it in a very different way, just because we don't yet know how to put it into the same framework; I'm of course very interested if any of you have ideas for how to do this. This problem also occurs in the interventional setting, and here we're thinking about gene knockouts. Nowadays there are libraries to knock out any gene you like. But humans have 20,000 genes, and you don't want to, and cannot, in terms of how many experiments it would require, go in and knock out every combination of genes. That's a huge number, certainly not possible now, and probably never, just because the number of combinations is so huge. So what you really want, and what has been done, is the following: I already get a whole lot of different knockouts. I knock out, for example, gene A, I knock out gene B, I knock out genes C and D, I knock out gene E, et cetera. This is all the data I get to see, with controls and all of these different knockouts. What I would like to be able to do, since I don't get to see all the others, is predict the knockouts I have not seen. From all of those I have seen, can I predict what knocking out gene F will do, or what knocking out genes H, I, and A together will do? I have a whole lot of data over here, and I really want to fill in this table of all the other combinations of knockouts that I have not yet seen. To us, this is a very different problem, because you're trying to predict a different kind of distribution: this interventional distribution is different from anything you have seen here.
Of course, in the other problems the distributions are also different, but they at least come from the same environment, or they come over time, et cetera. Here it seems like, in order to figure out how the distribution actually shifts, you really need to understand something about the underlying gene regulatory network, the underlying causal graph. Because only if you know this can you figure out: if I intervene on a different node, what is actually going to happen, and in what way is the distribution going to shift? So here we're really taking a graphical models approach: inferring this causal network, and then from there being able to say, because I have now inferred this causal graph, and you tell me you're going to knock out gene F, and I know where gene F sits in the graph, I can predict what actually happens when you knock out this particular gene.

So that's the overview; these are the four questions I want to talk about. We will see that the first three questions, while they end up with different methods, all build very, very heavily on autoencoders. So I'll spend a lot of the time in this talk talking about autoencoders. Because we use them so heavily in our research, we spent quite a bit of time trying to understand the theory: why they work, when they work, how they work. That's what I want to talk about here. But I'll start off with graphical models, partly so that I leave most of the time for these applications and for autoencoders, and also because I'll get back to this application at the end, where I'm going to use all of these things together in the SARS-CoV-2 application that I mentioned. So I'll start with this last problem and then move on to the other problems, which have autoencoders as their backbone.

Note that these problems all sound somewhat similar, but I think they are really quite different. If you're working with interventions, transporting between interventions, that is, predicting the effect of a different intervention, seems very different from transporting the effect of a given intervention from one environment to another, say from mouse to human, or from one cell type to another. So although all these transport problems sound similar, they can be quite different in nature in the end.

I'll start here because, as Carsten said, we've spent a lot of time in recent years, actually maybe the last 10 years or so, working on graphical models and causal inference. So where are we in terms of causal inference, and how do you actually learn gene regulatory networks when all you have is RNA-seq data? That's data on the nodes, and you would like to learn a network that connects all of these nodes and is causal, in the sense that it can tell me what happens if I knock out a particular gene: the network will tell me which genes' expression will actually change. Now, of course, causality is such an old field.
It obviously has a long history; it's such an important field, asking why something happens. The framework we're using is one that was introduced by Sewall Wright in the 1920s to study heredity in different species. He introduced these diagrams, these directed graphs, to represent causal relationships, which in his applications concerned heredity. Because they're causal, there are no directed cycles; there are nowadays extensions that allow directed cycles, but I'm going to use the acyclic version in this lecture, since causality can only go forward in time.

If you think of gene expression, the measurements are noisy, not deterministic, so I treat every node as a random variable. What this causal graph represents is actually a very simple model: every node, a random variable, is some function of its direct parents and noise. So for example, X4 here is a function, not necessarily linear or anything like that, since this is gene regulation, of its direct parents, in this case X2 and X3, and some noise, not necessarily Gaussian, because gene expression data, in particular single cell data, is certainly not Gaussian.

So that's the model, but of course I don't get to see the network. All I get to see are observations on the nodes, observations of these gene expression levels, and in humans this will be a network of 20,000 nodes. Now I would like to infer this network. The earlier work has in particular been in the observational setting, and it is very clear that it is very hard to learn causal graphs purely observationally: in general you cannot identify the full graph, I'll have that also on the next slide. And even if you account for latent variables, et cetera, it's just very, very hard to say anything causal if you only have observational data.

That's why what I'm super excited about is that in the genomics context, you actually do get a huge amount of interventional data. Gene knockout data is interventional data that can really, really help you get to the underlying causality. That's somehow the new question, one that was not available before; the earlier work didn't have to think about interventions and how to deal with interventional data. So the really exciting opportunity is that we have interventions and we can think about them.

In particular, we have knockout interventions. These are the most invasive ones; they are hard interventions in the sense that you go in, you choose a node, and you just set it to a value, in this case zero. With a knockout, I set the node to zero. That means I'm actually changing the graph structure: if you go in and set a node to zero, then, in this example, X1 has no more effect on X2. And then there are all the other interventions, think of overexpression, et cetera. These are soft interventions, where you go in and just change how a node acts on the next node.
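To make the model concrete, here is a minimal sketch in Python of such a structural causal model on four genes, with a hard knockout intervention implemented as clamping a node to zero before its children are generated. The graph, functional forms, and noise distributions are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, knockout=frozenset()):
    """Sample n cells from a toy 4-gene structural causal model.

    Hypothetical graph: X1 -> X2 -> X4 and X3 -> X4. Each node is a
    (nonlinear, non-Gaussian) function of its direct parents plus noise.
    A knockout is a hard intervention: the node is clamped to zero,
    severing the influence of its parents on it.
    """
    x = {}
    def set_node(name, values):
        x[name] = np.zeros(n) if name in knockout else values
    set_node("X1", rng.normal(size=n))
    set_node("X2", np.tanh(2.0 * x["X1"]) + 0.1 * rng.normal(size=n))
    set_node("X3", rng.exponential(size=n))
    set_node("X4", x["X2"] * x["X3"] + 0.1 * rng.normal(size=n))
    return x

obs = sample(10_000)                    # observational data
ko = sample(10_000, knockout={"X2"})    # interventional data: X2 knocked out
print(obs["X4"].mean(), ko["X4"].mean())  # knockout of X2 shifts X4's distribution
```

Note how clamping X2 changes the distribution of its descendant X4 but not of X1 or X3, which is exactly the upstream/downstream information interventions provide.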
And so the question is, of course, what can you do with interventions? How much do they help you? Can you develop algorithms? It's not so easy. If you think about it, observational data comes from one distribution; if I intervene on a node, I'm shifting the distribution, so that's a different distribution. Now you have data from all these different distributions, and together you want to learn one graph. That's really what makes it hard: combining all of these different datasets to learn one object in the end, something that connects all of these different distributions together.

So what can we say? First of all, if we want to come up with algorithms, we have to think about what we can even identify. If we had an infinite amount of data, what could we identify about the causal structure? This has been known for a long time in the observational setting, which I alluded to before: in general, even if every node is observed and you have an infinite amount of data, you cannot identify the full causal graph. You can only identify it up to what is known as a Markov equivalence class. Meaning, for example, and I'm sure you're all familiar with this: from observational data alone you cannot say whether X causes Y or Y causes X; all you know is that they are correlated. So in this particular setting here, you cannot identify which of these two graphs is the right one. However, you can of course distinguish this graph from, for example, this other graph, et cetera. So that's what you can identify.

Now, in the interventional setting, you would hope that with interventional data you can identify more, and you can. For hard interventions, this was done in this first paper here, from Bühlmann's group; that's a really nice paper that tells you how much more you can identify, in terms of interventional Markov equivalence classes. And what we showed, which was left open as a question there, is that for soft interventions, which are much less invasive, you actually get exactly the same interventional Markov equivalence classes. So no matter whether you perform the super invasive interventions or the less invasive ones, you can identify the same amount about the underlying causal relationships.

So now we know what we can identify in the best case. If we want to come up with algorithms, they need to be able to identify exactly this. Now you can talk about consistency, because before that you couldn't: every algorithm is just some heuristic if you don't even know what you can learn in the best case as the number of samples goes to infinity. I should say that in this particular paper an algorithm was proposed for the interventional setting, but we proved that it's actually not consistent: even with an infinite number of samples, it will in general not converge to the correct causal Markov equivalence class. So then what do you do? There were no such algorithms.
So we had to come up with a new way of thinking about these causal questions, and it's a very, very simple and intuitive way of thinking about them. The general algorithms before usually worked as follows. Causal inference is NP-hard, so you need to do some kind of search over the space of graphs and hopefully still be able to prove that the search converges to the correct graph. Usually these algorithms do a greedy search over the space of graphs using a score function, for example the BIC. And it's pretty amazing that you can define a set of moves for which you can show that, with just observational data, these searches are consistent. But this doesn't work in the interventional setting: because you have a mixture of these different kinds of graphs, you can make the searches get stuck on their search path.

So let's think a bit differently about causality; it's such a simple idea. We want to learn a causal graph, a directed acyclic graph. What defines a directed acyclic graph? It's defined by an undirected graph, known as the skeleton, and a permutation, an ordering of the nodes. Because if I give you an ordering and the skeleton, an undirected graph, then you just orient all of the edges according to the ordering of the nodes: an edge between i and j points from i to j if i comes before j in the ordering. So the hard thing about causal structure discovery is really that you have to learn the permutation, because we all know how to learn undirected graphs. In particular, if I give you the correct permutation, you just do some conditional independence tests, basically regress each node on its predecessors, and you're done: you get the corresponding causal graph. So the only hard part is learning the permutation.

So instead of doing a search over graphs, let's do a search over permutations. The space of permutations is very, very nice; here's where some geometry comes in. The space of permutations is actually a nice polytope, known as the permutohedron. Now I want to do a search over all permutations. You'll need to do greedy search, since the space becomes large quite fast. So I do a greedy search on this space, moving around from one permutation to another. For every permutation, I can construct the corresponding DAG from the corresponding undirected graph. And now I need a score function, and we'll use the simplest possible one: the number of edges in the graph, just sparsity. This works without any assumption that the true graph is sparse: you can still show that the sparsest graph obtained this way is the true graph, up to Markov equivalence. In particular, if the true graph is the complete graph, then every permutation yields the complete graph, because all complete DAGs are Markov equivalent to each other. So this also works when the true graph is the complete graph, meaning you don't need any sparsity constraints. So that's how it works: you just start at a permutation and construct its corresponding graph.
You count how many edges there are. Then you look at the neighboring permutations and construct their graphs, and if a neighbor has at most as many edges as where you are, you walk there and continue. The hard part about proving this works is showing that you don't get stuck: I need to show that every local minimum of the search is a global minimum, so that whenever you end up stuck, with nothing left to move to, you know you have actually reached the correct graph. And that's in fact the case. This is for observational data, and the nice thing is that it extends directly to the interventional setting: hard interventions, soft interventions, now even unknown intervention targets, now even latent variables, et cetera. So this permutation framework is really quite powerful in that regard.

Why is that? Recall that interventions shift the distribution; when you work over graphs, the interventions also correspond to different graphs. However, the permutation is always the same. The permutation is the invariant: whether I intervene here or there, the ordering of the variables is not going to change. That's what we're using here, and that's why you can get consistent algorithms when you search over permutations instead of over graphs.

Now, in terms of scalability, there's a question: these problems are NP-hard, so how far can you scale these algorithms? In practice you usually do some depth-limited search, saying, I'll search five steps away and then stop, so you never fully get around the NP-hardness. If the true graph is the complete graph, then to be sure you output the correct graph you would have to go through all permutations, so in that case these things don't scale as well. But in general, for these kinds of biological applications, you can easily do thousands of variables. What is interesting, I think, is that it is as fast as graphical lasso in Python. Graphical lasso is what is used for learning undirected graphs, so you don't really pay a penalty for learning a directed, causal graph instead. These methods are quite fast.

And as I said, you can add in interventional data, and this really helps; it's a completely different game when you actually have interventional data. It makes the problem so much more manageable, because an intervention really tells you what happened: who is upstream and who is downstream. So if you want to play around with any of this, we have everything in nice Python packages, with interventional data and different kinds of preprocessed Perturb-seq data, where you can play around with all of these different algorithms. And I will show you an application at the end, when we get to SARS-CoV-2, where we really use these kinds of algorithms to figure out the underlying gene regulatory networks.
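To make the search concrete, here is a heavily simplified sketch of the permutation-based greedy search just described. It uses a crude partial-correlation independence test and adjacent transpositions as neighbor moves; the actual greedy sparsest permutation algorithm uses proper conditional independence tests and different moves (covered-edge flips), so treat everything here as illustrative only:

```python
import numpy as np

def partial_corr_independent(data, i, j, cond, thresh=0.05):
    """Crude Gaussian CI test via partial correlation (illustrative only)."""
    idx = [i, j] + list(cond)
    prec = np.linalg.pinv(np.cov(data[:, idx], rowvar=False))
    pc = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    return abs(pc) < thresh

def minimal_imap(data, perm):
    """DAG consistent with the ordering `perm`: draw i -> j (i before j)
    unless X_i and X_j are independent given the other predecessors of j."""
    edges = set()
    for b in range(len(perm)):
        j = perm[b]
        for a in range(b):
            i = perm[a]
            cond = [perm[c] for c in range(b) if c != a]
            if not partial_corr_independent(data, i, j, cond):
                edges.add((i, j))
    return edges

def greedy_sparsest_permutation(data, n_sweeps=5):
    """Greedy walk over orderings: move to a neighboring permutation
    whenever its minimal I-MAP does not gain edges; the score is simply
    the edge count (sparsity)."""
    p = data.shape[1]
    perm = list(range(p))
    best = minimal_imap(data, perm)
    for _ in range(n_sweeps):
        for k in range(p - 1):
            cand = perm.copy()
            cand[k], cand[k + 1] = cand[k + 1], cand[k]  # adjacent transposition
            g = minimal_imap(data, cand)
            if len(g) <= len(best):
                perm, best = cand, g
    return perm, best
```

The key structural point from the talk is visible here: the data (observational or interventional) only enters through the I-MAP construction, while the search itself runs over orderings, which are invariant under interventions.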
Okay, so that's a bit of an overview of where we are in terms of causal structure discovery and learning gene regulatory networks. Do these algorithms scale nowadays? 20,000 nodes is still a bit hard. Usually you preprocess anyway, so 1,000 or 2,000 nodes is definitely doable; if you really want to spend a lot of time, then maybe you can do 20,000. And people have applied these algorithms at that scale when looking at the brain and neurons, wanting to figure out the network between neurons: Frederick Eberhardt is someone who has applied the algorithm I just showed you on networks of up to 100,000 nodes. So they do scale; that's maybe important to know.

Let me quickly check if there are any questions I should answer before I go on. Nothing on Slido, maybe from the audience? No, there are no questions at the moment; I would have had one about scalability, but that you answered already, so I think you can continue. Perfect. Okay, good.

Maybe I can also say something about how we validate these algorithms. Since you have knockout data, or all kinds of other gene perturbation data, the way we validate is to use only a portion of the interventions for learning, and then try to predict the effect of an intervention that we did not use for learning our gene regulatory networks. That is exactly what this question here asks: how can we use all of the interventional data that we have to predict the effect of a new, yet unseen intervention? So that's exactly how we validate these types of algorithms.

Of course, I don't think we're anywhere near done here; this causal inference problem is just a very hard one. I think there is a whole lot to do in terms of off-target effects, noise, measurement noise, how we can better take care of these, and all the things that are specific to single cell RNA-seq and to interventions and that really should go into better algorithms, so that we can do a better job at solving this one problem. And maybe completely different approaches than graphical models, because I still don't think it is necessary to learn the full underlying graphical model in order to be good at predicting the effect of a new intervention. So that's just where we are right now.

Okay, so I'll now move over to the other three transport problems, which all sound very similar, but where we're going to take a very different approach, using autoencoders, not trying to learn the full underlying causal graph. In particular, with images you don't even know what the causal variables are, so you really do need a different kind of approach. In RNA-seq you at least have a coordinate system, you have genes, so these can be my causal variables; with images, what are you going to take as your coordinate system? That's an important question where autoencoders can also be helpful, as we'll see. So all of the rest is going to build on autoencoders.
In particular, I'm going to present this work here, which got accepted to PNAS earlier this week. So what are autoencoders? I'm sure many of you are familiar with them. These are special neural networks that we really came to love in our group, and many others do as well. They are special in the sense that they're not classification networks: they are a function from R^d to R^d. If you put in an image, out comes an image of the same size; if you put in a single cell RNA-seq profile, out comes an expression vector of the same size. They consist of two parts, an encoder and a decoder. Classically, the space here in the middle, known as the latent space, is lower dimensional than the input space. The intuition comes, for example, from PCA: you want to find some lower dimensional representation of your data. There was also the thinking that if we make the latent space very high dimensional, then this neural network has the capacity to just learn the identity function, and that's probably not what you care about. You want to find some meaningful representation in the latent space, one that captures something interesting about whatever you're putting in, your images or your single cell RNA-seq, maybe so that you can cluster your data nicely. So that was the intuition for why the latent space should be lower dimensional, and that's how autoencoders are usually built.

How are these autoencoders trained? If you have a training set, RNA-seq, images, whatever you have, you train to minimize reconstruction error: I have my image, and I want whatever comes out on the other side to be as similar as possible to what I put in. So if ψ is the function that maps input space to output space, and the x_i are my training examples, then ψ(x_i) is whatever comes out, and I want to minimize the sum over training examples of ‖ψ(x_i) − x_i‖², say in the two-norm. That's how you train it. And then of course the question is: when I put in new images that were not in my training set, what happens? The latent space is usually what is used for downstream analysis, for looking at your data using the learned representation, et cetera.

So there are many, many questions here. I want to understand: what do these neural networks actually learn? What is the representation that is learned? How should I choose depth, and width, which sits somewhere here? And in general, what is the function class that is actually learned by these neural networks? That's the question I want to answer. In particular, we see in the classification setting that people are going more and more to the overparameterized regime: you just make these neural networks as large as possible, and even though they can get down to loss zero, it seems like they still generalize very, very well.
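As a concrete reference point, here is a minimal sketch of the reconstruction-error training just described, in PyTorch. The architecture and sizes are made up; any encoder/decoder pair trained this way would do:

```python
import torch
import torch.nn as nn

d, k = 784, 32                       # input and latent dimensions (hypothetical)
model = nn.Sequential(
    nn.Linear(d, 256), nn.ReLU(),    # encoder
    nn.Linear(256, k), nn.ReLU(),
    nn.Linear(k, 256), nn.ReLU(),    # decoder
    nn.Linear(256, d),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(100, d)               # stand-in for images / RNA-seq profiles

for step in range(2000):
    opt.zero_grad()
    loss = ((model(x) - x) ** 2).mean()  # reconstruction error sum_i ||psi(x_i) - x_i||^2
    loss.backward()
    opt.step()
```

In the overparameterized regime discussed next, the latent width k would instead be made larger than the number of training examples.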
And so that's the setting I want to analyze: the one that is not usually used for autoencoders, where this middle layer is actually very big and I am overparameterized, so I could in principle learn the identity function. What I'm going to show you is that such an autoencoder does not learn the identity function, which is really interesting: it might actually learn something very interesting in the latent layer even though it is overparameterized.

So this is the setting I'm going to look at. I have n training examples, and I make my neural network very large, in particular with the number of training examples n smaller than the width k. Usually these networks are trained using stochastic gradient descent methods, et cetera, initialized close to zero. For the theoretical analysis I will analyze plain gradient descent, but I'll show you in all of the experiments that it really doesn't matter which optimization method you use; you'll see exactly the same results. And we optimize the reconstruction loss I showed on the previous slide.

Now let's look at the following question in the overparameterized setting, and let me simplify to the linear setting, just so that we all go through this exercise of trying to figure out what is actually happening. In general ψ is a highly nonlinear function representing the full network; for the sake of thinking this through, let ψ be a linear map A, and let me have just one training example x. Then the problem becomes: minimize ‖Ax − x‖² over all matrices A. Obviously I can get this loss down to zero, if I take away the argmin for a moment, because the problem is highly overparameterized; in particular, there are many, many solutions that make it zero. For example, A equal to the identity matrix certainly makes it zero, because then you just get x minus x. Or the projection onto x, a rank-one solution, makes it zero. And similarly there are many, many other solutions that make it zero.

So when you train, you will get some matrix A out that makes this loss essentially zero, up to numerical precision, as long as you train. But the question is: what is the network learning? Is it the identity matrix? Is it a rank-one matrix? Is it something completely different? What is it that the neural network learns in order to get to loss zero? If it's the identity matrix, then that's really not that interesting: it means I'm just passing my images straight through, no training needed, and whatever you put in, the same comes out, so certainly no interesting representation is learned in the latent space.
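In this linear one-example case you can just run the experiment numerically; here is a small sketch (dimensions and step size are arbitrary). Gradient descent from near-zero initialization picks out the rank-one projection onto x rather than the identity, because the gradient 2(Ax − x)xᵀ only ever moves A within the span of xᵀ:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
x = rng.normal(size=d)                 # the single training example
A = 1e-6 * rng.normal(size=(d, d))     # initialization close to zero

for _ in range(5000):
    grad = 2 * np.outer(A @ x - x, x)  # gradient of ||Ax - x||^2 with respect to A
    A -= 1e-3 * grad

proj = np.outer(x, x) / (x @ x)        # rank-one projection onto x
print(np.linalg.norm(A - proj))        # tiny: GD found (essentially) the projection
print(np.linalg.matrix_rank(A, tol=1e-4))  # 1: certainly not the identity matrix
```

Both the identity and the projection have exactly zero loss; the point is that the training dynamics, started near zero, select the low-rank solution.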
So let me show you an experiment. Here is a standardly used U-Net autoencoder: convolutional, deep, with all of the features that one usually uses. We trained it, well, not I, my group trained it, on one example. Let's see what happens when you have just one training example; what is the function that is learned? You put in any other image, or noise, or whatever you like, and you'll see that what comes out is always the same image. So what it learned is a point function: it maps anything you put in to this rabbit. It certainly didn't learn the identity function; it learned a very special function that maps everything to the training example. So that's interesting, and the question is whether this is a general phenomenon.

Is this surprising to you? I would argue it should be. Let me tell you why. First of all, as I showed before, there are infinitely many solutions in the overparameterized setting: infinitely many functions will map the rabbit to the rabbit but can do whatever they want on all of the other images. And it is certainly surprising because, in particular, if you take a shallow convolutional network and train it on one training example, then whatever else you put in, what comes out looks quite close to the identity function. It's not quite the identity, you see it messed the image up a little bit, it's becoming a bit blurry, but it's certainly a high-rank solution and not a point function onto one image. Similarly, if you look at linear networks, and it doesn't matter whether convolutional or fully connected, deep or shallow, you'll see that what comes out, and here I'm showing it on two training examples just because you see it more clearly, is actually a projection onto your training examples. You see here that it learns a combination of the airplane and the frog.

And in fact, I should say these are things we can prove. I will always distinguish between what we only have experimental evidence for and what we can actually prove. This here you can prove quite easily, and what we can prove here is that the shallow network is not going to memorize your training example: depth is certainly necessary for convolutional networks to be able to learn something like the memorization we had on the previous slide.

So the question is really what is happening. Can I generalize this phenomenon to more than one image? What is the importance of depth in the convolutional setting? What is the importance of the nonlinearity? As we saw, with linear networks this is not going to happen. These are the questions I want to answer. Now, all of you who have played around with autoencoders know that if I train on a lot of images, it's not the case that whatever you put in, the output is always a training image; that's definitely not what happens. So the question is: what is the function class that is actually being learned? So let me do it on many images. Okay, so now I trained on many images.
I don't remember, maybe a hundred here. So I trained on many images, and I'm going to put in corrupted training images, quite heavily corrupted: I'm removing 50% of the image. And I ask what comes out. This is the standard thing when you apply an autoencoder; look only at the first part here. What comes out is certainly not a training image. So it's definitely not the case in general that you get the same phenomenon as when you train a hugely deep autoencoder on one training image: with more training images, in general, what comes out is not your training image, and all of you know that.

So what is happening? What is really nice, and this is how we analyzed this phenomenon and were able to identify the function class learned by autoencoders, is that because it's a function from R^d to R^d, you can iterate the function: whatever image comes out, I can put it back in, and I just iterate the map and see what happens. That's what we did here: you get an image out, you put it back into the trained autoencoder, and you iterate and iterate, and you'll see that in the end you actually get one of your training images back. I think that's a very nice way to see what the function class learned by these overparameterized autoencoders is: these are functions that are heavily contractive at the training examples. It's a very special function class. It means that these autoencoders are self-regularizing: they like to stay close to the training examples. I'll argue on the next slide how useful this is, but let's go through a couple more examples just to make clear what this phenomenon means.

You see here that even if you remove 50% of the image in the center and run these corrupted images through, so here, 421 out of 500: this means 500 training images were used, and if you corrupt these training images by setting half of each to noise and just iterate the map, you see that in 421 of the cases you actually end up at the correct training image. So the basins of attraction are really quite large. It depends a bit on which part of the image you remove, but it is still quite impressive that you end up at the right image, and here are just some examples.
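The recovery mechanism is literally just function iteration; a minimal sketch (reusing the hypothetical `model` from the earlier sketch, and assuming it acts on a single flattened input):

```python
import torch

def iterate_to_attractor(model, x, n_iter=200, tol=1e-6):
    """Iterate a trained autoencoder psi: R^d -> R^d from a (possibly
    corrupted) input until it stops moving. For the overparameterized
    autoencoders described above, the fixed point it lands on is
    typically one of the training examples."""
    model.eval()
    with torch.no_grad():
        for _ in range(n_iter):
            y = model(x)
            if torch.norm(y - x) < tol:  # reached a fixed point
                break
            x = y
    return x

# e.g. corrupt half of a training image and see which attractor it falls into:
# recovered = iterate_to_attractor(model, corrupted_image)
```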
Let me also show you in 2D how this works. Here my training examples are these stars, and what we did is train a map from R² to R². Then we put a grid on this R², iterated the map from each grid point, and looked at where each point converges to. First of all, what is amazing is that every grid point converges to one of the training examples. This is not at all clear a priori: there could be other attractors, the iteration could cycle around, or it could be attracted to some other point. But all of them are attracted to the training examples, and in color you see the basins: all of these points converge to this training example, all of the red points converge to that one, et cetera. So here you see the basins of attraction. They certainly don't look like nice Euclidean balls; they're quite different, and an important question we don't yet have an answer to is what metric is learned by an autoencoder, how the space is actually partitioned.

In terms of actually learning these big networks: here we trained on 500 ImageNet examples, with quite small networks, some nonlinearity, et cetera, and what is quite impressive is that you can make all 500 examples attractive fixed points. And you can verify that: all I have to do is look at the derivative at each of these examples and the eigenvalues there; I need all of them to be smaller than one in absolute value, and then I know it is an attractive fixed point. So I can verify that these 500 training examples are in fact attractive fixed points. What is hard to prove is that there are no other attractors. Like in the 2D example, I cannot prove there are no other attractors; I can just test, at all of the grid points, that each of them converges to one of my training examples. That's the same thing we did here: we took all kinds of noise, all kinds of other images, et cetera, and all of them always converged to one of our training examples. But of course we don't have a proof that there are no other attractive fixed points in the whole space. And you can really do this for very, very large amounts of training data.

Maybe this is the most important slide for really understanding the function class that is learned by autoencoders. So what can we prove? As I showed you, experimentally it doesn't matter how many training examples you have: as long as you're overparameterized, you get them as attractive fixed points. What we can prove, we can only prove for one training example. For one training example, under suitable conditions on the nonlinearity and initialization, where all the standard nonlinearities fall into this class and the initialization is the standard one close to zero, any training example can be made an attractor with appropriate depth. What we have is a formula for the eigenvalues of the Jacobian. What you care about is the maximum eigenvalue in absolute value of the Jacobian; for an attractive fixed point you need it to be smaller than one. The formula depends on the depth D, the width K, the nonlinearity, and the initialization, and from it you can work out, for example, the depth needed to make a training example an attractive fixed point. I think this is still quite surprising, even if it's for one training example.
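The attractivity check itself is easy to run numerically; here is a sketch of verifying the condition for a hypothetical model that takes a single flattened vector of length d:

```python
import torch
from torch.autograd.functional import jacobian

def is_attractive_fixed_point(model, x, tol=1e-3):
    """Check the condition described above: x is an attractive fixed point
    of psi if psi(x) ~ x and all eigenvalues of the Jacobian d psi / dx at x
    are smaller than one in absolute value."""
    x = x.detach()
    J = jacobian(lambda v: model(v), x)                 # d x d Jacobian at x
    eig = torch.linalg.eigvals(J.reshape(x.numel(), x.numel()))
    spectral_radius = eig.abs().max().item()
    is_fixed = torch.norm(model(x) - x).item() < tol
    return is_fixed and spectral_radius < 1.0, spectral_radius
```

Running this at each training example after training is exactly the verification described for the 500 ImageNet examples; what it cannot rule out is attractors elsewhere in the space.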
Because this is really a consequence of training: if you don't train your neural network, this is not going to happen; your training example is not going to be an attractive fixed point. It's a super strong constraint that all of the eigenvalues of the Jacobian at your training example have to be smaller than one. Think of it roughly this way: you're in a high-dimensional space, say of dimension k with k very large. For a generic map, the chance that any one direction is contractive, that the slope in that direction is smaller than one, is about one half, so the chance that all k of them are contractive is about one half to the power of k, and k is very large. So in general this does not happen; you really need training for this to happen.

Now, of course, we all know that deep overparameterized convolutional neural networks can interpolate training data, but I want to highlight that interpolation is not the same as what we're showing here. This is memorization. In the literature, memorization and interpolation are often used interchangeably, and I would argue that's problematic: memorization is more than interpolation, because memorization also requires that you're able to recover the training example, not just that you're able to store it. In particular, a function like this one certainly interpolates our three training examples, it goes through all three, but you're certainly not able to recover the training examples from it. What we're showing is that overparameterized autoencoders are able to recover the training examples, and the mechanism to recover them is just iterating the map.

So what we saw at the beginning: when you have just one training example, or very few, and huge depth, then you're actually learning a piecewise constant function, and whatever comes out is already one of your training examples. In general, however, with many examples and not too much depth, you learn a function that interpolates the training examples and is contractive at the training examples, meaning that if I iterate the map, I get my training examples out. So this one here, we know, converges to this function over here if you iterate it. This means that these autoencoders are very nice self-regularizing functions: when you make them overparameterized, they don't learn some crazy function, although they have the capacity to learn all these crazy functions. They actually learn functions that, we would say, make a lot of sense: they stay close to the training examples, and in particular, if you iterate them, you get, yeah.

Sorry to interrupt. There's one question on Slido which maybe fits here very well. The question is: thanks for the talk; what do you mean when you say that the overparameterized autoencoders are self-regularizing? Okay, perfect, I'll explain it here. Self-regularizing means they have the capacity to learn any function, like the crazy function I'm showing here, and they have the capacity to learn the identity function, but instead they learn a function that stays close to the training examples. They are regularizing themselves, in this case toward a function that is highly contractive at the training examples, and that is a regularity property. In particular, as we saw before, in the linear setting it is like learning a low-rank function.
Low rank would be one such way of regularizing, and that's what you see in the linear setting: you're learning a projection onto your training examples; that's self-regularization. Here, when you have nonlinearities, the self-regularization that happens, even as you overparameterize more and more or add more and more depth, is that you learn a function that becomes more and more like this one: with one training example it becomes more and more like a piecewise constant function, so no matter what you put in, you immediately get out a training example. If you're not super, super overparameterized, you don't immediately get out a training example, but if you iterate the map, you will get out one of your training examples. So that's certainly a form of self-regularization. Thank you. Thank you for the question.

Maybe also a question about generalization, since many people are asking about this: hey, these functions don't seem to generalize. First of all, what is the definition of generalization for autoencoders? I think the way it is currently used is also confusing. Often it is taken to mean learning the identity map: you generalize well if you learn a function close to the identity. Well, I would argue that learning the identity map doesn't even require training, so how can that be a function that generalizes well? I don't need to train anything: if you really want the identity map, just take the identity map. So we certainly need a different kind of definition of generalization for autoencoders; it should not be the identity map. I don't think there is such a definition yet, but maybe it could be something like: you still need to be able to get back your example if you heavily corrupt it. You kind of want to be close to the identity map, but certainly not the identity map itself: if you put in a corrupted training example, you don't want to get out the corrupted example, you want to get out the training example itself. So probably you want the map to be contractive at the training examples but at the same time close to the identity map. A really good notion of generalization for autoencoders is just not defined yet, but it might indeed be something like that: contractive at the training examples, but close to the identity function. And that's exactly what we're showing these functions actually learn.

Okay, so now you can also ask what happens with width and depth: does increasing width versus increasing depth have different effects in terms of overparameterization? This is quite interesting, and it suggests that maybe you want to increase width instead of depth, which I've already alluded to on the previous slides. Here we're looking at the maximum eigenvalue; for the training example to be attractive, we know it has to be smaller than one. What you see here quite nicely is that increasing width changes the variance of the distribution of the maximum eigenvalue: it just makes the variance smaller.
Okay, so now you can also ask what happens with width versus depth: if you over-parameterize by increasing width or by increasing depth, does it have different kinds of effects? This is quite interesting to see, and it kind of shows you that maybe you want to increase width instead of depth, which I've already alluded to on the previous slides. So here we're looking at the maximum eigenvalue; for the training example to be attractive, we know it has to be smaller than one. What you see quite nicely is that increasing width changes the variance of the distribution of the maximum eigenvalue: it makes the variance smaller. So that's how it makes the examples more attractive, whereas increasing depth makes the whole distribution shift. And you see the same thing if I look at the top 10% of eigenvalues: increasing width makes the variance of the distribution smaller, increasing depth shifts the whole distribution, okay? So that's quite interesting. If generalization is something like how I defined it on the previous slide, close to the identity map but still nicely contractive at the training examples, then what you want to do is really increase width, and then increase depth just a little bit so that the examples all become attractive. You want the eigenvalues all close to one, certainly not pushed down toward zero, because then everything immediately gets mapped to your training examples. So you want them all close to one, and then add just enough depth to shift the distribution so that they are all contractive, but certainly not super, super contractive. So this kind of research also gives you a bit of insight into how one should choose the network architecture in order to learn meaningful representations. So let me do the following. What I want to show you here is that the same kind of framework can also be used to embed sequences. So instead of training the network to map one image to itself, say I have a hundred images that I want to autoencode; let me instead map the first image to the second, the second to the third, et cetera, and the last image back to the first, and see what happens. What happens if you put in random noise then? Because you're breaking these attractor conditions, right? So you might think that maybe the whole thing will break down and my training examples will not be memorized anymore. We did this for movies. So here you'll see the first Mickey Mouse movie: we trained on a sequence of images, mapping the first one to the next one, et cetera. We just start with random noise, and what you get out is actually the full sequence. Okay, so you memorize the full sequence. You can do this with multiple sequences. Here we trained on two sequences of MNIST digits, one counting upwards and one counting downwards, and when you iterate from noise you'll see that you don't jump around between the different sequences, even though the numbers are so similar; you stay inside your sequence. And here you can see it in 2D, just so that you really see what is going on: I start randomly, you see all these points, they all move out to one of the limit cycles, and they just keep cycling around; here you see how it looks once it has converged. Okay, so this sequence encoding is actually quite interesting. We found that sequence encoding is much more effective for memorization than single images, which is very similar to how our brains work as well: it is much easier to remember whole sequences of things than to memorize each one of the events separately. And that's what you see here.
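A minimal PyTorch sketch of this sequence-encoding setup (toy vectors standing in for movie frames; the sizes and training budget are my own assumptions): train the map to send frame i to frame i+1 cyclically, then iterate from noise and watch it lock onto the stored cycle.

```python
# Minimal sketch: instead of f(x_i) = x_i, train f(x_i) = x_{i+1 mod n}.
import torch

torch.manual_seed(0)
d, n, width = 10, 20, 512
X = torch.randn(n, d)                    # a toy "movie" of 20 frames
Y = torch.roll(X, shifts=-1, dims=0)     # frame i maps to frame i+1 (cyclic)

f = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.ReLU(),
    torch.nn.Linear(width, width), torch.nn.ReLU(),
    torch.nn.Linear(width, d),
)
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
for _ in range(5000):
    opt.zero_grad()
    ((f(X) - Y) ** 2).mean().backward()
    opt.step()

# Iterate from random noise: after a burn-in, the iterates should sit near
# the stored frames and walk through the sequence in order.
z = torch.randn(d)
with torch.no_grad():
    for _ in range(200):                 # burn-in toward the limit cycle
        z = f(z)
    for _ in range(5):
        z = f(z)
        print("nearest frame:", (X - z).norm(dim=1).argmin().item())
```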
So if you autoencode a hundred images individually, then even with a large depth and a large width you still don't memorize all of them. However, if you encode the same hundred images as sequences, in this case five sequences of twenty images each, you see that already with depth six and a large width you can memorize all of them. And if we take all of the images and encode them as one long sequence, the same hundred images, I am already able to memorize all of them with a very, very small network, okay? So it's much, much more effective to memorize images, this associative memory aspect, when you encode them in sequences instead of as single images, which is quite interesting, and certainly something that needs to be analyzed and understood in terms of how our memory works as well. In terms of the math, this is actually easy to see: around a cycle, only the product of the Jacobians of all the maps along the sequence needs to have its largest eigenvalue smaller than one, which is much easier than requiring each individual map to have all its eigenvalues smaller than one. And that's why we got to trying this out; it's nice when you have the mathematical intuition and then you actually see that this really happens.
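A tiny numpy illustration of that intuition (my own toy numbers): neither factor below is contractive on its own, but their composition around the cycle is, so the cycle as a whole can still be attractive.

```python
import numpy as np

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

# Two toy Jacobians along a 2-step cycle: each expands one direction
# (spectral radius 1.5 > 1), so neither map is a contraction by itself.
A = np.diag([1.5, 0.5])
B = np.diag([0.5, 1.5])

# Around the cycle only the PRODUCT matters, and it contracts everything.
print(spectral_radius(A), spectral_radius(B))  # 1.5 1.5
print(spectral_radius(B @ A))                  # 0.75 < 1
```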
Okay, so all of this was done by an amazing PhD student of ours, Adit Radhakrishnan, in collaboration with Misha Belkin, who just moved to UC San Diego. Okay, so let me now tell you a bit more; this was all the work on actually understanding autoencoders, which has helped us hugely in the applications. And let me see how long I have; it's four past, okay, so I have another 15 minutes or so, and then I'll take questions. So I want to tell you about the applications where we use this, these autoencoder applications here. In particular, the last application, the SARS-CoV-2 application, is where we really used our insights into over-parameterization. The first two applications are what motivated us to look at the theory of autoencoders in the first place, so you'll see that there we had not yet used it; we use it in the third application, where graphical models and over-parameterized autoencoders all come together. Okay, but all of these are built on autoencoders. So let's see how we can use autoencoders, for example, to translate between different data modalities or between different time points. Okay, so we're translating between different modalities. How do you do this using autoencoders? Say this is RNA-seq, this is imaging, this is ATAC-seq, this is whatever else you want to look at, single-cell Hi-C, et cetera. I would like to be able to translate between these different modalities. So what am I doing? I have here four different autoencoders. Each one of them goes from a data modality, in this case RNA-seq, to some latent space and back again. And the latent space is shared among all these different autoencoders; they all map to the same latent space. Now, why does this make sense? Well, in this application we're assuming that it's the same population of cells: I just took some out for imaging and some out for sequencing. So it's the same cells, and therefore they should be matched in the same latent space, with the same distribution in the latent space, because it is the same population of cells. So what we do to couple all these autoencoders, which are otherwise decoupled and only coupled through the latent space, is to put a discriminator in the latent space to make sure that the distributions are the same there, because it's the same population of cells. To enforce this, you have an additional penalty which says: if I, the discriminator, can tell that this data point comes from RNA-seq land as compared to image land, then you're going to be punished; I really need the two distributions to match in the latent space. Okay, and with that, I now have a map that goes from RNA-seq to the latent space, and a map that goes from the latent space to images. So if I just concatenate the two maps, you give me a particular RNA-seq profile and I can tell you what the corresponding image looks like, and of course also the other way around. And just to show you how this works, here, just because it's easier to understand on faces: say now my data modalities are black-haired females, blonde-haired females, black-haired males, blonde-haired males, et cetera. These are the different data modalities, and these are the only real images. I trained it, of course, on other real images of blonde-haired females. I put in this black-haired female, and this here is a generated image, not a real one: a generated image of how this woman would look were she blonde, how she would look were she a man with black hair, how she would look were she a man with blonde hair. Okay, so that's exactly concatenating these maps and generating the corresponding image. And here is how it works on the problem that we actually care about. This is in T cells: from DAPI-stained images to RNA-seq profiles, and the other way around, from an RNA-seq profile to the image. And now, of course, as I said, these things cannot be measured in the same cell, so how can we ever validate something like this? Well, how we validated it in this case is that in an image you can of course use other color channels. So here, for example, we took other proteins. We start with an image, we go to the latent space, and we predict RNA-seq. Now, unfortunately, there is still a gap between RNA-seq and protein levels; that's what we have to deal with. But we can predict the protein levels, and this can then be validated, at least for some proteins. I cannot do it for all of them; with RNA-seq I have 20,000 genes, and this I cannot do in the image, but at least I can take two or three other proteins in the same image, look at them, and see whether I'm actually able to predict their values. And that's exactly how we did the validation here. So of course it's limited; it's only two or three at a time, right?
You can do it with other ones, but that's at least one way of validating that what you're doing is actually working: that I can really go from an image, I mean the DAPI-stained image, to predicting, in this case, the protein levels. And Carsten, you have another question.

Caroline, there's another question that is very topical here on Slido: how do you trade off between minimizing the reconstruction error and maximizing the matching of the distributions of the two modalities, I assume, in the latent space?

Yeah, okay, those are great questions. Actually, how we do it is maybe a bit different; the trade-off of course comes in, but let me tell you how we train these things. Usually, one of the data modalities is more informative, meaning it needs a larger-dimensional latent space in order to be reconstructed well, okay? So that's how you choose. In terms of the trade-off, it's more about choosing the dimension of the latent space so that both losses can actually be low. For us, in this application, we would start with the images, and we would pick a latent space dimension large enough that we're happy with the reconstruction, whatever that means for you; for us, there are certain features in the image that at least have to be reconstructed well, so that will depend on the application. Only then do we add in the RNA-seq and match the distributions in the latent space, and this is not so hard anymore, because this modality is much less informative. So that's how we do it. So this is not a real trade-off in the sense that you just have to choose the latent space dimension large enough, and as we saw in the previous part of the talk, you should anyway choose it very large, because this is not going to hurt you. So in some sense it is not a real trade-off anymore; choose it large, yeah. Okay, and because I always get these questions from people who are familiar with CycleGAN: this is very different from CycleGAN. CycleGAN goes directly between the different modalities, and it requires you to have one discriminator in each modality space, which is usually very high-dimensional. Here you have only one discriminator, in the latent space. Also, for biological applications, you do want to have all your data in a shared latent space, because then you can do all the downstream analysis in the latent space, for example clustering, and we'll see another application on the next slide. The other standard approach would be canonical correlation analysis, and the disadvantage there is that you need to have the same input and output dimension, so in particular there is no way you can do canonical correlation analysis combining images and RNA-seq. Okay, so those are the two main alternative approaches, and they just don't work here. So that's how you can use autoencoders to very easily combine different modalities and translate between different kinds of modalities. And we've validated this in the T cell example; you'll have to read the paper to find out exactly which biology we cared about here.
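A minimal PyTorch sketch of this kind of objective (the dimensions, the 0.1 penalty weight, and the single update shown are my own stand-ins; in practice you would alternate discriminator and autoencoder updates over many batches):

```python
import torch
import torch.nn as nn

d_rna, d_img, d_lat = 200, 400, 32        # hypothetical modality / latent dims

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_out))

enc_rna, dec_rna = mlp(d_rna, d_lat), mlp(d_lat, d_rna)   # autoencoder 1
enc_img, dec_img = mlp(d_img, d_lat), mlp(d_lat, d_img)   # autoencoder 2
disc = mlp(d_lat, 1)          # tries to tell RNA latents from image latents

bce = nn.BCEWithLogitsLoss()
x_rna, x_img = torch.randn(64, d_rna), torch.randn(64, d_img)  # stand-in batches
z_rna, z_img = enc_rna(x_rna), enc_img(x_img)

# Each modality must round-trip through the SHARED latent space.
recon = ((dec_rna(z_rna) - x_rna) ** 2).mean() + \
        ((dec_img(z_img) - x_img) ** 2).mean()

# Discriminator: classify which modality a latent code came from.
d_loss = bce(disc(z_rna.detach()), torch.ones(64, 1)) + \
         bce(disc(z_img.detach()), torch.zeros(64, 1))

# The encoders additionally try to fool the discriminator so the two latent
# distributions match; this is justified here because both modalities are
# sampled from the same underlying population of cells.
ae_loss = recon + 0.1 * bce(disc(z_rna), torch.zeros(64, 1))

# Cross-modal translation is then just composition of the trained maps:
img_from_rna = dec_img(enc_rna(x_rna))
```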
Okay, so then comes the next problem: moving between different time points, for cell lineage tracing, for example. And here, oh yes, I do have the reference: this is a really nice paper that came out of the Broad for doing lineage tracing using single-cell RNA-seq data. Their idea is a very nice, standard statistical framework: you have a distribution of cells at time point one and at time point two, and you would like to know how this distribution of cells maps onto that distribution of cells. The standard statistical approach is optimal transport, which addresses exactly this problem: it tries to find the transport map that minimizes the effort of moving this whole sand pile, which is one distribution, onto that whole sand pile, which is the other distribution. And it's a really nice paper, where they did this using RNA-seq data. So now the question is, how do you do this on images? Because we really care about images, as I told you: I care about applications to cancer, and I want to be able to detect cancer as early as possible, in particular earlier than pathologists currently can. That means I need to be able to generate my own data of how a cancer cell would have looked at earlier time points, before a pathologist can currently tell me that it is on the path to becoming cancerous. Okay, so how do I do this with images? How do I define a transport map on images? Because in an image, as I said, there is no coordinate system: pixel one in an image of one cell doesn't correspond to anything in pixel one of a second cell. These cells have different shapes and all kinds of orientations; you cannot really orient them so that they correspond to each other. So how can you come up with a coordinate system? With RNA-seq this worked wonderfully: gene one corresponds to gene one in another cell, so I can define a transport map, or a loss function that tells me how different two cells are. But how do you do this with images? Again, autoencoders are one way of getting a joint coordinate system. So here I have four populations of cells; in this case they're cell lines, but we also did it on tissues and single cells. This is just to show you how it works. You have these different cell lines and you map them into the latent space. And now, as I said, the latent space is amazing, because everything you are used to doing with RNA-seq you can now do with images, because now you have a coordinate system; they're all in a joint coordinate system. For example, you can do optimal transport. So I can learn the transport map that takes me from the metastatic images backwards: how would this particular image, this particular cell, have looked were it still in the normal state? Now this is in the latent space, but of course an autoencoder allows you to generate the corresponding image from any point in the latent space. And that's exactly what we did here. Take a metastatic cell; these are the real cells, and all of the others are generated cells, they're not real. You map it to the latent space, use the transport map to move all the way backwards to the normal state, and all of these intermediate images are generated. Okay, so now I can actually generate images of how this cell would have looked at all of the earlier time points. Again, exactly this I cannot validate, but what we did validate is the activation of fibroblasts over time in an experiment, so that we were able to check the kinds of features we found predictive of going forwards or backwards in time in this particular system. So that's another, I think, very nice example; the autoencoder is super helpful for any of these things, getting joint coordinate systems or moving between data modalities as we saw on the previous slide.
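A toy sketch of the transport step. The simplification I'm assuming for illustration: equal-size samples with uniform weights, where discrete optimal transport reduces to an assignment problem that scipy's Hungarian solver handles exactly. In the real pipeline the points would be autoencoder latent codes of cell images at two time points, decoded back into images afterwards.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
z_metastatic = rng.normal(0.5, 1.0, size=(100, 32))  # latent codes, later time
z_normal = rng.normal(0.0, 1.0, size=(100, 32))      # latent codes, earlier time

cost = cdist(z_metastatic, z_normal, "sqeuclidean")  # pairwise transport cost
rows, cols = linear_sum_assignment(cost)             # minimal-cost coupling

# "Move backwards in time": interpolate each metastatic code toward its
# matched normal code; each waypoint would then be decoded into an image.
for t in (0.25, 0.5, 0.75, 1.0):
    z_t = (1 - t) * z_metastatic[rows] + t * z_normal[cols]
    # images_t = decoder(z_t)   # hypothetical decode step, not run here
```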
Okay, so all of this motivated our work on autoencoders and looking at the theory: how should we choose our architectures, what are the functions that are learned, why do these autoencoders work so well for these particular applications? That's what motivated what I presented before. So now that we have all this insight on over-parameterization, let's actually try to use it. And here is how we're using it in the SARS-CoV-2 application, okay? This is the last application I want to show, and it is the problem of drug repurposing. For COVID-19, we don't want to go through the whole FDA approval process from the beginning; you want to do it fast, so you want to use drugs that are already approved and that can maybe be repurposed for a different disease. And there is this huge CMap dataset, which I'm sure many of you are familiar with, made available by the Broad. It is an amazing resource with 1.2 million samples, which are 1,000-dimensional expression vectors, under many, many different kinds of perturbations. We particularly cared about the FDA-approved drugs, applied to many different cell types, hundreds of different cell types; that's what you see here in the different colors, okay? So this data is available. What we would like to do is take any of these drugs and predict the effect of that drug on a different cell type. And we can validate this on the dataset itself, precisely because we have drugs that have been applied to many different cell types. If it works across these cell types, then hopefully it will also work on a SARS-CoV-2-infected cell type that we care about but have not measured, right? Okay, so if you are a machine learning person, you'll probably think about this as a style transfer problem: maybe a drug can just be a style. You probably know this example: here I have a neutral face and a smiley face, and this has worked very well in vision problems. I encode them into the latent space; this is my neutral face, this is my smiley face, so this vector here corresponds to putting on a smile. Now here comes a new person with a neutral face, and I want to make the image of this person smile. So I take this vector, apply it here, look at the resulting point, decode it to the image space, and hey, this person comes out smiling. Okay, so this style transfer works really, really well for image applications and many other applications.
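In code, the smile example is just vector arithmetic in the latent space; a tiny numpy sketch (toy codes standing in for encoder outputs; in practice the style vector is usually averaged over many neutral/smiling pairs, and the result is pushed through the decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
z_neutral_a = rng.normal(size=32)        # person A, neutral, encoded
z_smile_a = z_neutral_a + 0.5            # person A, smiling, encoded (toy shift)
z_neutral_b = rng.normal(size=32)        # new person B, neutral, encoded

smile_vector = z_smile_a - z_neutral_a   # the direction "putting on a smile"
z_smile_b = z_neutral_b + smile_vector   # predicted latent code: B smiling
# image_b_smiling = decoder(z_smile_b)   # hypothetical decode step
```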
So the question is, can I do this for drugs? I have a cell type, and I know how it changes when I add the drug; I have a new cell type, and I want to predict what happens when I add the same drug. Can I just take this vector, which corresponds to adding the drug, move it over here, and see what comes out? There is related work on this in the linear setting, when you actually know the underlying causal graph. It's very nice work which says that in general you cannot do this: a perturbation is not just a style. There are if-and-only-if conditions, when you know the underlying causal graph, that tell you when you can actually transport a causal effect. But here we don't have a causal graph, so we can certainly not check any of those conditions. So does it work? We can check whether it works on this particular dataset, and let me tell you that it doesn't. Okay, so usually one uses these under-parameterized autoencoders, right? And what I'm showing you here is: I take two different cell types, and what I need is for the effect of the drug in cell type one and in cell type two to be aligned, because only then can I transport the effect over and actually get out the real effect of the drug. If these two drug-effect vectors are not aligned, as on the previous slide, this approach cannot work. So what I'm showing you are the angles between the effects of the same drug applied to two different cell types; each point here is a different drug. These angles, measured as cosines, range between minus one and one. And for the under-parameterized autoencoders you see that whether you compute them in the original space or in the latent space, the angles are essentially the same: the latent space doesn't make these effects better aligned or worse aligned. And of course, when you use under-parameterized autoencoders, your reconstruction is not great either. So what happens when you use over-parameterized autoencoders instead? Our intuition, and we don't have a proof yet, came from these autoencoders being attractive at the training examples. We were hoping that instead of only zero-dimensional objects like points being attractive, the same would hold for lines, so that the effects would become more aligned when they are actually similar to each other. And it's pretty nice that this is exactly what happens: you see that the effects are much more aligned, the cosines are close to either one or minus one, meaning they are aligned as lines. And of course you get perfect reconstruction, because we are over-parameterized, so the training examples are certainly matched. And as we saw, you're still close to the identity map, so you also generalize well on other, unseen examples. In fact, the alignment is similar to what you get with PCA, except that with PCA you of course throw away a lot of information. I'm showing it here just for two different cell types, but you see the same thing across all the cell types that are available in CMap.
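A hedged numpy sketch of this alignment diagnostic (my own stand-in data): per cell type, take the mean latent shift a drug induces (treated minus control), then compute the cosine of the angle between the two cell types' shift vectors. The talk's observation is that over-parameterized autoencoder latents push this toward plus or minus one; with the random stand-in data below it will sit near zero.

```python
import numpy as np

def drug_effect(z_control, z_treated):
    # mean latent shift induced by the drug in one cell type
    return z_treated.mean(axis=0) - z_control.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
zc1, zt1 = rng.normal(size=(50, 32)), rng.normal(size=(50, 32))  # cell type 1
zc2, zt2 = rng.normal(size=(50, 32)), rng.normal(size=(50, 32))  # cell type 2

v1, v2 = drug_effect(zc1, zt1), drug_effect(zc2, zt2)
print("cosine alignment:", cosine(v1, v2))   # near +/-1 only if effects align
```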
Okay, so this intuition can really help you get a latent space where you can transport the effect of a drug from one cell type to another, by making the effects more aligned, more contractive. And with that, let me come toward the end. We do have a list of drugs, and now I should also say how we used the graphical models. This gives a nice way of validating: once you get out a list of drugs, you look at the learned regulatory network and see which genes these drugs are targeting. A drug can of course only be effective if its targets are upstream of the genes that are differentially expressed in the disease. So what we did is, for all the drugs that we ended up with, we learned these gene regulatory networks and ranked the drugs by how many of the differentially expressed genes in the disease are actually downstream of the targets of the drug. And that's how we got to this particular protein, RIPK1, which is a target of several of the drugs that we have listed here. I think this is really interesting: it's nice that this is the one that came out as most upstream of the differentially expressed genes in this disease. It changes its role with aging, and of course we know that this disease is highly age-dependent, which is something we did not put into our analysis as a constraint. And it directly binds the SARS-CoV-2 proteins, okay? So these are the drugs that we're now also testing experimentally. So I think it's nice that you can use this over-parameterization to come up with better ways of predicting the effect of a drug on different cell types.
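A small networkx sketch of that ranking step (the toy graph and gene names are my own, not the learned networks from the talk): score a drug by how many disease-differentially-expressed genes lie downstream of its targets in the regulatory network.

```python
import networkx as nx

# Toy directed gene-regulatory network and toy differentially expressed genes.
G = nx.DiGraph([("RIPK1", "g1"), ("g1", "g2"), ("g2", "g3"), ("g4", "g5")])
de_genes = {"g2", "g3", "g5"}

def drug_score(targets):
    # count DE genes reachable downstream from any of the drug's targets
    downstream = set().union(*(nx.descendants(G, t) for t in targets))
    return len(downstream & de_genes)

print(drug_score({"RIPK1"}))   # 2: g2 and g3 are downstream of RIPK1
print(drug_score({"g4"}))      # 1: only g5 is downstream of g4
```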
Okay, and with that, I'll end. I hope I was able to show you that we developed theoretical and algorithmic frameworks for integrating and translating between observational and interventional data, using causality and autoencoders. And I'm really excited about autoencoders: I think they're extremely useful for data integration and translation, but also, of course, for studying properties of neural networks, as I hope you saw. They're just easier to study because they map from R^d to R^d, which makes things much, much simpler. The self-regularization I'm putting up here again; I hope I was able to explain what I mean by that. This also provides a new and biologically plausible mechanism for associative memory, which is just given by iteration. I didn't talk about this, but the fact that you can recover training examples just by iterating from noise is a huge privacy concern. It means that you maybe don't want to share trained neural networks between different hospitals, for example, because I would be able to figure out what the training examples were. So these are certainly things that require more analysis. And there are many, many open problems. I'll mention this one: we would really like to understand the metric that is learned by an autoencoder in the latent space, and what these basins of attraction are; we want to get similar results for classification; there is the definition of generalization I already mentioned; et cetera. Okay, so I'd like to give thanks: this work wouldn't have been possible without a really, really amazing group of people. In particular, the work I talked about on the theoretical properties of autoencoders was Adit's work, and the applications of these autoencoders, over time and between different data modalities, were Karren's work. The SARS-CoV-2 work was a big project where we all came together, combining the different things we're doing in our lab. And on the causality side, again, many different people are working on it: Raj's work, Karren's work, Chandler's work; if you're using the Python package, Chandler put together a really nice and, I think, very intuitive package there. And of course a lot of funding as well. Okay, so thank you very much for your attention, and I'll take any questions that you might still have.

Thank you very much, Caroline. This was a great overview of all the different things that you can do with both graphical models and autoencoders in computational biology. So we now have time for a few questions. Is there one from inside the network? I don't see a raised hand, but if you are thinking of a question, do so now and raise your hand. In the meanwhile, we go to the Slido questions; let me order them chronologically within the talk. There's one about the first part, which says: exciting topics, could you briefly comment on how you achieve or improve scalability for the DAGs? This is question one, Caroline.

Yeah, how do you improve scalability? I mean, the problem is NP-hard in that case, under all the different kinds of assumptions, and then it depends on how you encode these things. So I don't think there is a single good answer for improving scalability, other than that we work over a different, smaller search space. Maybe that's one thing I can answer: permutations are a smaller search space than graphs, so that's one way of improving scalability. But other algorithms also scale now, right? GSP, greedy sparsest permutation, scales very, very well, and for the search over graphs people have by now put in a lot of effort, so those also scale. I'm not saying that our algorithm scales better than others; others have put in much more work into scaling up to large graphs. In particular, GES, greedy equivalence search over graphs, scales to huge, huge networks; that's still the fastest one.

And is this due to improvements of the implementation, or due to simplifications of the optimization problem being made, or what's the major strategy there?

So greedy equivalence search has been around for a long time, and that speed is really based on the implementation. The PC algorithm has also been around for a long time; that's also due to the implementation. They put in a lot of work. We haven't put in any work yet in terms of implementation; that's why we're so happy that already now, probably just because it's a smaller search space, it scales. I mean, it wasn't our work that compared all the speeds; that's why I'm so happy: it's someone else who got our algorithms to work straight out of the Python package, and it scales in exactly the same way if you go to, like, a hundred thousand nodes. But yeah, we haven't yet put any effort into really making it scale well.

Good, thank you. Then there is a question about the second part.
I think it is: are these width and depth intuitions applicable to other types of architectures than CNNs? Thank you.

Yeah, okay, very good. So this is not just CNNs; this certainly holds for fully connected networks, and in particular the kinds of plots I was showing were with fully connected networks. With convolutional neural networks, you only get to see this after a certain depth. I didn't go into that, but for convolutional networks, if you look at the matrices, they have zeros in them. So to memorize a vector, you need at least enough layers that, when you multiply up your matrices, no structural zeros remain, because otherwise you cannot memorize a vector. So all of these plots happen for convolutional networks as well, just after a certain depth. We have it for fully connected and for convolutional networks; those are the two that we analyzed in this setting.
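A toy numpy check of that depth remark (my own construction, ignoring strides, pooling, and padding details of real CNNs): the linear map of a width-3 convolution is a banded matrix, and a product of such matrices loses its structural zeros only once the accumulated band covers the whole matrix, which for signal length n takes n - 1 factors.

```python
import numpy as np

n = 16
B = np.zeros((n, n))
for i in range(n):                       # banded matrix of a width-3 conv
    for j in range(max(0, i - 1), min(n, i + 2)):
        B[i, j] = 1.0

P = np.eye(n)
for depth in range(1, n + 1):
    P = P @ B
    if np.all(P > 0):                    # no structural zeros left
        print("fully dense after depth", depth)   # prints 15 = n - 1
        break
```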
Thank you. Now there's a question from inside the network. Giovanni Bisola, please go ahead, Giovanni.

Hi, hello, thank you for the talk, it was really amazing. The question that I have, and I might have missed the point, is more of a general discussion. Often, plain autoencoders are not really treated as generative models, while variants like variational autoencoders are; but you seem to be analyzing their properties in that regard anyway. Was that a consideration or an obstacle during your analysis?

Yeah, so okay. Many of these things just work for variational autoencoders as well; I simply didn't differentiate between them. So yes, whenever I'm showing you images, those are variational autoencoders, so that you can move around in the latent space more easily. Yes.

Can I ask a quick follow-up question on that? Often the reconstruction quality of variational autoencoders is a bit blurry in the case of images. Have you found that to be an issue, or was the quality still sufficient to get good results in your analysis?

Yeah, for us the results were good enough, yes. And also here, you can play around with it: nowadays you can get autoencoders to output quite good images that people are very, very happy with. So this is maybe changing a bit, going from GANs to seeing that autoencoders can actually do really, really well. For us this was good enough, in the sense that the kinds of features we care about, the heterochromatin and how it packs, et cetera, can be reconstructed very well. Again, if you over-parameterize more, your images will also become less blurry; that's another insight you can really play with, and I think that's super important to do. In particular, if you have an over-parameterized autoencoder, your reconstruction loss will actually be zero, so your training images are not going to be blurry. Yeah.

Thank you.

Another question by Lukas Mirander. Lukas, please.

Thank you very much, Carsten, and thank you, Caroline, for the talk, it was very inspiring. Following up on Giovanni and the variational autoencoders, I wanted to bring back a problem that I've been facing while using them. If I over-parameterize the latent space, many of the dimensions collapse, and after training they remain just the same as the prior. I wanted to ask if you have faced that problem and if you found any way around it.

Yeah, so these things happen, and this is a lot of parameter tuning, which everyone has to do with these collapse problems. So I think it's still a lot of engineering. This is easier if you get rid of the variational part entirely, so you could also just try that and see what happens. It depends on your application; I don't know what you want to do downstream with it. Here, for example, we don't use anything like that; we just use a fully connected network. So it really depends on what you want to do downstream, but you can, of course, try to completely change your architecture and use something much simpler, where maybe you can analyze these things better. But otherwise, yes, it is still a whole lot of parameter tuning; I'm not going to hide that. We now do a lot of things with plain fully connected networks, because they actually work quite well for the downstream analysis too, as long as they are very over-parameterized.

Thank you very much.

In fact, I would have a question, Caroline, to conclude the question and answer session. You showed these impressive applications of autoencoders. Have you also encountered applications where they did not work, maybe where the sample size is still too small? This is quite common once you move to clinical applications, where the sample size is not the same as in molecular data sets. So do you also have examples of where this does not work yet, or does not work as you would expect given these other successes?

So for us it has really worked, and actually the sample sizes are not big. But since I'm showing you all these advantages of over-parameterization, this is something we already knew from our work: sometimes you actually want to throw away data. I know it sounds really crazy, but if you're not able, because of computational costs, to make your network as over-parameterized as you would want, it can actually make sense to throw away data, and there has recently been a paper which shows exactly that. So small sample size is certainly not a disadvantage here. Of course, you need some data to get at least a bit of an intuition of what the manifold is that you're trying to learn, but it's not the case that you need huge, huge amounts of data. Here we have maybe 1,000 samples, not the standard many thousands of samples. So I think one needs to rethink what over-parameterization actually shows you: it does mean that at some point you may need to throw away data to get better results. It sounds counter-intuitive, but this is the case now that we understand this better; there are all these double descent curves, Misha Belkin's work, et cetera. So I think a rethinking is happening.

Good, thank you very much, and thank you for this great talk.

Thank you very much, and for joining our network and our summer school here.