It's my great pleasure to introduce Fabian Theis. As our first speaker of the day, Fabian is a star in machine learning in biology. He holds two PhD degrees, in computer science and physics. He's head of the Institute of Computational Biology at the Helmholtz Center in Munich, affiliated with the Sanger Institute, and he's heading the AI initiative within the Helmholtz Society. He has won numerous awards, including ERC grants and the Schrödinger Prize, so he's really a star in the field. Lately he has been focusing on deep learning in the single-cell field, and I'm very excited to welcome Fabian here and to learn more about his current research. I just mentioned to him that almost a decade ago we were both invited to a similar symposium, where we were talking about the future of machine learning in systems biology and in medicine. I roughly remember what Fabian said back then, and I'm very curious to get another update. I've listened to talks of his in between, but now, a decade later, I'd like a fresh update on where the field stands. So with this, welcome Fabian, it's an honor to have you here.

Thank you. Thanks for this kind introduction and invitation, and for this great opportunity to speak at such a very nice winter symposium. It looks like a really fun day we have ahead of us, and of course also a very impressive ITN. I'm happy that we've been, in parallel, sort of getting pulled in, and of course I was contributing to this ELLIS network that now ties us more and more together. I'm actually looking forward to many more of these types of meetings, and potentially to bringing joint students together and so on; very excited about that. So let me start. I want to talk about a bunch of vignettes of work we've been doing in the past few years on latent space learning in single-cell genomics, tying into some of the topics that you might hear later today as well.
I just mentioned graph convolutions as one thing to tease at the beginning; even though we're only just starting with that, I think there are all kinds of interesting representation learning problems in the field. Let me start slowly and outline a little bit what the problem is. I think we've all heard by now that single-cell genomics is a nice thing to do: instead of taking the smoothie of bulk genomics, you divide the sample out and try to see, at single-cell resolution, what is happening. One of the key advances in the field has been the tagging of cells at the microfluidic, in this case droplet, level, so that you can multiplex those experiments in a very strong fashion. It was named Method of the Year already early on, but it has really been commercialized, I would say, in the past five years. Because of that, the number of data sets has really been exploding. For those of us interested more in the computational, machine learning side, this is turning into one of those disciplines where you're really talking about big data sets, which are sparse and noisy and have all kinds of issues. When we were talking, I guess those twelve years ago, about systems biology modeling, at least genome-wide, everything was essentially low-rank and you had to do all those linear regularizations and so on; in this case you have full covariance matrices, so there's real potential for getting more into regulatory mechanisms, but then of course also learning about the state of a cell or a system under perturbations. So it's a very exciting area, and going back to one of the older perspectives in the field, the question is what you could do with that. I wanted to put this a little bit into the precision medicine context, so I'll try to add a little medical example to each of these more biotechnology-driven vignettes that I'm showing.
Of course you could compare to a pathological situation: say you have some type of cancer, a set of cells in a healthy tissue. You might not have picked them up in a bulk experiment, but if you do single-cell analysis and then cluster the profiles — that means you do a low-dimensional visualization and maybe also a cluster analysis — you pick up those disease-associated cells, because their expression profiles fall together versus the others. And while all of this sounds straightforward, it turned out that the field actually spent a lot of time on the computational side to do each of these steps properly: all the preprocessing and normalization needed to make profiles comparable, because the noise models are different. People have been talking a lot about efficient clustering for these things, then about dimension reduction, and about the differential comparison of essentially densities in potentially unnormalized situations — these actually were a bunch of challenges. I would say one of the key steps in many of the analysis tools that have been popping up is this type of, in this case they're called cell-type maps — I'm sure you've all seen those UMAPs or t-SNEs. These are low-dimensional representations of the high-dimensional expression profiles; essentially they're all latent spaces that we try to learn, so essentially a representation learning problem. We can use them for visualization if we go to 2D, but more or less all of the algorithms that I'm aware of factor along a lower-dimensional representation — say 10 or 100 dimensional, some type of submanifold where the cells actually reside — that you would then build your actual downstream algorithms upon. So this is about removing sparsity and maybe also making the data more manageable. So this is exciting.
Our lab has actually contributed on the technical side to really make those things available: we developed a tool called Scanpy — single-cell analysis in Python — which scales very well to high dimensions and gives a framework, also a data structure, to deal with this type of data set, and it has been rather popular. There's an epigenomics extension, together with the Colomé-Tatché lab, coming out now. And I think people have really been using this as a backbone for doing analysis, in particular in the Python world. So let me just give a short overview of my talk. What I want to do is show some of the latent space modeling approaches we've been pursuing, put them in context with other work in the field, and go through them sequentially. Pseudotemporal ordering was one of those big new types of challenges that people became aware of, and essentially it's a one-dimensional latent space model. So I will be talking about single-cell representation learning first within one cohort, and then, with the data sets growing, also as a multi-study integration problem. In the later parts I'll briefly speak about how things could be done in the multimodal setting and potentially also in the spatial setting — and this is where the graph convolutions in particular could play in, partly also for regulatory inference, though that hasn't been giving the strongest results. Let me just show one of those machine learning challenges. This is like the canonical latent space learning example in the field: you have gene expression profiles from a bunch of cells, you go to a high-dimensional space of, say, 20,000 dimensions, and each cell is a dot in that space.
Just to show you an example — a collaboration with the Göttgens lab, this was early on, I think qPCR data — we take the gene-by-cell matrix and just bi-cluster it, and we see some structures popping up. So if you just use standard clustering, you find some structure in this type of data set, but it's maybe not so super clear. What has turned out, for many biological processes — for development, but then also for diseases — is that those cells don't form big blobs or groups; they may have fine structure, they may really live on a manifold in this, say, 20,000-dimensional space. Of course it's noisy, but it's a structure that we can then visualize — that's the t-SNE or UMAP that you always see — and then also analyze. Such an analysis could, for example, be that we know this here is some type of progenitor cell state that turns into downstream cell type A or B, and we can let many random walks run across the process and try to understand what's happening. So this is the idea of pseudotemporal ordering: essentially searching for major directions by similarity of cells. The first tools were, for example, Monocle from Cole Trapnell and the beautifully named Wanderlust from Dana Pe'er, who will be speaking later today; we developed diffusion pseudotime, which is a Markov chain type of approach for quantifying these things. In any case, you come up with some type of ordering of cells by proximity — it's something like distance learning once you fix an entry point — and there are now approaches to infer where things start, for example along the direction of RNA velocity, which I won't speak about today. I think there are still interesting questions there. In any case, if you apply this to this type of data set — you see it here to the right, I hope you can see the mouse pointer.
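To make the idea concrete, here is a minimal sketch of pseudotemporal ordering in the spirit of these methods (not DPT itself): order cells by graph shortest-path distance from a chosen root cell on a k-nearest-neighbour graph. The toy data, the `pseudotime` helper, and all parameters are illustrative assumptions.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)

# Toy "cells": 100 cells along a 1-D differentiation trajectory,
# embedded in a 20-dimensional expression space with noise.
t_true = np.linspace(0.0, 10.0, 100)
direction = rng.normal(size=20)
direction /= np.linalg.norm(direction)
X = np.outer(t_true, direction) + 0.01 * rng.normal(size=(100, 20))

def pseudotime(X, root, k=10):
    """Geodesic distance of every cell from a root cell over a kNN graph."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    nbrs = np.argsort(d, axis=1)[:, 1:k + 1]                    # k nearest neighbours
    dist = np.full(n, np.inf)
    dist[root] = 0.0
    heap = [(0.0, root)]                                        # Dijkstra from the root
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist[u]:
            continue
        for v in nbrs[u]:
            alt = du + d[u, v]
            if alt < dist[v]:
                dist[v] = alt
                heapq.heappush(heap, (alt, v))
    return dist

pt = pseudotime(X, root=0)
```

Fixing the root cell is the "entry point" mentioned above; real methods such as diffusion pseudotime use diffusion distances rather than raw shortest paths, which makes the ordering more robust to noise.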
Then what changes is that if you start from here, suddenly you see a sequence of genes being turned on and turned off. So there's a change point here, and you can recapitulate a bunch of processes — in this case in development — that you would be expecting. This has really taken the field by storm; a lot of people apply this now whenever they have a more continuous trajectory. If you have a discrete setting with terminally differentiated cells, very often clustering does the job, but for those more continuous processes, one-dimensional latent spaces are a popular approach — even the discrete case, the zero-dimensional one, you could call latent space learning. In any case, a lot of people have been thinking about this; I think there are by now more than 70 or 80 such pseudotemporal ordering algorithms around, and big reviews have been written on how they compare; some of our tools are doing a good job. I think it's an interesting area. So, my summary: a one-dimensional latent space, possibly curved, potentially branching, and you can add analyses on top — for example, differential comparison of what goes differently in disease. Let me take a step back and ask a more general question: how could we more robustly, more generally describe latent spaces? For that, a lot of us, as well as others, have been introducing the idea of leveraging autoencoders to describe and more robustly model high-dimensional latent spaces, and to factor this problem out from the downstream analysis. This approach is now quite established, I think, with this explosion of data sets, and multi-domain adaptation is becoming a big problem, as I'll say more about shortly.
I think this follows rather naturally from the simple idea of an autoencoder. Just to review: we take our gene-by-cell matrix — in this case we go along the gene axis — and we want to squeeze it down through some type of bottleneck; in the linear case with a mean-squared-error loss, that would just be learning a PCA. Then blow it up again and minimize the reconstruction error — simple. At the time, we were actually thinking about it just as a denoising method, just making things more robust, and it turned out to be useful. This was work by Lukas Simon, who has moved on to a professorship in Texas, and Gökcen Eraslan, who is now at the Broad — essentially writing up an autoencoder, but with a noise function adapted to the count situation that we have. So we have an additional parameter to model a negative binomial, as well as a zero inflation, which for some situations is necessary. He could then reconstruct latent spaces, and he could show — this is just toy data — that if you have some type of ground truth and add dropout, then denoising with that adapted noise function does a much better job than just mean squared error. That was nice, and we could show that this can be used for denoising in a lot of real data; it's rather fast to apply, too. But what I was most fascinated by is that in this case — but also for PBMC and similar data sets — when we really drove this bottleneck down, not to the typical 10 or 100 dimensions that you usually work with, but to two, just for visualization, we actually got a very interesting latent space. Oh, and I should say this approach obviously scales much more nicely than graph-based methods — classically people have been building cell-by-cell graphs and using those as descriptions of latent spaces — because of the efficient gradient descent you can do with neural networks, this was much, much easier to scale.
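A core piece of that adapted noise model is the negative binomial likelihood. Below is a small, self-contained sketch (not the actual DCA code) of the NB negative log-likelihood one would minimise instead of mean squared error; the parameterisation by mean `mu` and dispersion `theta` is the standard one, and the sampled data are illustrative.

```python
import math
import numpy as np

lgamma = np.vectorize(math.lgamma)  # elementwise log-gamma

def nb_nll(x, mu, theta):
    """Negative log-likelihood of counts x under NB(mean=mu, dispersion=theta)."""
    x = np.asarray(x, float)
    mu = np.asarray(mu, float)
    theta = np.asarray(theta, float)
    log_theta_mu = np.log(theta + mu)
    ll = (lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
          + theta * (np.log(theta) - log_theta_mu)
          + x * (np.log(mu) - log_theta_mu))
    return -ll.sum()

# Simulated counts with mean 5 and dispersion 2
# (numpy's parameterisation: mean = n * (1 - p) / p).
x = np.random.default_rng(0).negative_binomial(n=2, p=2.0 / 7.0, size=1000)
```

In an autoencoder, the decoder outputs `mu` (and optionally `theta`) per gene, and the fitted mean serves as the denoised expression; a zero-inflated variant adds a mixture weight for the excess dropout zeros.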
But in any case, the key point was this bottleneck here — that was actually interesting. What you see here is a PBMC data set, colored by cell type. In this case it's not a UMAP; it's really a two-dimensional latent space, and it looks like one of those t-SNEs. So we were quite hopeful about this structure: somehow the network, to learn the latent space, needed to adapt in such a fashion that it would do something like this. So we were thinking that this latent space could actually be interpreted. We've been trying to think about what to do with these latent spaces, and one of those ideas — this was work from two years ago by Mo and by Alex, who is now with Cellarity — was how to model differential effects in latent space: if we have a normal situation and potentially a disease situation — and I'll frame this today a bit more as drug perturbation than disease — can you understand what the shift in gene expression space looks like? Can unperturbed versus perturbed cells be somehow modeled? Of course we know that those perturbations are cell-type specific, so there's not going to be one linear effect in gene expression space; it's going to differ, as in the illustration. And if we have such a model, we could try to predict how a new situation would behave — I think that could have a whole bunch of applications in drug modeling, drug screens and so on, but of course also in understanding what those perturbations are. So what we did: we came up with our autoencoder setting — in this case actually a variational autoencoder, following up on what Nir Yosef had been doing in his scVI model — and learned the latent space.
And then we were hoping — similar to what computer vision does, and we actually showed in this situation that it works — that the latent space would map most of this perturbation into one direction. We have two additional constraints — and that's actually still an open question we're following up on quite a bit now — to encourage this linearity, but for the strong perturbations we observed here, it actually ends up as a perturbation vector that we can apply in latent space, decode again, and thereby predict the perturbed situation. So our scGen is nothing more than a variational autoencoder together with what you would typically call latent space vector arithmetic, and with this we could actually do out-of-sample prediction. In this particular example — I'm just showing you a two-dimensional plot — we looked at PBMCs, unstimulated as well as stimulated with interferon-beta. We left out one of the cell types — I think in this case it was CD4-positive T cells — and we were then able to robustly predict those profiles. This has been published, so I'd just encourage you to read up on how it works, but we've been following it up with applications in cross-study integration — essentially batch effect removal, where it's this delta that we're implementing as a shift over latent space, i.e. domain adaptation across labs — but also for cross-species prediction: these atlases are being made across all kinds of organs in many different species, and how could you see similarities or differences, what type of shifts are happening between them? Those are actually interesting questions to ask. But this is a rigid model; it can really only robustly model a single perturbation, so we've been thinking since then about how to extend those perturbation models.
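The latent space vector arithmetic behind this out-of-sample prediction can be sketched in a few lines once an encoder is given; here I work directly in a toy latent space, so the data, centroids, and the shared `shift` are illustrative assumptions, not the scGen implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 10
shift = rng.normal(size=dim)                 # shared perturbation direction

def cells(center, n=500):
    """Sample cells around a cell-type centroid in latent space."""
    return center + 0.1 * rng.normal(size=(n, dim))

a = rng.normal(size=dim)                     # training cell type centroid
b = rng.normal(size=dim)                     # held-out cell type centroid
ctrl_a, stim_a = cells(a), cells(a + shift)  # both conditions observed for type A
ctrl_b = cells(b)                            # held-out type B: control only

# Vector arithmetic: estimate the perturbation vector on the training
# cell type, then transfer it to the held-out one and decode.
delta = stim_a.mean(axis=0) - ctrl_a.mean(axis=0)
pred_stim_b = ctrl_b + delta
```

In the real model, `ctrl_*` and `stim_*` would be encoder outputs and `pred_stim_b` would be passed through the decoder to obtain predicted expression profiles.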
One thing we've been doing — I won't say much about it, but I wanted to show it because I think it's a potentially interesting idea — is to set up a conditional variational autoencoder. Now we add the perturbations as conditions, and there can be multiple, just in normal CVAE style. Then we do the variational version, so we model a mean and variance, sample from that, and feed this latent variable together with the condition into the decoder — so far that would just be a normal CVAE, building upon the VAE loss: a likelihood, i.e. reconstruction, term — in this case just MSE, you can also do negative binomial — together with a Kullback-Leibler term fitting the densities. What we added on top is an MMD loss on this layer, to ensure that whenever we change conditions, the condition effect is removed at that step. We call this trVAE, a transfer variational autoencoder; it's trying to make this layer condition-invariant, and in this particular setting it turned out that we were able to handle multiple of those conditions. A key extension we're currently working on is to extend this to a full latent space factor model, but that is ongoing; even without it, this was already working rather well. When we apply this — in this case we added a convolutional version, because we're also playing around with images, with a view to spatial data, though there are not so many data sets around yet to focus on — you can look at a data set called Morpho-MNIST, which has thickening and thinning operations; we trained on some of the digits and did out-of-sample prediction on the other digits, and you could see that we can interpolate this type of perturbation robustly. Similarly, we could also predict perturbation effects in our single-cell settings.
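The MMD penalty that encourages a layer to be condition-invariant can be written down compactly. Here is a minimal NumPy version of the (biased) squared-MMD estimator with a Gaussian kernel; the sample data and the bandwidth `sigma` are illustrative assumptions.

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared maximum mean discrepancy, Gaussian kernel."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))   # layer activations under condition 1
y = rng.normal(size=(200, 5))   # same distribution -> MMD near zero
y_shift = y + 2.0               # shifted distribution -> large MMD
```

During training, this term is added to the VAE loss over activations grouped by condition, so the network is penalised whenever the conditions remain distinguishable in that layer.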
So that's the idea for modeling perturbations, and I think there is still a whole bunch of open questions there. That was for one data set plus perturbation; how can we do latent space modeling more robustly, at a larger scale? For this, I want to highlight an idea that I'm sure many of you have seen, which has really been put forward by the Human Cell Atlas and followed by LifeTime and so on. This is the idea of building a periodic table of elements, but now for cells: these atlases essentially integrate a whole bunch of data sets, for example for lung — we've been working with that a lot, and there are now more than 20 labs generating lung samples from multiple patients, including healthy controls. How do these things map on top of each other? We want to somehow see what's common across them — a very interesting latent space learning problem. And it's actually very useful; just to keep the medical flavor, we put out a paper, I think just last week, leveraging this lung atlas across 18 or so labs — each of those lines being a healthy subject, with different age and population structure — and looked at where the entry genes for SARS-CoV-2 were expressed, and we saw some associations. So you can essentially run the association studies you're used to from GWAS at the single-cell level, which I think is interesting. But in any case, the challenge for using those reference atlases at larger scale in the future is that it shouldn't be done as it currently is, where most labs start with the raw analysis of the data sets all over again, doing the annotation of the clusters and then seeing what the transitions are.
I think you want to just map onto the known atlas; you don't want to re-map the earth every time, you just want to navigate it — similar to genomics, where we don't completely reassemble the human genome again, we just map reads on top. So the challenge is how to use those reference atlases, and in particular — and this is something we observed in the study here — people often can't share the data sets, because it's patient data. How can we do it in a distributed fashion, and how can we make those maps accessible? The current state of the art really is: we share all those data sets, where possible. I think the future will be that we need to share the maps. That's the challenge we've been thinking about: we have a reference, and we have a bunch of queries that we want to map onto it. This paper from the Satija lab has essentially been anchoring those queries on a reference and trying to do reference assembly. We've been thinking about leveraging our autoencoder-type models to integrate data this way — essentially doing transfer learning, that's the whole idea. So what Mo did was: we start with a bunch of public references — that could be the lung atlas you saw — and we train our reference model. A typical model we've looked at is a conditional variational autoencoder, with the conditions labeling the different data sets, as you saw before, and here we have our latent space that all the analysis is built upon. We call this single-cell architectural surgery, the scArches package, and we actually have implementations for many different types of latent space learning, such as the trVAE that you saw before, scVI, and so on.
But in any case, once you have trained this type of reference, we don't share the original data sets; we just upload those maps to a model repository. As a user you won't actually be working on those studies; you just download the map, and then transfer learning adapts the model to your situation — you could transfer across studies, potentially even across organs and species, or across disease states, which I think is one of the key applications. So what you would be doing: you have your query data set, one or potentially multiple, and you add your query labels to the model. Because we encode the condition as additional input nodes, we need to change the structure of the network — essentially just add input nodes. That's why we call it architectural surgery: you add additional nodes, we call those adapters, and you can think about adding them at different network depths. It turns out from experiments that it's often enough to just modify and retrain those and keep the rest fixed. So you have a very simple, much simplified training scheme, just freezing the rest; and particularly if you do this in the first layer of the encoder, you can keep adding adapters and you get a nice commutative situation — you can make this independent of the order of training, and you can share those adapters as well. To show that this works: here we have a pancreas data set across five different conditions — actually five different experimental techniques, and also different labs — and you see that if you don't do any reference training, all those cells that you would think should cluster by cell type instead cluster by study.
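Architectural surgery itself is a small operation. Here is a deliberately simplified, linearised sketch of the idea — append fresh "adapter" columns for the query condition labels and mask the gradient so only they are trained; all names, shapes, and the mock gradient step are illustrative, not the scArches code.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_latent, n_ref_conds = 50, 8, 3

# "Pre-trained" reference encoder weights: one block for genes, one column
# per reference study/condition label (a conditional model, linearised).
W_genes = 0.1 * rng.normal(size=(n_latent, n_genes))
W_conds = 0.1 * rng.normal(size=(n_latent, n_ref_conds))

def surgery(W_conds, n_new):
    """Append freshly initialised adapter columns for new query conditions."""
    adapters = np.zeros((W_conds.shape[0], n_new))
    return np.concatenate([W_conds, adapters], axis=1)

W_conds_new = surgery(W_conds, n_new=1)
frozen_before = (W_genes.copy(), W_conds_new[:, :n_ref_conds].copy())

# One (mock) gradient step that only touches the adapter columns:
# the mask freezes everything pre-trained.
grad = rng.normal(size=W_conds_new.shape)
mask = np.zeros(W_conds_new.shape)
mask[:, n_ref_conds:] = 1.0
W_conds_new -= 0.01 * grad * mask
```

In the real nonlinear model the same masking logic applies per layer: the reference weights stay frozen, and only the adapter parameters for the query labels receive updates.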
Then you do the integrated pre-training, this reference learning, and you see that now the studies are nicely marginalized out and the cells cluster by cell type, as you would expect — and pancreas is one of those key examples where you would expect this. Now we do one round of adding our adapters, our single-cell architectural surgery, and we see that the new study we added — using a technique called CEL-Seq2 — nicely integrates with the rest, except for this one group here. And that's on purpose, because in this older reference we had left out the alpha cells, just to show that the data integration can also map new cell types to a new location. We did another round, adding Smart-seq2 data, and again it nicely integrated with the rest and also mapped the alpha cells to the right position. That gave us encouragement that this type of transfer learning approach works in this situation. Why did we choose those specific weights, and why didn't we retrain the whole network? We actually compared fine-tuning on different layers: just fine-tuning the query study labels, maybe the whole input layer, or all weights — which is obviously much more in terms of complexity; adapting everything is orders of magnitude slower than scArches. It turns out that this regularization was really helping: if we compare both how homogeneous the batches are and how nicely cell types are reconstructed, you see some variation across those techniques, but on average our intermediate method gave robust overall performance versus the other two schemes. And compared to other full latent space integration techniques, this transfer learning seems to be doing a rather robust job, maybe because it is also strongly regularized.
Let me show you what you can do in the disease situation — I said we want to see how this can be applied in a medical setting. In this case we tried to build a healthy reference atlas and query it with disease data, here COVID-19 patients: a group had just published COVID-19 lung lavage expression profiles, and we built a reference atlas from bone marrow, lung and PBMCs, because we thought those profiles could be present in the data. We integrated this atlas and then mapped our query data — this lavage fluid data — on top of it; these plots just visualize what's going on. This is where our disease samples were integrated, and we knew that in this particular disease setting we have patients with a severe as well as a moderate disease trajectory; we saw that they were actually being split up, the severe mapping here, the moderate here. What's going on? We can now look at the atlas: because it is annotated, we know what is where. So we look into our map of the atlas, and we find that this particular area seems to be macrophages. We can zoom in on what's happening in those macrophages and map the individual genes to see where they're expressed. It turns out that this area here, where the moderate patients map, is actually tissue-resident macrophages — not the common ones, and not expressed elsewhere, just in this one area — but already expressing CXCL10, which is one of those markers somewhat associated with the more severe situation. This was something learned in the original paper, but here you get it very quickly with the transfer. And here is a situation where you see where the CD8-positive T cells map.
If you zoom in on where they're expressed, you see that all the lavage data actually aggregates together with the lung data, and here the activation marker is expressed, as you would expect. So you can very quickly build an overall picture of what happens in the disease situation, and you don't need to relearn clusters or anything from that data — you can just map on top. I think it's a rather nice application for this type of situation. I said that things are starting to become multimodal — I'll speak more about that in a minute — but what you can already do with scArches is try to map additional modalities in the data. What people can do at the moment is measure not only RNA but also protein in the same cell — in this case not full proteomics, but antibody-tagged panels, so a number of proteins, say up to 100 or so. And if we have such multi-view data, we can learn a joint reference across RNA and this technique — it's called CITE-seq and measures proteins alongside RNA — using an embedding method, totalVI, from Nir's lab, which we scArches-ified and thereby enabled for transfer learning. So we learn a joint reference across those two modalities, and when we add a query data set that is RNA-only, we can map it on top of our reference — and then not only integrate it, but also impute the proteins in the query. That could be really useful in the end: if we want to build an atlas, we can't expect every lab to run all the different techniques that are available, but if you have an atlas that is annotated — this set of gene expression, maybe some information about additional proteins, maybe ATAC markers — that could lead to additional insights. Of course you then also need robust uncertainty estimates for all of these things.
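As a much simpler stand-in for the totalVI-based imputation just described, here is a sketch of the core idea: once reference and query share a latent space, proteins for RNA-only query cells can be imputed from their nearest reference neighbours. All data, names, and the linear protein model are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Reference cells: a joint latent embedding plus measured surface proteins.
z_ref = rng.normal(size=(300, 8))
B = rng.normal(size=(2, 10))            # proteins depend on two latent factors
proteins_ref = z_ref[:, :2] @ B

# Query cells: RNA-only, mapped into the same latent space (no proteins).
z_query = z_ref[:50] + 0.01 * rng.normal(size=(50, 8))

def impute_protein(z_query, z_ref, proteins_ref, k=5):
    """Impute proteins for RNA-only cells as the mean over k nearest
    reference neighbours in the shared latent space."""
    d = np.linalg.norm(z_query[:, None, :] - z_ref[None, :, :], axis=-1)
    nbrs = np.argsort(d, axis=1)[:, :k]
    return proteins_ref[nbrs].mean(axis=1)

proteins_imputed = impute_protein(z_query, z_ref, proteins_ref)
```

A probabilistic model like totalVI additionally gives the uncertainty estimates mentioned above, which a plain kNN average does not.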
All right, so that's it for learning across studies and transfer learning. One problem that we have in the community is that not everyone has a machine learning background: a lot of people are mostly running a biological lab and maybe have a couple of mathematicians helping them with analysis. So how can we make those neural networks easily available to those more applied labs — and, conversely, how can we make the data sets being generated available to the machine learning community, which maybe doesn't so much want to do all the preprocessing itself? So we've been thinking about a collection of reference data sets and learned models. The picture is: imagine gene expression space as a big manifold, some type of sphere maybe. One area on that sphere could be one data set, maybe one organ; another area is another organ; they might overlap, and there might be things in between — essentially a manifold learning idea. Can we somehow leverage those embeddings, which would be neural networks, and share them efficiently? The typical process at the moment, as I said before, is: we get the counts, preprocess, do PCA, clustering, visualization, annotation — and that takes time, introduces analyst biases, and is not efficient. What we're trying to do with this model zoo — we're setting up sfaira, which is Greek for sphere, so a fitting name — is replace all of this by pre-learned embeddings. If you have a robust embedding, all of these steps essentially reduce to mapping on top of the atlas, seeing where your data localizes, and exploring from there. That's the idea. This is work by David and Leander, which we put out, I think, around the end of last year, trying to build up this single-cell model zoo.
It works in the following way: we have a bunch of data loaders. We don't actually share or locally store the data, because there are all kinds of issues around that; we just have data loaders that pull the data in and put it into a centralized data structure, which you can also augment with your own data that you might want to share. We have an API for the actual modeling, both for the embedding as well as for cell type prediction, so unsupervised and supervised tasks, and you can formulate your own additional tasks if you're interested. And then we have an API for sharing those model parameters in cloud storage, which you can then download. That might be exactly the thing you're looking for, and it's packaged easily enough that you don't actually need to do much yourself; you can just use it from the start. Let me just show you how big it is. If you're interested in leveraging some of this, you just go directly to it, and you have access to more than 2,040 datasets across 55 organs, more than 3 million cells. And that's for human, with the same thing for mouse. You can very efficiently mini-batch and stream potentially all human cells. There are some technicalities that need addressing, both with respect to a shared cell type annotation and with respect to how to make these things scale, but I think it could be a nice data structure to get you into the field. You can then compare number of cells versus cell types across different organs, and do some kind of meta-analysis on that. But you can also very easily access these things and work with them. Let me show you some ideas for sfaira, just to show that it is actually useful for us: when we're now, for example, thinking about how to put additional interesting priors onto our latent spaces to make them more robust and interpretable, we can actually leverage these things.
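The data-loader idea can be sketched roughly like this; a hypothetical registry and streaming loop, not sfaira's actual API:

```python
import itertools

# Hypothetical registry: each "loader" streams cells from one study
# instead of storing the matrices centrally.
LOADERS = {
    "lung_studyA": lambda: ({"gene_x": i, "organ": "lung"} for i in range(5)),
    "liver_studyB": lambda: ({"gene_x": i * 10, "organ": "liver"} for i in range(4)),
}

def stream_minibatches(loader_names, batch_size):
    """Chain per-study loaders and yield fixed-size mini-batches of cells."""
    cells = itertools.chain.from_iterable(LOADERS[n]() for n in loader_names)
    while True:
        batch = list(itertools.islice(cells, batch_size))
        if not batch:
            return
        yield batch

batches = list(stream_minibatches(["lung_studyA", "liver_studyB"], batch_size=4))
print([len(b) for b in batches])  # 9 cells -> batches of 4, 4, 1
```

Because loaders are lazy generators, nothing is held in memory beyond the current mini-batch, which is what makes streaming across millions of cells feasible.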
And one thing, as I said before: how can we make those embeddings faster? In this case, for example, we take a manual embedding for PBMCs and then just take the pre-trained sfaira embedding; you see that you get the same cell types mapped together, and you get the same annotation that you would do by hand. You get it for free from sfaira, so it's useful. But you can also look at this from a method developer's point of view. In our case, we've been thinking about how to augment priors in those VAEs; for example, here in spleen the VAE actually didn't converge well, so we've been thinking about inverse autoregressive flows as richer posteriors. And in this case, adding this type of prior gave us an interesting additional constraint that was actually pulling cell types closer together. We came up with some measures for this, and because we have sfaira, we could quickly run this across a whole bunch of different tissues and compare. This was, I think, at an ICML workshop last year, where we saw that with this setup we could quite quickly evaluate where these things work well. So if you want to have your own type of model evaluated, I think this could turn out to be very useful. So much for the transcriptome-based part. In the remaining 12 minutes or so that I have, I want to speak briefly about what we do for other modalities and then about the spatial setting. This is from Nature Methods, one or two years ago, where they called multimodal omics on the single-cell level method of the year. Taking things one step further, you can actually interrogate not only the transcript expression but also all kinds of epigenomic modifications, as well as surface proteins and perturbations, on the single-cell level, potentially even in combination. This multimodal omics has been turning into a big thing.
And there are all kinds of cool techniques around that, which you might not be able to set up quickly in your own lab should you have one, but parts of them are actually already commercially available. I mentioned CITE-seq before, which couples RNA plus protein, but there are also things such as CROP-seq, actually one of the most popular ones, and of course Perturb-seq; these pooled CRISPR-screen techniques are super popular and actually efficient for really seeing how things happen under perturbation, which I think is a very interesting question you can take on. So cool datasets are coming around; how could you think about modeling them? We wrote a perspective about that about two years ago. Essentially, you may just want some type of integration of the cell-cell distances, for clustering say, so it's just a multi-view learning problem; I think something Rahul Satija's lab recently released, the weighted nearest neighbors approach, a sort of locally adaptive nearest-neighbor weighting, does exactly this. You can look at interaction networks across those modalities: if you want to understand how gene regulation works, it's kind of a good idea to know whether chromatin is open or not, and if you have accessibility in the same single cell, you can associate the two. You can build prediction models: in this setting, for example, if you have some type of cell-level or sample-level phenotype, with a pooling layer you can actually do prediction across views. And lastly we've also been thinking about factorization models; I think MOFA from Oliver Stegle's group together with John Marioni has been one of the most popular multimodal analysis approaches there. Let me show you one example.
In this case we've been looking at T cell specificity. There's an assay from 10x Genomics, their immune profiling kit or something like that, where you don't only measure the RNA counts of a cell in this immune repertoire, but you also have labeled surface proteins, as with CITE-seq: essentially an antibody attaches to the surface protein and carries an oligo label that you read out in the sequencing, so you get those protein counts in addition. But then you also have the T cell receptor sequence, and you can also read out a particular tag for the antigen that binds that particular T cell receptor, an MHC multimer that also carries a barcode. So you have the sequences of the chains of the T cell receptor, as well as the antigens that bind to them. And that raises an interesting question you can ask, with all kinds of implications in immunology: if you understood which antigens bind which particular TCR sequence, in which particular cellular context, you'd be able to address all kinds of vaccination problems. So we try to predict which antigens bind; we call this TcellMatch, something that David published a year ago, the idea being simply to predict TCR specificity. There are different types of models you can cook up: essentially you have a sequence-encoding model, a stack of sequence layers; you put this together with a set of other covariates, which could for example be the RNA state, where we go through the latent-space modeling I showed you before; and then as output you have just a binarized binding call. You can also try to predict binding for unseen antigens. I won't speak about that, because it actually turned out that this space of antigens, even though we had up to half a million or so cells, was too small to do that properly.
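The model shape described here, a sequence encoder concatenated with covariates feeding a binarized binding call, might be sketched schematically as follows. This is a toy forward pass with made-up features and weights, not the TcellMatch architecture itself; in the real model the sequence encoder is a learned layer stack rather than a fixed composition vector:

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def encode_tcr(cdr3):
    """Encode a CDR3 sequence as its amino-acid composition, a crude
    stand-in for a learned sequence-encoder stack."""
    return [cdr3.count(a) / len(cdr3) for a in ALPHABET]

def binding_score(cdr3, covariates, weights, bias=0.0):
    """Concatenate sequence features with covariates (e.g. RNA latent
    state, surface-protein counts) and apply a linear layer plus sigmoid,
    yielding a probability that can be binarized into a binding call."""
    x = encode_tcr(cdr3) + covariates
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# illustrative fixed weights: 20 sequence features + 2 covariates
weights = [0.5] * 20 + [1.0, -1.0]
p = binding_score("CASSLGTDTQYF", [0.8, 0.1], weights)
print(0.0 < p < 1.0)  # a probability, thresholded to a binding call
```

Adding covariate slots to the input vector is how donor, count, or protein information enters the prediction, which is exactly the comparison made next.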
If we take this type of model and look at how it performs with different additional covariates, we first see that we do actually get some predictive signal. It then turns out that if we add particular covariates such as donor and counts, as well as the surface proteins, we get closer and closer to a situation where you could at least talk about being able to predict antigen binding for this particular dataset, given the set of sequences. And I would say some sequence-encoding layer types actually outperform others, so you can actually learn something from TCRs. So we can think about what to do with those. One of the things I'm actually quite hopeful about, also in the TCR situation, is the following idea. There was also a benchmark; I'll just skip that, it's in the paper. We've been coming up with this thing called reverse phenotyping, and I think that might be a fun area for T cell biology to continue leveraging the TCR sequence. The idea is that you have a healthy situation as well as a perturbed, stimulated situation, maybe the immune system being turned on in COVID, which was one of the key examples we could do because that's the type of sample we currently have. In that case you have this whole set of T cells that could be effector, memory, and so on, and then you have a bunch of T cells that actually react to the stimulus. You want to pick those out, right, and you want to know where those cells actually come from. And because we have this TCR information, we do know where they come from.
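The reverse phenotyping idea, picking out reacting T cells via their TCR, can be sketched as a simple label transfer between stimulated and unstimulated cells. Field names here are hypothetical; a clonotype is identified just by its TCR sequence:

```python
def reverse_phenotype(stimulated, unstimulated):
    """Transfer the 'reacting' label from stimulated cells back to
    unstimulated cells that share the same TCR clonotype."""
    reacting_clonotypes = {
        cell["tcr"] for cell in stimulated if cell["reacts"]
    }
    return [
        {**cell, "predicted_reactive": cell["tcr"] in reacting_clonotypes}
        for cell in unstimulated
    ]

# toy data: two stimulated clonotypes, one of which reacts
stimulated = [
    {"tcr": "CASSLG", "reacts": True},
    {"tcr": "CASRPD", "reacts": False},
]
unstimulated = [{"tcr": "CASSLG"}, {"tcr": "CASRPD"}, {"tcr": "CAWSVG"}]
print([c["predicted_reactive"] for c in reverse_phenotype(stimulated, unstimulated)])
# [True, False, False]
```

A clonotype never seen in the stimulated sample simply gets no label, which is why the third cell above stays negative.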
So you have, for example, in this case maybe a clonotype here that maps to this situation, and then you might find one clonotype there, a clonotype being basically a shared TCR sequence, and we can then map them back to the unstimulated situation. Then, when we have new unstimulated data, we can actually predict which cells would react to the stimulus, which is actually quite interesting. We call this reverse phenotyping because we go back once we have the stimulus, and you can characterize cells on data where we've done this, for example in this COVID situation. All right, and now I'm coming to the third part. Six minutes, so I'm trying to hurry, but maybe not overwhelm too much. Let's talk about space. All of this stuff essentially happens when we take cells completely out of their context, which is kind of not how a multicellular organism works, right? And in fact single-cell biology has been spatial from the get-go; it's just that it hasn't been high-throughput. Early single-cell biology was just a bunch of genes at a time, but many cells; this is actually from a nice review, from Alex van Oudenaarden I believe, which is especially beautiful. In any case, this is what we're talking about: this is where single-cell transcriptomics has been going, and nowadays we actually have techniques that can also add in space again. This is another Method of the Year, the most recent one, spatially resolved transcriptomics; these techniques have been big in the news and are becoming commercially available, with all kinds of different approaches to leverage the spatial resolution. This is a much richer setting than a normal multiplexed-stain microscopy image, because now you have, on top of
either each cell, or maybe just each spot aggregating across a bunch of cells, a full transcriptome, potentially also other omics layers. What do you do with this type of information, how do you leverage it? I think that's an interesting question, and we've been putting together a small perspective trying to lay out what types of axes of variation you could actually analyze. Clearly, if you have the spatial dimension, you can think about different spatial ranges, which will depend on the technique you use. You can add covariates across different samples. You can potentially also leverage, because you have a matched microscopy image, something about cellular morphology, for example what is strongly changing, and maybe also the localization of a marker in the nucleus or not; that's subcellular variation. You could add multimodality, and then of course also genetics questions. And with all of this you can then look for correlations across these scales, and you can see which cells communicate, and so on. I think that's one of the areas where graph convolutions on the spatially resolved data could become very relevant. In any case, a bunch of tools have been proposed, but some of them are disjoint and might not be coupled together; we think we might be at the stage we were at with transcriptomics five years ago, so there's a need for shared data structures and tools to make things work. So we've been thinking about how to make things like trajectory analysis, which we've been doing for a long time in RNA, also relevant and possible in the spatial setting. For that, Giovanni and Hannah, but essentially the whole team in a hackathon that we were lucky to host last summer, came up with a framework for spatial omics.
Essentially, we took our Scanpy and combined it with the spatially resolved setting. I don't know if you've got the joke already, but Scanpy sounds like scampi; we tried to stick to our seafood naming and came up with Squidpy, which stands for spatial quantification of molecular data in Python, or something like that. The idea is that we take spatial omics data across different types of techniques, and we have a data structure to store not only the spatial proximity graph but also local image information, i.e. morphology, in a systematic fashion. On top there are all kinds of analyses, such as the spatial analyses I've been talking about, spatial statistics, but also visualization, and it just interfaces with your favorite downstream ecosystem. So I think this could be really useful if you want to dive into spatial analysis; there's a whole bunch of vignettes that come with the tool, so you can try it out quickly. I think it could also be very important and relevant for the labs that there is visualization built on top of it. We're not that good at making GUIs ourselves, but in this case we've just been feeding this into a very efficient viewer called napari, where you can actually analyze your own data, change layer settings, pick your favorite genes, see where they're expressed, and combine that with the spatial setting and so on. So I think that could be quite interesting to explore: the annotation that you've been doing on your command line, brought in the end into space with a really strong visualization, something the team contributed. What do we do, though, with this type of analysis? One thing we've been looking into with the Liberali lab was adding spatial variation across organoids; there's a whole additional story there, but I will skip that.
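The spatial proximity graph stored in such a data structure can be sketched, in its simplest radius-based form, like this. It is an illustrative stand-in for the neighbor-graph construction, not Squidpy's actual implementation:

```python
import math
from itertools import combinations

def spatial_neighbors(coords, radius):
    """Build an undirected spatial proximity graph: connect every pair
    of cells/spots closer than `radius` (adjacency as an edge set)."""
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= radius:
            edges.add((i, j))
    return edges

# three spots on a slide: 0 and 1 are adjacent, 2 is far away
coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
print(spatial_neighbors(coords, radius=1.5))  # {(0, 1)}
```

This graph is exactly what downstream spatial statistics and graph convolutions operate on; the radius (or a k-nearest-neighbor rule) encodes the spatial range one chooses to analyze.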
I'll just say that there's again a fun latent-space learning problem in there, and we can actually leverage those single-cell variation tools also for spatial variation; but I want to stick to my time slot, so let me summarize. We've talked about latent space learning: in the simplest setting of a linear latent space, you're back to learning key directions in the data. For nonlinear latent space learning, I think autoencoders, going beyond the more common cell-cell neighbor graphs, are a robust model, and we've shown this for perturbations as well as for transfer learning. We've been setting up this model zoo with pre-learned embeddings that you can leverage for your own data, and we've been setting up a spatial Scanpy, Squidpy. On this next slide, I want to say that we've been building up, at Helmholtz Munich, not only single-cell analysis but a computational, computer-science type of department and area. We've been going beyond just my institute, integrating with the ELLIS Munich unit that I co-coordinate, but also setting up a grad school with neighboring Helmholtz centers and universities, working in the fields of health and the translational parts of drug discovery and genomics. There's a whole bunch of open positions; if someone is interested, please approach me or just go to the website. This is my lab in happier times, when we were still allowed to go around while socially distanced, and this is how Bavaria looks all the time. Please come by if you're interested, once we're allowed to travel again, and thank you very much. We thank you, Fabian, for this excellent overview; it was very exciting to listen to you. Thank you very much. Now it's time for questions. Maybe we start within the network: please raise your hand or make yourself heard on Zoom. Are there any questions? Maybe I'll start to break the ice.
So Fabian, you're an expert both in machine learning and in this field of dynamical systems, mechanistic modeling. Personally I don't distinguish between the two all that much, but some people very much do, and see different merits of the two in this field, and I want to ask for your general opinion on this. Where is machine learning most powerful in this single-cell world, where is mechanistic modeling, and where are the combinations of both? I've seen you've worked on combinations of both also. So can you give us an overview, or your opinion, on that? That's a hard one. I think it's one of the really big things we could potentially be thinking about. On the one side, if you just put on the machine learning hat, I guess you could argue: do we even need to come up with the most mechanistic model? If you have enough data, maybe we're already able to predict how a cell behaves under all perturbations, and then we're kind of done, right, at least for prediction. If we know how it behaves under perturbation, we might not need the mechanistic model. I'm kind of convinced that we're not even close to that, because the perturbation space is much too large to do without these fantastic priors that we have in biology. And as you rightfully say, yes, I also like to call it mechanistic modeling; I guess bioinformatics people maybe call it network inference and so on. But how can we add that on top? It's, I think, not super straightforward. We can always add these things as priors, but there's always a trade-off, and I'm always struggling; I don't have a super straight answer on when to do what and when not. I haven't really spoken about it here, but there are ideas of really seeing where these transcripts are going, RNA velocity, which I think is a very powerful one; that's actually where dynamical systems could help.
Also, if you actually add time information; and you've been looking a lot at time information in clinical settings. In our case, cells differentiate over time, and sometimes time information comes in as real time or pseudotime. How do you add that? In that case, potentially, a more dynamical-systems approach as an additional prior, a Gaussian process or whatever, could be quite relevant. So I think it's a very interesting sweet spot, and I don't think we should separate the two completely. But you're right, maybe one should differentiate between the two areas. Thank you. Now there are questions: Lucas first and then Giovanni, two ESR students in our network. Lucas, please. Hello, thanks a lot for such an interesting talk; it's always a pleasure to join your seminars. I have two questions, the first one maybe a bit boring. I was wondering whether it's possible to also use deep learning for the preprocessing of omics data, or whether you have any work on that. Good question; we've been discussing that quite a bit. It's somewhat similar, I guess, to the previous question: for preprocessing we might have a bit of mechanistic understanding, there's this library-size effect or something that we want to normalize away, but maybe we can actually learn a better type of preprocessing that really removes artifacts we don't even think about. If we look at what's happening in computer vision, I think we do learn that it's a good idea at some point to have end-to-end models, if you have large enough data. The single-cell setting may really be different from bulk here: bulk RNA has too strong a lab effect, because you don't see much variation within one setting, whereas in single-cell you see so many things within one lab setting that you could potentially learn those better preprocessings. And there are still papers coming out nowadays where people discuss what type of
mechanistic, if you want, preprocessing is best. So I think at some point adding this is interesting, and people have started doing it in part: the deep count autoencoder, DCA, in a sense adds a noise function adapted to the situation, so you could call that learned preprocessing, and scVI does something similar. I think this will become more popular, and sfaira in a sense tries to package part of that, but it hasn't yet been fully worked out; that's still to come. Okay, thanks a lot. Maybe Giovanni wants to jump in, and then I can ask you. Giovanni is next. Okay, so I have a question; maybe it's a little bit of a big question, but it's something I've been thinking about a bit. Some of the first models that you described at the beginning are applicable to dissociated single-cell data, and the part that you described at the end is more about structured data, so single-cell data within a spatial structure and so on. I've been thinking about the fact that in many drug studies what we have is not a structured set of cells but often cell cultures and so on, which is a collection of unstructured data. Do you think there's any interesting way to consider these data, other than just segmenting, for example, imaging into single-cell data and then feeding it to the initial models? Or can we really apply the latter part that you described, in which we have for example a spatial structure? Do you think there's any interesting way to study these, let's say, collections of unstructured single cells for the purposes of these models? It would be really interesting to help bridge the gap between actual applications and this type of, say, high-throughput sequencing studies and so on.
I mean, I would strongly agree that in many cases the simplest thing is still definitely not to do a spatial assay, because that has all kinds of issues and isn't so far along yet, so you often have this dissociated data. So what do you do with that? Early on, some people were thinking: is there maybe a clever way to find a grouping of my dissociated cells such that, if I put those together, I can predict some feature better? I briefly mentioned that this is essentially something like a multi-instance learning problem. Say your grouping variable is patients or something like that; or, and this actually also often happens, you take tissue from different locations within one subject and add that as a grouping. Then you can try to understand what is shared across those groups if you want to do some type of prediction. But maybe you can also, in an unsupervised fashion, see what type of grouping makes things reconstructable. I think it was in a Nature paper, novoSpaRc, from Nir Friedman's lab among others, where they aggregated cells by some type of similarity and could actually reconstruct some type of space. So there is a signal of, let's call it space, let's call it tissue, because in a sense cells always work in interacting tissue. This tissue signal is encoded in the dissociated gene expression profiles and could potentially be pulled out. And I think this point is further strengthened by the fact that whenever we do prediction models and we take spatial data, we don't gain that much.
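The multi-instance grouping idea just described can be sketched as a simple score-then-pool step; field names and the scoring function here are hypothetical:

```python
def pooled_prediction(cells, cell_score, group_key="patient"):
    """Multi-instance learning sketch: score individual dissociated
    cells, then mean-pool the scores per group (e.g. patient) to get
    one sample-level prediction."""
    groups = {}
    for cell in cells:
        groups.setdefault(cell[group_key], []).append(cell_score(cell))
    return {g: sum(s) / len(s) for g, s in groups.items()}

# toy marker-based cell score, pooled over two patients
cells = [
    {"patient": "P1", "marker": 0.9},
    {"patient": "P1", "marker": 0.7},
    {"patient": "P2", "marker": 0.1},
]
print(pooled_prediction(cells, lambda c: c["marker"]))
```

In a learned model the per-cell score and the pooling would be trained end-to-end against the sample label, but the structure, many instances feeding one bag-level prediction, is the same.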
So if we just take the gene expression, or whatever type of profile we have, often also protein, and predict some disease phenotype, then adding the spatial information, for example with a graph convolution but also other types of local encodings, doesn't give a huge jump in predictability. That means the spatial context is already ingrained in the expression profile, so I think finding it is possible; it's obviously hard if you do it naively, so there need to be some clever ways to do it. Thank you for the answer and also for the talk, I really enjoyed it. Thank you. I've got two questions in writing here, which I will read out now at the end. One is: what would be, or what will be, the first medical, so clinical, application of single-cell analysis? What's happening now, and has been for a long time of course, is that single-cell analysis is by now maybe the standard technique in immunology. Whenever you want to quantify a patient's disease situation, you do at least a FACS-type assay, so that's already clinical practice. That's of course much lower-dimensional; you don't really look at the whole expression profile, you just label a bunch of antibodies. I mean, essentially, whenever a lab generates blood cell counts for us, that's a single-cell application; it's a cell count. Pathology is by definition single-cell as well: at least if you crank up the magnification, it's a single-cell technique, right? When do transcriptome-wide, these really large-scale assays, become clinically available? Well, in some tumor settings they're starting to do that, and similarly of course in a lot of clinical studies.
But for standard clinical practice it might not be necessary to really read out the full transcriptome; more targeted assays may do. Once we do latent space learning, of course, we can then do feature engineering, right, because knowing which features are really crucial for explaining the disease points in that direction; we actually have one paper on latent space disentanglement that goes in this direction. Once you have that, you can combine it with a targeted assay. So I think it's similar to genomics: maybe, because it's getting cheap, whole-genome sequencing will at some point be available for all of us, while for the transcriptome it might be more targeted assays down the road. But I don't have that big an overview of all the applications around; I know a bit more about the clinical studies at the moment. To conclude, a question here from Slido, more technical and about your repositories: you mentioned that it would be possible to share model variants with different numbers of input nodes. Are these disease-specific or rather tissue-specific? What we actually have in the model repo at the moment is not much of anything disease-related; it's really different localizations, so for example a different organ, a different organism, a different lab, maybe also how it's been sampled or connected to some other assay. One of the big visions of the Human Cell Atlas and others is of course that you can add disease information on top of that, but those atlases are not that large yet. So this is something you could do; you have a bunch of output units, and that way you could actually do it, but at the moment it's mostly the healthy situation. Thank you, Fabian, for this answer and for this outstanding talk, which was the perfect start into our symposium today. I really enjoyed it, and I thank you on behalf of the entire network and also the YouTube audience for this presentation. Thanks a lot. Thanks for the kind words. You're welcome.