And now we move on to the next award, which is the Early Career Award, presented by Laurent Gatto.
Good morning. My name is Laurent Gatto. I'm a Professor of Informatics at UCLouvain, Belgium, and I'm here to represent the jury for the SIB Early Career Award. Needless to say, we received many excellent submissions, but I'm happy to confirm that all the members of the jury were absolutely aligned in awarding this year's SIB Early Career Award to Maria Brbić. So before giving her the floor and letting her talk about her research, I'll say a few words of introduction. Maria is currently an Assistant Professor at EPFL. Well, very recently: she started at the end of last year, the end of 2022. But she has already left a significant mark on the field of machine learning and biomedicine. Armed with a PhD in Computer Science that she earned in 2019 at the University of Zagreb, she has rapidly ascended through the academic ranks, becoming an expert in machine learning with profound applications in biology. Now, you all know that you need to do good research to become a good researcher. To become an outstanding researcher, that is, somebody who has significant impact, it's important that your research and expertise also benefit those around you. And indeed, Maria's journey has been one that is dedicated to research, but also to mentorship and a commitment to diversity, equity and inclusion. One of Maria's defining traits is her remarkable ability to bridge the gap between machine learning and biomedicine. She has developed remarkable deep learning methods, and these methods have notably been applied in the context of cell atlases, contributing to our understanding of complex biological systems. Her involvement in single-cell mapping consortia, including the Human Cell Atlas, mouse, fly and lemur projects, highlights her dedication to translating advanced techniques into real-world impact.
She has supported students from underrepresented backgrounds and advocated for women in computer science, thereby fostering a more inclusive bioinformatics community. In summary, Maria's journey from an early-career scientist to an assistant professor has been nothing short of exceptional. Her expertise in machine learning, her applications in biomedicine, her contributions to diversity, equity and inclusion, and her commitment to open science have collectively earned her this SIB Early Career Award. Well done.
Thank you for the wonderful introduction, and thank you to the Swiss Institute of Bioinformatics for this award, but also in general for supporting young researchers. So let me just see first how this works. Okay. The title of my talk today is machine learning for cell type discovery, and I will be telling you about my work in this space over the last few years. With the advances in single-cell sequencing technologies, we nowadays have a way to measure gene expression in hundreds of thousands of individual cells. So for the first time, we have technologies to chart the complete cellular makeup of the human body and understand what goes wrong at the cellular level in disease states. And not only can we measure the transcriptome at the individual cell level, but nowadays we also have technology to do multimodal measurements, as well as to look at how cells are spatially organized using the very latest spatially resolved transcriptomics technologies. And besides this single-cell genomics revolution, we are also witnessing a machine learning revolution. This revolution started many years ago, first with the paradigm of supervised learning. In this paradigm, we would feed in an abundance of labeled data and then train our machine learning model on the target task of interest. However, in many machine learning applications, we don't have this abundance of labeled data.
So we've recently seen a surge of self-supervised learning methods, where we feed in an abundance of unlabeled data, design some pretext task, and then fine-tune our model on small amounts of labeled data. And very recently we've been witnessing completely new capabilities of machine learning methods using the very latest generative AI technologies. But how can we use AI and machine learning to make new biomedical discoveries? There are a number of challenges when we start applying machine learning to single-cell genomics data sets. The first challenge is that in biology, and in single-cell genomics in particular, we are facing a collection of small and different data sets that are generated under different experimental conditions. For example, these data sets are generated from different tissues, species, or individuals that suffer from different disease states. The second challenge is that we want to use machine learning to discover new phenomena. In biology we are really interested in discovering novel and previously uncharacterized phenomena. For example, we want to discover novel cell types, or we want to be able to find new disease variants. So what biology needs is machine learning methods that can generalize across different tasks, different domains, and different modalities. This means, for example, generalizing across different tissues, different species, or individuals that suffer from different disease states. And secondly, biology also needs methods that can help us discover new phenomena, for example, discovering cancer cell states among normal cell states. In this talk I will present some of the methods that I've been developing to address these challenges. First I will talk about how we can learn over a collection of small and different data sets, and I will present the MARS method that we've developed to do so.
Cell type characterization is a fundamental computational problem in single-cell genomics. The reason why single-cell genomics brought such a revolution is that our cells are very heterogeneous and can have different roles in a tissue, and cell type identification helps us understand the different roles that individual cells can have. And why is this an important problem? Because if we can characterize all our cells, this promises to have a profound impact on biology and medicine, ranging, for example, from disease diagnosis and prognosis to treatment monitoring, diagnostics and so on. In particular, in this problem we are given a cells-by-genes matrix, so we measure gene expression in individual cells, and we are interested in assigning cells to different cell types. So given the gene expression profiles of cells, the goal is to assign cells to different cell types. Currently the community has been putting a huge effort into annotating these individual single-cell data sets. And as I said earlier, the data sets originate from really heterogeneous and different experimental conditions. Motivated by the effort that the community has already put into annotating these data sets, we developed the MARS method. The key idea in MARS is to leverage a collection of these heterogeneous experiments to help us generalize to a new experiment. So we want to design a learning algorithm that takes a set of previously annotated data sets and a new unannotated data set, and then leverages the previously annotated data sets to learn a better representation for the new data set. And these data sets can be very heterogeneous; for example, they may originate from different tissues.
In particular, the setting we consider is that we are given a set of previously annotated data sets, in which cells are assigned to their cell types, and a new unannotated data set without any annotations. Our goal is to annotate this new data set, meaning to assign its cells to different cell types. In the MARS method, the key idea is to take this set of previously annotated data sets and a novel unannotated data set, and then learn to project them jointly into a low-dimensional embedding space in which we want to force cells to group according to their cell types. To achieve that, we learn a nonlinear mapping function F, using deep neural networks, that projects the high-dimensional input vector to the low-dimensional MARS embedding space. At the same time we also learn a set of cell type landmarks, which are represented as these pentagons here in different colors. We learn two types of cell type landmarks: landmarks for the previously annotated data, shown in these different colors, and landmarks for the new data set, shown in gray because we don't know their identities. So in MARS we learn the embedding function F such that cells from the same cell type are embedded close together, while cells from different cell types are embedded far apart, and we design a specialized objective function that allows us to do so. In the interest of time I won't go into details, but I'm happy to discuss later. And the question we ask is: can MARS really generalize across different tissues? On the mouse aging cell atlas we evaluate the model on cross-tissue generalization, so we leave one tissue out, use all the other tissues for training, and then ask the method to annotate this new data set. This means, for example, that we use heart, lung and pancreas tissues and then ask the model to generalize, that is, to separate the different cell types in the brain.
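As a rough illustration of the landmark idea described above, the following NumPy sketch scores an embedding by how close annotated cells sit to the landmark of their own cell type and how far the landmarks are from each other, and assigns unannotated cells to their nearest landmark. The talk does not give the actual objective, so the function names, the squared Euclidean distance, and the weight `tau` are all assumptions made for illustration; this is not the MARS implementation.

```python
import numpy as np


def landmark_loss(embeddings, labels, landmarks, tau=1.0):
    """Hypothetical landmark objective (an assumed form, not the MARS code).

    embeddings: (n_cells, d) low-dimensional embeddings, i.e. F(x) for each cell
    labels:     (n_cells,) integer cell-type index of each annotated cell
    landmarks:  (n_types, d) one landmark per cell type (at least two types)
    tau:        assumed weight on the between-landmark term
    """
    # Mean squared distance from each cell to the landmark of its own type.
    own = landmarks[labels]
    intra = np.mean(np.sum((embeddings - own) ** 2, axis=1))

    # Mean squared distance between distinct landmarks (larger is better).
    diff = landmarks[:, None, :] - landmarks[None, :, :]
    pair = np.sum(diff ** 2, axis=2)
    n = landmarks.shape[0]
    inter = pair.sum() / (n * (n - 1))

    # Lower is better: tight cell types, well-separated landmarks.
    return intra - tau * inter


def assign_to_landmarks(embeddings, landmarks):
    """Assign each cell to the index of its nearest landmark."""
    diff = embeddings[:, None, :] - landmarks[None, :, :]
    return np.argmin(np.sum(diff ** 2, axis=2), axis=1)
```

In a real model the embeddings would come from a trained neural network and the landmarks would be updated during training; here both are plain arrays so that the two terms of the score are easy to inspect.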
As I said, we apply MARS to the mouse aging cell atlas. This is an example of the data set, where each dot is a different cell and cells are colored based on the tissue they come from. We compare to standard cell type annotation methods and show that MARS achieves significant improvements in performance. We next collaborated with researchers at Stanford Neuroscience and Stanford Biology, and we used MARS to annotate fly brain data. In their lab they sequenced the fly brain across different developmental stages, and we used MARS to discover and decode neuronal types across fly development. We also contributed to annotating the Fly Cell Atlas, the first single-cell transcriptomic map of the whole fly, and we recently put together the Aging Fly Cell Atlas in collaboration with Hongjie Li's, Stephen Quake's and Liqun Luo's labs, where we developed aging clock models that allow us to predict age based on the transcriptome and find the genes that are associated with aging. And finally we also asked whether we can develop interpretable algorithms that not only have a strong generalization ability but can also tell us why the predictions are made. In particular we consider this in the context of the few-shot learning problem, where we want to generalize given only a few labeled examples per class, and we developed the COMET method. If you ask it why a cell is assigned to B cell, COMET would say, for example, that it is because the most important functional terms are B cell activation and B cell differentiation. And we show that this method achieves state-of-the-art performance not only on single-cell genomics data sets but also on standard image classification and text classification benchmarks in the machine learning community. So how can we now discover these new, completely unknown phenomena?
In this respect I will tell you a bit more about the STELLAR method that we've developed. Imagine I give you annotated reference labeled data. In the standard supervised machine learning paradigm, you would train a classifier to distinguish between cell types in this tissue. But now imagine that in the test data in the wild, besides the known and seen cell types, you also see some novel cell types that you were not able to annotate in your reference labeled data, for example some disease-specific cell types and cell states. Ideally, what we want to achieve in this scenario is that for these green cells, which I have not seen previously, the method says that this is novel cell type 1, and for these purple cells it says that this is novel cell type 2, and that they are different from each other. So I want to have this ability to discover novel cell types and also identify the existing cell types that we have previously seen in the reference data set. And in particular we wanted to solve this problem... so the slides don't work anymore. I cannot change to the next slide. Okay, great, thank you. So in particular we set out to solve this problem for spatially resolved single-cell data. We've heard a lot this morning about spatially resolved single-cell data. Just in brief: in spatially resolved single-cell data, besides molecular information we also have the additional spatial context of the cell. But typically we have a much smaller number of genes or proteins measured compared to single-cell RNA sequencing technologies. So ideally we want a method that can capture both the spatial organization of the cells and their molecular features.
In STELLAR the idea is that I give you the reference labeled tissue, and your goal is then to transfer annotations from this reference labeled tissue to an unannotated tissue, but along the way also discover novel cell types if they show up in the unannotated tissue. We start by constructing a graph based on the neighborhood, that is, on the similarity of cells in space. Then we use graph convolutional neural networks to learn embeddings that capture both the molecular features and the spatial organization of the cells. And finally we design an objective function that allows us both to assign cells to previously known cell types that we have seen in the reference data and to discover novel cell types. Just very briefly: the objective function consists of three main terms, but the key idea lies in the supervised objective, where we introduce an uncertainty-based adaptive margin that controls the learning speed of known classes compared to novel classes and allows us to discover novel classes. We tested STELLAR on different data sets, but here we show an example on CODEX data. As the reference labeled data we take CODEX tonsil data from a healthy donor, and as our target unlabeled data we take the CODEX data set from Barrett's esophagus cancer, in which three novel cell types appear that we were not able to label in our reference labeled data. And this is an actual visualization of the data sets: this is the reference labeled data coming from healthy tonsil, CODEX data from Garry Nolan's lab, and this is the unlabeled data coming from esophageal cancer, for which we don't have any annotations.
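The first two steps described above, building a graph from the spatial neighborhood of cells and passing molecular features through a graph convolution, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the k-nearest-neighbor rule, the mean-aggregation layer, and the function names are choices made here for clarity and are not taken from the STELLAR code.

```python
import numpy as np


def knn_adjacency(coords, k=2):
    """Symmetric k-nearest-neighbor adjacency from spatial cell coordinates.

    coords: (n_cells, 2) array of x/y positions; k is an assumed neighbor count.
    """
    n = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))
    np.fill_diagonal(dist, np.inf)            # a cell is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)             # make the graph undirected


def graph_conv(features, adj, weight):
    """One graph-convolution layer: average over self and neighbors, linear map, ReLU.

    features: (n_cells, n_markers) molecular measurements per cell
    weight:   (n_markers, n_hidden) layer weights (learned in a real model)
    """
    a_hat = adj + np.eye(adj.shape[0])                    # add self-loops
    a_norm = a_hat / a_hat.sum(axis=1, keepdims=True)     # row-normalize
    return np.maximum(a_norm @ features @ weight, 0.0)
```

Each output row mixes a cell's own measurements with those of its spatial neighbors, which is how the embedding can reflect both molecular features and spatial organization.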
The cells here are actually visualized based on the ground-truth annotations, but when we apply the algorithm we don't have any annotations. These are the results once we apply STELLAR, and these are the major novel cell types that STELLAR discovered. We show that the cell types that STELLAR produces agree almost perfectly with the ground-truth annotations by the human experts. And how much does incorporating spatial information really help? We replace the graph convolutional neural networks with fully connected neural networks to understand the real benefits of the STELLAR objective function, and we see an 11% improvement from incorporating spatial information into this method. Next, as part of a collaboration with the Michael Snyder and Garry Nolan labs, led by John Hickey, we applied STELLAR to annotate more than 2.6 million spatially resolved cells from the colon and small bowel. We show that the STELLAR predictions are correct, and we use it to speed up the annotation process, which would otherwise require hundreds of hours of manual annotation work. This idea also motivated us to develop a novel machine learning paradigm that we call open-world semi-supervised learning. We wanted to say that this is not only an important problem in biology but an important problem in the machine learning community in general. Maybe in your labeled data you've seen elephants and octopuses, and then in your unlabeled data there are some novel animal species, like orcas and zebras, that you were not able to label before. Ideally you want an algorithm that can assign these images to the elephants and octopuses in your reference labeled data but also automatically discover the novel classes that you were not able to see in your reference labeled data.
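To make the open-world setting a little more concrete, here is one way the uncertainty-based adaptive margin mentioned earlier in the talk could look in code. The talk does not state the formula, so the form below is an assumption: uncertainty is taken as the average of one minus the top predicted probability over the unlabeled examples, and the true-class logit of each labeled example is shifted by `lam * uncertainty`. While the model is still uncertain on unlabeled data, the labeled loss is eased, which slows learning of the known classes relative to the novel ones. The names `mean_uncertainty`, `adaptive_margin_cross_entropy`, and `lam` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def mean_uncertainty(unlabeled_logits):
    """Average of (1 - max class probability) over the unlabeled examples."""
    probs = softmax(unlabeled_logits)
    return float(np.mean(1.0 - probs.max(axis=1)))


def adaptive_margin_cross_entropy(labeled_logits, labels, uncertainty, lam=1.0):
    """Cross-entropy on labeled data with an uncertainty-dependent margin.

    Assumed form: the true-class logit is raised by lam * uncertainty, so high
    uncertainty makes the known classes easier to satisfy and shrinks their loss.
    """
    adjusted = np.array(labeled_logits, dtype=float)   # work on a copy
    idx = np.arange(len(labels))
    adjusted[idx, labels] += lam * uncertainty
    probs = softmax(adjusted)
    return float(-np.mean(np.log(probs[idx, labels])))
```

As the model becomes more confident on the unlabeled data, the uncertainty falls toward zero and the loss approaches the ordinary cross-entropy.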
And we show that none of the existing machine learning methods can really solve this problem, by applying them not only to single-cell image data sets but also to standard image classification benchmarks such as ImageNet and CIFAR-100. Finally, I would like to thank all the amazing collaborators. This work was not done by me alone, and it would not have been possible without many amazing people with whom I had a great experience working, mostly from Stanford. I'm also now running my own lab, Machine Learning for Biomedicine, at EPFL, where I have a really great group of students. And then this is a subtle method actually that I didn't have time to talk about today because the talk is very short, but I hope I gave you a good overview of the methods we've been developing. So thank you.
Questions for Maria, please. There was a lot of information there. Victor?
What is the one task where you have to use MARS, where you say just use it?
So I think there are two use cases of MARS, and I also haven't talked about the second one. I would say that if you have an abundance of previously annotated labeled data, that is beneficial, just because you can really leverage these data sets to get better accuracy on your novel data set. But if you have just one data set, or a very small amount of labeled data, then the benefit is marginal, because you cannot really exploit these deep neural networks and the good representations that you obtain by having a lot of labeled data.