So with that, I am happy to hand over now to the first presentation, which will be for the PhD award.
So hello, everyone. It's my great pleasure today to present to you this year's best graduate paper award, which goes to Viktor Petukhov. But before he tells you what he did, I want to say a few words about why we chose him and what he's done. This year's theme for the award was machine learning algorithms for advancing spatial biology. And this is a very timely topic, I think. You probably agree, I hope. We even had a session. The last session was just about this. And there are many, many labs around the world working on new protocols and enhancing the current ones in order to get spatially resolved transcriptomics data. So there's a lot of data coming our way, and all of it has to be analyzed in order to make sense of it. And this is where Viktor comes into the picture.
So Viktor is an expert in his field. Already during his master's, which happened, I guess, at the St. Petersburg Polytechnic University, I should mention, he worked in the closely related field of single-cell transcriptomics. And then, during his PhD, which happened at the University of Copenhagen, he worked on different things. But among them, he published a paper with the title "Cell Segmentation in Imaging-Based Spatial Transcriptomics". And this is what he got the award for today. In his paper, he describes a method for identifying the boundaries of individual cells. I'm not going to tell you any details about that. He is going to do a much better job at that than I could. But I'd like to say a few words about it, at least. What's really nice about his approach is that it's a novel machine learning technique that uses Bayesian mixture models and Markov random fields in order to get a good understanding of the data and get rid of all the noise that is in there.
But also, the approach is applicable to a wide variety of different protocols. So it's not tailored to one specific experiment or technology. It's applicable to many different kinds. His code is freely available. And we heard that in the previous session; I think the term was transparent tools. And on top of that, it's not only freely available, but there are also Jupyter notebooks. There's a demo to reproduce all the data, which you can follow if you want to use the tool on your own and get into the topic. And of course, there's extensive benchmarking where he showed that his tool clearly outperforms existing tools. So he's making a significant advancement in the field of spatial biology. And with that, I'd like to hand over to Viktor.
Thank you. So where's the request? So you already got the name. And this was done under the supervision of Peter Kharchenko, at that moment at Harvard Medical School, now at Altos Labs, and the co-supervision of Konstantin Khodosevich from the University of Copenhagen, and in close collaboration with Jeff Moffitt and with the help of very fine, nice collaborators, Ruslan, Rosalind, and Paolo.
So first, what data do we have? It all started in 2018 with the publication of the first big data set on spatial transcriptomics by Sten Linnarsson's lab, which basically measured these 33 genes across many cells. You may see the data here. Each dot is a molecule, and it's colored by its gene identity. If you look at the raw data, it's x, y coordinates and gene. And if you look at the clouds, they are likely cells: you have a lot of molecules together, so it's probably a cell. But what's missing are cell boundaries. So technically, you cannot say which molecules come from which cell. To fix that, the authors used DAPI stains. But they said themselves that the quality is not so great. Usually, segmenting DAPI is not so easy.
So the segmentation quality, as published in their paper, was also suboptimal. Currently, there are many more protocols; those of you who attended Jin's lecture also heard how they are applied. The most popular of them are MERFISH, CosMx, and 10x Xenium. And these days, they mostly all provide membrane stains to segment cell boundaries. But I will show you later that it's not enough.
So I showed you what the data looks like. Now, what can we do with that? First of all, of course, all the same stuff which we can do with single-cell RNA sequencing, like clustering, annotation, DE analysis. But on top of that, we can study tissue organization, particularly spatial processes, for example, cell migration. A lot of labs are currently working on cell-cell interactions, because we finally have information about which cells sit together, and we expect them to communicate, so we can try to infer what those communications are. And we can go from the single-cell level to the sub-cellular level, so we can study processes inside cells.
However, the problem is that these all really require high-quality segmentation. For example, as with single-cell RNA-seq, we have the same problem of doublets, but many more of them, when molecules from different cell types get segmented into the same cell. So you might see here all these bridges. Second, if you study tissue organization, then some cell types in, say, mouse brain would show spatial structure, like excitatory neurons, and some, like inhibitory neurons, would not. But because of improper segmentation, your inhibitory neurons would have parts of excitatory neurons, so they would also appear to have spatial structure, but that would be false. The same goes for cell-cell interactions. If your cells sit close to each other and you just estimate their correlation, it's likely to be present just because of improper segmentation and not because of cell communication.
And most papers I've read on that don't account for it and actually capture improper segmentation instead of interactions. So please, here, be really careful. Finally, if you are calling sub-cellular structures, like in the example here, if you have this yellow part getting into a blue cell, your algorithm would detect it as a sub-cellular structure, while it's just parts of another cell.
So I told you about membrane stains. However one uses them, they're really cool, but they are not enough. First of all, because, honestly, you don't always have them. Maybe the experiment failed. Maybe the protocol does not allow it. It's hard to get them. Second, it's still hard to do the segmentation, so membrane segmentation would have some errors. Third, sometimes you get a nice membrane signal, and other times you just don't get it. And it's biased toward particular cell types, so you're likely to miss some cell types, and it would be a systematic error. And finally, as with all biological experiments, you would get just mismatched data for whatever reason: you have stains but you don't have molecules, or you have molecules but you don't have stains. A lot.
So with all that motivation, we have our method, and I'll briefly explain it, but really, for more details, please look at the paper. And I'll try to focus on the parts which we did not emphasize enough, in my opinion. So it all starts with building a graph, a triangulation graph where each molecule is a node. That way we encode spatial information as a graph, going from continuous space to discrete space, which is easier to deal with. And then we try, for each node, to assign some latent label. It may be a cell or a cell type, and we make a statistical model for that and infer those labels with expectation maximization or its modifications. So we proposed a general model, which basically works for different kinds of labeling problems for molecules, and we applied it in multiple cases.
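As a rough, self-contained illustration of assigning a latent label to every molecule with EM: the sketch below assumes only two labels, "signal" and "background", and assumes that the distance from a molecule to its k-th nearest neighbor is a good enough proxy for local density. The function names, the one-dimensional normal mixture, and the brute-force neighbor search are all choices made for this illustration. It is not the model from the paper, which works on the triangulation graph.

```python
import math


def kth_neighbor_distance(points, k):
    """Distance from each (x, y) point to its k-th nearest neighbor (brute force)."""
    result = []
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        result.append(dists[k - 1])
    return result


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_labels(values, n_iter=50, min_var=1e-6):
    """Fit a two-component 1D normal mixture with EM.

    Returns, for each value, the posterior probability of the component
    with the larger mean (here read as "background": sparse, large distances).
    """
    n = len(values)
    overall = sum(values) / n
    start_var = max(sum((v - overall) ** 2 for v in values) / n, min_var)
    mean = [min(values), max(values)]  # component 0 = dense, 1 = sparse
    var = [start_var, start_var]
    weight = [0.5, 0.5]
    resp = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability of the sparse component per value.
        resp = []
        for v in values:
            p0 = weight[0] * normal_pdf(v, mean[0], var[0])
            p1 = weight[1] * normal_pdf(v, mean[1], var[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean and variance of each component.
        for c in (0, 1):
            r = [p if c == 1 else 1.0 - p for p in resp]
            n_c = max(sum(r), 1e-12)
            mean[c] = sum(ri * v for ri, v in zip(r, values)) / n_c
            var[c] = max(
                sum(ri * (v - mean[c]) ** 2 for ri, v in zip(r, values)) / n_c,
                min_var,
            )
            weight[c] = n_c / n
    return resp


# Toy data: a dense 5x5 clump plus a sparse row of isolated molecules.
dense = [(x, y) for x in range(5) for y in range(5)]
sparse = [(30 + 15 * i, 0) for i in range(6)]
distances = kth_neighbor_distance(dense + sparse, k=3)
background_prob = em_two_labels(distances)
is_signal = [p < 0.5 for p in background_prob]
```

In this toy setup the clump and the isolated molecules are far apart in neighbor distance, so the two components separate cleanly. Real data would need a spatial smoothing term, which is the role the graph plays in the description above.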
So we started with the filtration of the background, basically two labels, signal versus background. We assume that dense regions are signal and sparse ones are background, we do magic, and we remove the background. So that really helps with the data quality; it removes a lot of noise. Second, we showed that we can infer cell type information without segmentation. We modeled it as a mixture of categorical distributions where each cell type has a gene vector, and we can show that we properly capture brain structure, like layers, inhibitory neurons, excitatory neurons. Finally, we do cell segmentation, and it's the hardest and the most important problem. So we expanded our model here quite a bit and got something like this; for more details I suggest you look at the paper. But the idea was to use all the data we have. Of course, the only required data is the molecule x, y, and gene, but we also have background probabilities from the first stage to filter noise. We may infer cell type or use cell type info; it may be from our algorithm, it may be from another, you can just plug in additional information. And compartment info, like nucleus versus cytoplasm, helps a lot because they have different expression patterns. If you have membranes, you can provide a prior cell assignment, and that of course also helps. And the model is easy to extend to other sources of helpful information.
So I cannot take it back, whatever. No, I can. Yep. Sorry. So the thing is, that's the main meat of the paper, how to segment cells, but there is another part, which in my opinion is mostly underappreciated by others: how to do analysis segmentation-free, without segmenting cells, which you wouldn't expect in a paper on cell segmentation. So the idea is: let's take all our molecules, like here, where again each dot is a molecule colored by gene, then take a molecule, take, for example, its 10 nearest neighbors, and call it a pseudo-cell.
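The pseudo-cell idea can be sketched in a few lines. This is a minimal illustration, assuming molecules are given as `(x, y, gene)` tuples and that a pseudo-cell consists of the molecule itself plus its k nearest neighbors; the function name `pseudo_cell_vectors` and the brute-force neighbor search are hypothetical choices for the example, not the published implementation.

```python
import math


def pseudo_cell_vectors(molecules, genes, k=10):
    """For each molecule, count genes among itself and its k nearest neighbors.

    molecules: list of (x, y, gene) tuples
    genes:     ordered list of gene names defining the vector layout
    Returns one count vector (list of ints) per molecule.
    """
    gene_index = {g: i for i, g in enumerate(genes)}
    vectors = []
    for i, (xi, yi, gi) in enumerate(molecules):
        # All other molecules, sorted by distance to molecule i.
        others = sorted(
            (math.hypot(xi - xj, yi - yj), gj)
            for j, (xj, yj, gj) in enumerate(molecules)
            if j != i
        )
        counts = [0] * len(genes)
        counts[gene_index[gi]] += 1          # the molecule itself
        for _, gj in others[:k]:             # its k nearest neighbors
            counts[gene_index[gj]] += 1
        vectors.append(counts)
    return vectors


# Toy example: two small clumps with different gene compositions.
molecules = [
    (0, 0, "A"), (1, 0, "A"), (0, 1, "B"), (1, 1, "A"),
    (50, 50, "B"), (51, 50, "B"), (50, 51, "B"), (51, 51, "A"),
]
vectors = pseudo_cell_vectors(molecules, genes=["A", "B"], k=3)
```

Each resulting vector plays the role of a cell's expression profile, so the usual clustering tools can be applied to it. For real data sizes a spatial index would replace the quadratic neighbor search used here.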
Then, as with any cell, we can extract its gene vector, like we do with single-cell RNA-seq, and we can do all the analysis on those vectors. They are surprisingly robust: we can find cell types and expression patterns without any segmentation, it just works. Of course, there would be some noise, but it still works well enough. On top of that, we can take those gene vectors, embed them into 3D space, and convert that to a color, as any 3D space can be converted. So we get our molecules colored by their local composition. We call these neighborhood composition vectors, or NCVs, and they show you different cell types with different colors: the more similar the compositions are, the more similar the colors. So it allows you to have these nice structure plots, with no segmentation, where you already see layers, you already see cell types, or some crazy UMAPs, not very helpful, but pretty. It also allows you to visually inspect cell boundaries: you can see the composition and whether the boundary actually separates compositions.
So that's the extent of what we covered in the paper, but currently we are working a lot on extending it. So what else can we do with these neighborhood composition vectors? First of all, let's look at sub-cellular structure. So here's gut data, and you may see the gut wall and nuclei here, in light green, and they get polarized toward the boundary, the outer part. So just by neighborhood composition, you can clearly see the structure inside the cell, and you can do it formally by taking your molecules, clustering them, and showing, within each cell type, what substructures are there. Then we can try to interpret NCV dimensions. For example, here's the full NCV, but let's look just at the first dimension here, colored by the value of that dimension.
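The step of turning a 3D embedding into colors can be illustrated as follows. This sketch assumes the 3D coordinates have already been computed by some embedding method and simply rescales each axis to the 0 to 255 range of one RGB channel; the function name `embedding_to_colors` and the min-max scaling are assumptions made for the example, and a perceptual color space might be preferable in practice.

```python
def embedding_to_colors(points):
    """Map 3D embedding coordinates to hex RGB colors.

    Each of the three dimensions is min-max scaled independently and used
    as one color channel, so nearby points get similar colors.
    """
    dims = list(zip(*points))                 # three tuples, one per axis
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    colors = []
    for point in points:
        channels = []
        for value, lo, hi in zip(point, lows, highs):
            # A constant axis carries no information; map it to 0.
            scaled = (value - lo) / (hi - lo) if hi > lo else 0.0
            channels.append(round(scaled * 255))
        colors.append("#{:02x}{:02x}{:02x}".format(*channels))
    return colors


colors = embedding_to_colors([(0, 0, 0), (10, 10, 10), (2, 4, 6)])
# → ['#000000', '#ffffff', '#336699']
```

Looking at a single dimension, as described above, amounts to plotting one of these channels on its own.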
So you may see that blue actually highlights nuclei. We didn't do anything, we didn't use any stains, but now we know which molecules are inside nuclei and which are not, and that might be helpful, for example, for doing RNA velocity on spatial data. We can also check another dimension, and it separates a certain cell type here, the brown one, and that's also helpful to know. So, for example, if you want to study sub-cellular structure, you may remove the dimensions which are responsible for cell types, and vice versa.
Next, currently these NCVs are estimated only from molecule data, but these days we have a lot of staining data as well, and the two are not very well integrated with each other. So what we can do is take each molecule, estimate the intensity of each stain at this molecule's position, get a vector of intensities, and finally do the same NCV trick. So we can basically show different markers, stained separately, on our molecule data: here this corresponds to this blue cell in PanCK, and here it corresponds to CD45. So we finally merge together the staining data and the molecule data.
With all that, we may take NCVs, we may filter for the signal which we care about, and we may add the staining signal. We can also use it in the segmentation model, because currently the model represents expression as a vector whose size is linear in the number of genes. It scales poorly, so it really gets slower with the number of genes, and it also doesn't account for gene correlation structure. But if you replace this vector with, for example, a five-dimensional NCV and model it as a normal mixture, that would always be five-dimensional, like 200 times fewer dimensions, so it saves memory and also accounts for correlation structure, with all the benefits of what we described above. Finally, of course, as with every pipeline, we constantly work on improving it, like usability and computational performance.
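The dimensionality argument above, replacing a per-gene vector with a low-dimensional NCV modeled as a normal mixture, can be made concrete with a quick parameter count per mixture component. The panel size $G = 1000$ is a hypothetical value, chosen only because it matches the "200 times fewer dimensions" figure for $d = 5$; the full-covariance normal is likewise an assumption.

```latex
% Free parameters per mixture component.
% Categorical over G genes (probabilities sum to one):
\[
  p_{\text{cat}} = G - 1
\]
% Normal in d dimensions with full covariance (mean + symmetric covariance):
\[
  p_{\text{norm}} = d + \frac{d(d+1)}{2}
\]
% With the assumed G = 1000 and d = 5:
\[
  p_{\text{cat}} = 999, \qquad
  p_{\text{norm}} = 5 + \frac{5 \cdot 6}{2} = 20, \qquad
  \frac{G}{d} = 200 .
\]
```

The covariance term is what lets the normal mixture capture correlation structure, which an independent per-gene vector cannot do.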
We are in contact with 10x; they asked us to tweak the pipeline because they're going to recommend it for 10x Xenium, so we constantly improve the performance and usability. With that, I'd like to thank you all. If you want to read it, the publication is here, the code is here, everything is packaged, so just go read it, and get in touch with me if you have any questions about spatial analysis or single-cell analysis. I have a bunch of projects which I support, so I'm looking forward to meeting as many of you as I can manage at this conference. And thank you to the collaborators and to the organizers.
Thank you very much, Viktor. Questions? If you haven't figured out how to use the mic yet, you just have to press and hold, and wave your hand first and...
Thanks a lot. Do you think modeling RNA diffusion is relevant for this model, or do you have an intuition whether it affects how segmentation is done if molecules are diffusing?
Do you mean diffusion within a cell?
Exactly. It can also, I guess, depend on the technology, within the spot or between spots, or I don't know...
Well, we don't work with spots, to clarify, so we only work with single-molecule protocols. But if you're speaking about diffusion within a cell, and if there are some strong processes which make the distribution of RNA uneven, then modeling it would help, because the better our model is, the easier it is to separate cells. We don't do it at the moment, and it doesn't seem to be any kind of bottleneck, but if you see such strong processes, it would be nice to add.
Any other questions for Viktor? Go ahead.
Yes, very nice. Can you clarify, when you're talking about these NCV dimensions, like the first dimension captured cytoplasm versus nucleus and the second dimension cell type: did you do some sort of PCA? How did you get these different dimensions? I missed that.
Yeah, well, it's kind of PCA.
So, in the first approximation, you may say you take your gene vectors, you run PCA, and you run a UMAP embedding on that. Linear embeddings don't work well in this case, but something like UMAP is really nice. So yeah, the normal transformations, but to five dimensions or three dimensions, not to two like we do with single-cell data.
Thank you.
Any other questions? Go ahead.
Yeah, it works. Thanks. So one question: you said that you think that the second part of the paper, or whatever that part of the paper was, is not appreciated well enough, right?
Yeah.
So in there you're saying that you don't need to build boundaries to say what a cell is. You just take the molecules which are the closest neighbors and you sort of look at the expression profiles of those, right?
Mostly. I don't mean that you don't need segmentation to understand where the cell is, but you don't need it to understand the tissue composition in terms of both expression and cell types. So if you want cells, exactly cells, well, you need cells.
Yeah, of course. Okay, thanks.
Any further questions? Jochen, do you have a question? No? No. Okay, just to follow up on that question: as you showed, there are cases where you don't need segmentation and it still works really well. So what are the cases where that's actually the more appropriate approach?
Well, I wouldn't say it's necessarily more appropriate. It's just maybe easier and less prone to errors, and first of all it's a first glance at your data. You take your samples, you run NCVs, you see clusters, you see expression patterns, and that gives you a feeling for the quality of the data set and for the composition of cell types. So I think that's the first part, which I always do in my analysis.
Right, thank you very much. I don't see any other hands up. So, yes, your hand again, go for it. Yeah, we have 54 seconds.
Right, thanks.
So another question then: you showed the stain images, right, and you showed that you could pinpoint some of the stains from your model. So does that mean that, if there is enough staining data to train on, a model like yours could make staining no longer required, in a way? So you just run spatial transcriptomics, you use a model like yours, and one day you would be able to say, oh, this spot is actually staining for this specific marker, because we have trained on who knows how many images.
Yeah, well, that's a very good question. Theoretically, yes, mostly. The question is how much of the information is actually encoded in our genes, and normally we try to stain for something which we don't know from the RNA data. So yeah, I would expect that when we have a lot of genes and high-quality models, a lot of stains would be redundant, but then we would just measure other kinds of stains, for proteins or organelles.
Thank you.
Thanks a lot. Please, a round of applause to congratulate Viktor.