Welcome to this session on genes and genomes in the time of next-generation sequencing. My name is Olivier Delaneau. I'm an assistant professor at the Department of Computational Biology of the University of Lausanne. And today I will be co-chairing this session with the help of Natasha.
Hi, I'm Natasha Glover. I'm a postdoc in the lab of Christophe Dessimoz, also at the University of Lausanne.
So a few remarks before we start. So far we've got 96 attendees, which is good, 97. We've also got four talks of 12 minutes each, plus three minutes for questions. I kindly ask the presenters to respect the 12 minutes, as we are a bit tight on schedule. And for the attendees: during the presentations, you will have an opportunity to ask questions, and this is highly encouraged. You can use the Q&A functionality. There's a button at the bottom of your screen; you just have to press it, and it opens a chat in which you can write your questions. Please do not forget to mention the name of the person to whom you address the question, and make the question as clear and concise as possible. If for some reason we cannot ask all questions, we're going to transfer the remaining ones to the next session, the speaker session that is happening right after this set of presentations. So please join this session as well, make it dynamic. So Natasha, if you could present the first speaker.
Okay, so the first speaker that we have in this genes and genomes session is Abdullah Kahraman. He is the leader of the clinical bioinformatics team at the University Hospital Zurich. And he's going to talk about the pathogenic impact of transcript isoform switching in 1,209 cancer samples, covering 27 cancer types, using an isoform-specific interaction network. So without further ado, Abdullah, I'll let you share your screen.
So you can hear me and you can see my screen, right?
Yes, it's not in presentation mode.
Okay, there we go.
Alright, so thanks everyone for joining. So 110 people, that's a lot. Really fantastic to have so many attendees here. So yeah, I'm Abdullah Kahraman. I'm currently heading the clinical bioinformatics team in the Institute of Pathology and Molecular Pathology. And today I would like to talk about a project that I started in the lab of Christian von Mering at the University of Zurich. In this project, we tried to assess the pathogenic impact of alternative splicing disruptions in 27 different cancer types. And for that, we developed a new isoform-specific interaction network. Because it's about alternative splicing, I assume that everyone basically knows what alternative splicing is, so I will skip an introduction on splicing. But I would like to mention an interesting fact about alternative splicing, namely that under normal conditions, despite all the different isoforms being encoded in the genome, when you look at the expression, it's usually just one single isoform that is most dominantly expressed, and all the other isoforms have very low expression. This also persists across tissue types: even if you look into different tissue types, you'll always see the same major isoforms. There are certain exceptions, like in the brain, but the tendency is really that there's just one functional major isoform per gene. We can use this information to assess the impact of alternative splicing in cancer. How can we do that? We could simply identify the most dominant transcripts in normal conditions and compare them to the most dominant transcripts in cancer conditions. And here, for example, we could require that the most dominant transcript has at least twice as much expression as the second most dominant transcript. With this condition, you can now look for switches.
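As an illustration of the dominance rule just described, here is a minimal Python sketch. The function names, the toy expression values and the way ties and single-transcript genes are handled are assumptions made for this example; the talk does not describe the actual pipeline at this level of detail.

```python
from typing import Dict, Optional


def most_dominant_transcript(expression: Dict[str, float],
                             min_fold: float = 2.0) -> Optional[str]:
    """Return the transcript expressed at least `min_fold` times higher
    than the runner-up, or None if no transcript is that dominant."""
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] <= 0:
        return None  # gene not expressed at all
    if len(ranked) == 1:
        return ranked[0][0]  # assumption: a lone expressed transcript counts as dominant
    (top_id, top_expr), (_, second_expr) = ranked[0], ranked[1]
    return top_id if top_expr >= min_fold * second_expr else None


def is_switch(normal: Dict[str, float], cancer: Dict[str, float],
              min_fold: float = 2.0) -> bool:
    """True if both conditions have a dominant transcript and they differ."""
    n = most_dominant_transcript(normal, min_fold)
    c = most_dominant_transcript(cancer, min_fold)
    return n is not None and c is not None and n != c


# Hypothetical expression values for one gene with two isoforms.
normal_sample = {"green": 30.0, "blue": 5.0}
cancer_sample = {"green": 4.0, "blue": 20.0}
print(most_dominant_transcript(normal_sample), "->",
      most_dominant_transcript(cancer_sample),
      "| switch:", is_switch(normal_sample, cancer_sample))
```

A gene whose top two isoforms are closer than twofold (for example 10 versus 6) has no dominant transcript under this rule and therefore cannot contribute a switch.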
So those are cases where in normal conditions you would have the green transcript expressed most dominantly, and in cancer you now have the blue transcript expressed most dominantly. In our analysis we were interested in identifying and understanding those most dominant transcript switches. But we had one additional requirement, namely that the blue transcript on the right is never the most dominant transcript in any of the normal samples to which we compare our cancer samples. If this condition is also met, we call those most dominant transcripts cancer-specific most dominant transcripts, or CMDTs, simply because they were really very specific to that cancer sample and were not observed at all under normal conditions. Now, obviously, the first question is: how frequently do we see those cancer-specific most dominant transcripts? In order to analyze this, we joined the Pan-Cancer Analysis of Whole Genomes project, which was initiated by the International Cancer Genome Consortium. And this project was really fantastic, because we had lots of data available, and also quite a bit in advance. The major papers of this project were published in February, a couple of months ago, but we used the same data set. So what type of data did we have available? First of all, we had 2800 whole genome sequences available, so mutational information over the entire genome. And important for our project was that among those 2800 samples, there were 1200 samples with RNA-seq expression data. The RNA-seq expression data we could now use to assess which transcripts are most dominantly expressed. And if you split the 1200 cancer samples with expression data into different cancer types, you can see that we're covering more or less all primary tissue types.
So we had samples from brain cancers, heart cancers, lung cancers, kidney cancers, prostate, and so on. There was just one caveat, but this is true for most of the consortium projects: the number of samples that we had available for the different cancer types was imbalanced. We had cancer types with a very high number of samples, like kidney renal cell carcinomas, but we also had cancer types where we had just a few samples, like cervix adenocarcinomas. But nevertheless, it was a fantastic resource on which we could now perform our analysis. First, obviously, was the question of how many most dominant transcripts could actually be identified. The number that we identified in our case was that about 70% of the genes have a transcript that is expressed at least twice as much as the second most commonly expressed transcript. And if you now look at the cancer-specific most dominant transcripts, we can see that in 7100 genes, which is about 36% of the genes that we have in the genome, we could identify cancer-specific most dominant transcripts. In total, these were 122,000 CMDTs. Here at the bottom, I'm showing you the distribution of the number of CMDTs across the various samples, separated by the different cancer types and ordered by the median number of CMDTs. It's a lot of data, but I would just like to point out certain things that are very interesting. So cancers of the [inaudible], I mean, in the data set that we had available at least, didn't have any alternative splicing disruptions. This was interesting. The second and third interesting points were on melanoma samples. Melanoma is usually one of the highly mutated cancer types, really by far one of the most highly mutated. But here, you see very few alterations.
Brain cancers are also interesting, because brain tissue is known to express different isoforms in contrast to other tissue types; the brain usually expresses different isoforms from the same gene. But here, we also didn't see many alterations in alternative splicing between normal conditions and the brain cancer conditions. Lung cancers were interesting as well, and the nice thing with our data set is that we had subtype information. So we had different subtypes of the cancers. Here, for example, we could compare the CMDTs between adenocarcinomas of the lung and squamous cell carcinomas. Squamous cell carcinomas are those cancers that are mostly induced by smoking, while adenocarcinomas can also frequently be found in non-smokers. So we have two different sets of carcinogens that are causing those cancerous growths. But nevertheless, the CMDTs did not really differ extensively. And if you look at the very right side, you see that the highest number of CMDTs, or the highest number of disruptions in alternative splicing, could be observed in female reproductive organs, led especially by uterus adenocarcinomas. From all those CMDTs, 10 really stood out. I'm listing those 10 here, because they were not only cancer-specific most dominant transcripts that were not observed in normal tissue, but they were also observed in all of the cancer samples that we analyzed. So really 100% of the samples that we analyzed had this isoform that was not found in normal conditions. And three in particular really stand out here: those three were found only in their own cancer type and not in any other cancer type.
So these are really cancer-type-specific most dominant transcripts, and those 10 are for us now potential diagnostic biomarkers, which we really want to follow up next, because these are transcripts that could in theory be indicative of cancer growth in a patient. Next, once we have the list of CMDTs, the question is: do they have any impact? What is the functional impact? Do they cause, for example, any pathogenicity? And I will come to that in a moment. For that reason, we developed an isoform-specific interaction network that was based on the STRING interaction network. STRING is the famous database that is developed by Christian von Mering and von Mering's lab. We basically combined the STRING database with a 3D domain interaction database. The idea here was very simple. A transcript consists of different exons. An exon can encode a binding domain, where the binding domain is important for certain protein-protein interactions, as I'm showing you here with the red domain. The red domain is important to interact with the STRING protein here. Now, if in cancer the most dominantly expressed transcript lacks this red domain, so it's spliced out, then that protein will not express the binding domain and will not be able to perform the interaction. So you will have an interaction loss. This isoform-specific interaction network was now used to assess how extensive the interaction losses are that are observed with those cancer-specific most dominant transcripts, and whether those interaction losses are somehow pathogenic. To put the number of interaction losses in context, first, again, the number: 7100 cancer-specific most dominant transcripts that we observed. From those 7000 CMDTs, unfortunately, we don't have high-quality domain-domain interaction information for most of them.
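The interaction-loss idea described above can be sketched in a few lines of Python. This is only an illustration: the data layout (a map from interaction partner to the domains that can mediate that interaction), the domain and partner names, and the rule that an interaction is lost when the isoform carries none of the mediating domains are assumptions made here, not the published implementation.

```python
from typing import Dict, List, Set


def lost_interactions(mediating_domains: Dict[str, Set[str]],
                      isoform_domains: Set[str]) -> List[str]:
    """Partners whose interaction is lost because the isoform carries
    none of the domains that can mediate that interaction."""
    return sorted(partner
                  for partner, domains in mediating_domains.items()
                  if not (domains & isoform_domains))


def classify_loss(mediating_domains: Dict[str, Set[str]],
                  isoform_domains: Set[str]) -> str:
    """Summarise an isoform as losing 'all', 'some' or 'none' of its interactions."""
    lost = lost_interactions(mediating_domains, isoform_domains)
    if not lost:
        return "none"
    return "all" if len(lost) == len(mediating_domains) else "some"


# Hypothetical gene with three interaction partners (names are placeholders).
mediating = {
    "partner_A": {"red"},
    "partner_B": {"blue"},
    "partner_C": {"red", "blue"},
}
normal_isoform = {"red", "blue"}   # dominant isoform in normal tissue
cancer_isoform = {"blue"}          # red domain spliced out in cancer

print(lost_interactions(mediating, normal_isoform))
print(lost_interactions(mediating, cancer_isoform), classify_loss(mediating, cancer_isoform))
```

Only genes for which such domain-level evidence exists can be evaluated this way, which is why the analysis is restricted to a subset of the CMDT genes.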
But for at least 2500 of those genes, we did have high-quality domain-domain interactions within the protein-protein interaction network. Those we analyzed further, and we could identify that for around 55% of CMDTs either all interactions are lost or some interactions are lost due to the expression of this cancer-specific most dominant transcript. As for the pathways and the molecular processes that are disrupted in those cases, we did an enrichment analysis using the GO molecular process terms. It's a heat map, but I would just like to draw your attention to the top six of those molecular processes that we find most often disrupted due to alternative splicing disruptions. And they basically fit alternative splicing. So here we see translational termination, which is affected in 22 of the 27 cancer types. You see translation initiation, transcription, nucleotide biosynthesis, protein-protein, RNA splicing as those processes that are most often disrupted by splicing-induced interaction disruptions. In terms of pathogenicity, now knowing which interactions are lost, the next question is: are these interactions somehow important for cancer? And here what we did is we used the interaction network to measure the distance between our...
Sorry, just one minute left, Abdullah.
Yes. So between the CMDTs, basically those genes, and known cancer-related genes. Those cancer genes came from COSMIC, and you can see most of the CMDTs are either interaction partners or interaction partners of interaction partners. So they're really closely related compared to a random selection of proteins. Unfortunately, mutations didn't correlate much with alternative splicing, but what we saw is an enrichment in CMDTs if you had a mutation in the spliceosome, as you can see here on the right.
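The network distance to known cancer genes mentioned above can be illustrated with a breadth-first search. This is a sketch under assumptions: the talk does not say which distance measure was used, so an unweighted shortest path is taken here, and the toy graph and gene names are invented for the example.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set


def distance_to_nearest(graph: Dict[str, Iterable[str]],
                        source: str,
                        targets: Set[str]) -> Optional[int]:
    """Number of interaction steps from `source` to the closest gene in
    `targets`, or None if no target is reachable."""
    if source in targets:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in seen:
                continue
            if neighbour in targets:
                return dist + 1
            seen.add(neighbour)
            queue.append((neighbour, dist + 1))
    return None


# Toy undirected interaction network: g1 - g2 - g3 - cancer_gene, plus an isolated g5.
toy_network = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "cancer_gene"],
    "cancer_gene": ["g3"],
    "g5": [],
}
known_cancer_genes = {"cancer_gene"}

for gene in ("g3", "g2", "g1", "g5"):
    print(gene, distance_to_nearest(toy_network, gene, known_cancer_genes))
```

In this picture, a distance of 1 corresponds to a direct interaction partner of a cancer gene and a distance of 2 to an interaction partner of an interaction partner; comparing against randomly drawn proteins gives the background the speaker refers to.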
So with this, the take-home messages: brain cancers and melanomas surprisingly have very few CMDTs, and female reproductive organs have high numbers of CMDTs. If you have cancer types from the same primary tissue, then you tend to have similar CMDTs. They induce interaction losses in 55% of cases. And very often the interaction partners are cancer related. And if you have a spliceosome mutation, you have a higher number of CMDTs. I would like to point out here the poster where tonight my PhD student will present further information, and also my GitHub account, where the code for the analysis and [inaudible] is available. And with this, I would like to thank Christian for the help and [inaudible] for the support, and [inaudible] from the PCAWG project for all the constant help. And with this, I'm done. And if you have one or two questions, I'm happy to take them.
Thank you, Abdullah, for your very nice presentation. A very impressive piece of work. We don't have much time for questions for this first talk, so I'm just going to ask a quick one that has been voted up, from Antonin Thiebaud: you found cancer-type-specific CMDTs, but on the opposite, did you also find any CMDT common to a majority of cancers that could maybe even be used as a global cancer marker?
So let me think. I mean, this analysis I didn't look at in particular, but we could check this. I don't remember seeing one that was really found in all of the cancer types. And it's also a balance of what you would like to have. The thing is, the interesting CMDTs are the biomarkers that really appear in all of the cancer samples of a certain cancer type. And this is in contrast to a CMDT that might not be found in all samples of a cancer type, but that is found in all the different cancer types in a few cases. There is none that was really found in all of the cancer types.
I guess it's my job to introduce the next speaker.
Jérémie Breda from the University of Basel is going to talk about realizing the Waddington metaphor: inferring regulatory landscapes from single-cell expression data. So the virtual floor is yours.
Thank you. So first I thank the SIB for accepting my talk. I'm Jérémie Breda from the groups of Mihaela Zavolan and Erik van Nimwegen at the Biozentrum, University of Basel. The sort of questions we're interested in in the group concern the high variability of cells that we can see in higher eukaryotes. In particular, I was interested in how cell types are defined, established and stabilized. To approach such questions, Waddington introduced his famous analogy in 1957, where stem cells are compared to marbles rolling down a landscape and following valleys. Analogously, the gene expression of a cell would be guided by an epigenetic landscape, and cells would develop through differentiation time, developmental time, until they reach stable minima that we then call cell types. So I wanted to see how far we could go with this analogy, try to give it a rigorous meaning, and try to infer such a structure from single-cell RNA sequencing data. From physics, we know that every system can be characterized by an energy function. In the system we study, due to the high number of degrees of freedom and the inherent stochasticity, we can use the well-developed framework of statistical physics, which states that the probability of finding a system in a given state, knowing only the energy of the state, is given by the maximum entropy distribution, known as the Boltzmann distribution. What this equation says is that the energy of a state defines the density of states and vice versa.
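The equation on the slide is not captured in the transcript. In its standard textbook form, and assuming here that the temperature factor is absorbed into the units of the energy (the symbols $x$ for the expression state, $E$ for the energy and $Z$ for the normalisation are notation chosen for this note), the Boltzmann relation between density and energy reads:

```latex
\begin{align}
  p(x) &= \frac{1}{Z}\, e^{-E(x)}, & Z &= \int e^{-E(x)}\, \mathrm{d}x, \\
  E(x) &= -\log p(x) - \log Z .
\end{align}
```

Since $\log Z$ is a constant, the landscape is determined by $-\log p(x)$ up to an additive offset: regions of high cell density are low-energy valleys, and sparsely populated regions are ridges.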
So if we can estimate the density of cells in the space of gene expression, we can infer the landscape, and this landscape would have different applications, like defining cell types as local minima, defining differentiation paths as valleys of the landscape, and it would allow us to ask what is the minimal perturbation needed to bring a cell from one state to another, similarly to an activation energy in a chemical reaction. As I said, we want to infer such a structure from single-cell RNA sequencing, but there's a caveat: what RNA sequencing measures is not exactly gene expression, but the number of mRNAs per gene and per cell. So let's take a gene with a constant transcription rate and a constant decay rate. Its expected expression is going to be the ratio of those two rates. I'm going to call that the transcription activity. Now that's only an expected number. Due to the biochemical process of transcription and decay, we expect that the number of mRNAs in a cell is going to follow a Poisson distribution with that transcription activity as its mean. On top of that, in the protocol of single-cell RNA sequencing, only a small fraction of the total mRNA is captured, adding another layer of sampling. But fortunately, the combination of those two Poisson processes stays Poisson, with a mean being this transcription activity times the capture probability. And then finally, we of course don't expect transcription and decay to be constant across cells. We expect heterogeneity in transcription activity. In other words, the total variance that we're going to measure in the mRNA counts is going to be the sum of two terms: the Poisson variance due to Poisson noise, which doesn't have an interesting biological meaning in that case, and the variance of the transcription activity, which we want to keep. To solve that problem, we developed a Bayesian model that removes the Poisson noise.
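The two statistical facts used above (a Poisson count followed by random capture is still Poisson, and the measured variance is the Poisson part plus the variance of the activity) can be checked with a small NumPy simulation. This is only an illustrative sketch: the capture probability, the activity value and the gamma distribution used for cell-to-cell heterogeneity are arbitrary choices made here, not parameters from the talk or from the speaker's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200_000
capture_prob = 0.1  # hypothetical capture efficiency

# 1) Constant transcription activity: Poisson mRNA numbers, then random capture.
activity = 50.0
mrna = rng.poisson(activity, size=n_cells)
captured = rng.binomial(mrna, capture_prob)
# For a Poisson variable the mean equals the variance; both should be
# close to activity * capture_prob.
print("thinned mean, variance:", captured.mean(), captured.var())

# 2) Heterogeneous activity across cells (gamma-distributed, an assumption).
activities = rng.gamma(shape=4.0, scale=12.5, size=n_cells)
counts = rng.poisson(activities * capture_prob)
poisson_part = counts.mean()                            # Poisson variance equals the mean
biological_part = (activities * capture_prob).var()     # variance of the scaled activity
print("total variance:", counts.var())
print("Poisson part + activity part:", poisson_part + biological_part)
```

The second comparison is the decomposition the speaker describes: only the activity term carries biological information, and the Poisson term is what a denoising model needs to subtract.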
To do so, we first estimate the mean and the variance of the transcription activity for each gene. And then we estimate the transcription activity in each cell separately. We made that algorithm available on a GitHub page, and we detailed the derivation and did a benchmarking in this preprint. So let me show you one of the main results. As I said, the main goal of this method is to remove the Poisson noise. To test this ability, we took a data set of mouse embryonic stem cells consisting of 80 single cells and 80 aliquots. The aliquots were smartly designed by pooling together the mRNA content of several cells and then sampling back a single-cell-equivalent content of mRNA. And the nice thing with those aliquots is that by construction, they have only Poisson noise. This Poisson noise is well characterized by the strong negative dependence between the coefficient of variation and the mean. So if we first look simply at the raw counts, without doing anything, you see this strong dependence between the CV and the mean. Now, a common normalization for bulk RNA sequencing is transcripts per million, which normalizes for differences in sequencing depth per sample, or per cell in that case. And as we expect, this dependence is not removed. Now, as I said, we benchmarked against several other normalization methods. Those are the different methods we tested. They remove this dependence to some extent, but only Sanity, which is here in the middle, almost completely removes that dependence while at the same time finding the same very low base level of variance for both the single cells and the aliquots, as we would expect. Okay, so we believe that with Sanity we have a good estimate of the density of cells in the space of gene expression. With these transcription activities, we can go back to our problem of how to infer this landscape. So let's think about what this epigenetic energy can be.
In 1957, when Waddington introduced this idea, he imagined that this landscape would come from a complex system of interactions underlying the epigenetic landscape, which would be due to the chemical tendencies which the genes produce. So in fact, we don't want to infer this landscape in the whole space of gene expression, but only in the space of gene regulators. To infer the state of the regulators, we used a model that was developed in the group of Erik van Nimwegen about a decade ago called MARA, for Motif Activity Response Analysis. Very quickly, what this model does is first predict binding sites of transcription factors on the promoters of the genes and binding sites of microRNAs on the three prime UTRs of the transcripts. Then it interprets gene expression as a linear combination of the number of motif sites on that gene times the activity of the corresponding motif. So we use predicted binding sites and observed gene expression to infer a motif activity per cell. Okay, so now we can reconstruct this landscape. One of the first applications that arises is that it defines cell types as local minima. To do so, we start from the estimated regulatory activity of each cell and follow the constructed landscape, descending down the gradient until we reach a minimum. Then we call each minimum a cell type and we assign each cell to the valley it's located in. We applied this framework to a published data set of 2000 human pancreas cells. Here I show you each cell as a point projected in the space of the first three principal components. We applied that framework and could retrieve the main known types of pancreatic cells, which I show you in different colors here. Now each of those cell types has a precise location in the space of our gene regulators. So we could ask which regulators most distinguish those different types. And I show you here those different factors projected in that same space.
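The step of following the landscape down the gradient until a minimum is reached can be illustrated on a toy example. The sketch below uses a one-dimensional double-well energy chosen purely for illustration; the real landscape lives in the high-dimensional space of motif activities and is inferred from data, and the step size, tolerance and starting positions here are arbitrary assumptions.

```python
import numpy as np


def energy_gradient(x: float) -> float:
    """Gradient of the toy double-well energy E(x) = (x**2 - 1)**2,
    which has two minima (two 'cell types') at x = -1 and x = +1."""
    return 4.0 * x * (x**2 - 1.0)


def descend(x0: float, step: float = 0.01, n_steps: int = 5000,
            tol: float = 1e-10) -> float:
    """Plain gradient descent from x0 until the gradient is (almost) zero."""
    x = float(x0)
    for _ in range(n_steps):
        g = energy_gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x


# Hypothetical one-dimensional 'regulatory activities' of four cells.
cells = np.array([-1.6, -0.4, 0.3, 1.2])
minima = np.array([descend(x) for x in cells])
labels = np.round(minima).astype(int)   # the minimum each cell rolls into
print(dict(zip(cells.tolist(), labels.tolist())))
```

Cells that end in the same minimum receive the same label, which is the sense in which a valley of the landscape defines a cell type.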
And we could corroborate from the literature those factors as important regulators of pancreatic cells. One last, more complex and dynamic example: we considered mouse embryonic stem cells during neurogenesis, in an experiment made in collaboration with Tanzila Mukhtar from the group of Verdon Taylor in the Department of Biomedicine in Basel. She FACS-sorted neural stem cells at day 13.5 from the forebrain cortex of mouse. The forebrain cortex of both human and mouse is created in several successive layers of distinct neurons. Those neurons come from the asymmetric division of neural stem cells. Those neural stem cells can also undergo symmetric division and proliferate to give rise to more neural stem cells. So again, we applied this framework and inferred that landscape, and I show you here this landscape projected on the first two principal components. You can see a sort of main valley connecting different minima, shown in red. So I constructed the minimal energy path connecting those different minima, where a minimal energy path is defined as a path connecting any two points of the landscape along which the energy is minimized. So we have a path that is defined in the whole space of gene regulators. Again, we can ask which regulators are most variable along the path. I show you here the top 16 most variable motifs, and you can see sort of three groups of genes having similar behavior. Now, each of those factors has predicted gene targets, so I could do Gene Ontology analysis on those target genes and ask what biological processes are being regulated by those factors. You can see a first group that is highly active on the right-hand part of the trajectory, which are all related to neural differentiation processes. Another group of factors is high in the middle, which are all related to S phase and replication in general. And finally, a few genes are high in the left-hand part, which are all related to mitosis.
This corresponds to what is known about the system, which is that those neural stem cells undergo symmetric division, so cells stay in the cell cycle and proliferate, or can alternatively undergo asymmetric division and give birth to basal progenitors that can further differentiate towards different neuron types. Okay, so to summarize: first I showed you a Bayesian model to remove the Poisson noise and infer transcription activity from single-cell RNA sequencing data. Then I showed you how to map this transcription activity to a space of regulatory activity for each cell. And finally, I showed you a few simple examples of how to use this constructed landscape to define cell types as local minima, identify regulators that distinguish those different cell types, and finally identify developmental paths with their associated regulators. With that, I want to thank my supervisors, Mihaela Zavolan and Erik van Nimwegen, as well as both their groups. And thank you for attending and listening.
Great. Thank you, Jérémie, for that nice talk. I think we have time for one or two questions. So let me see in the chat here. Okay, here's one from Pierre-Luc Germain. I'll just ask one of them. He said: how reasonable is it to take density in the gene expression space as a proxy for stability? In particular, doesn't the abundance of the cell type influence density in a way that doesn't relate to stability?
I mean, I would assume that no matter how big... okay, if we take one cell type, we assume that we have a group of cells that all have the same state. I guess the stability is going to be defined by how heterogeneous they are around their minimum. And I think that should not depend in principle on the number of cells that we have in that group. So I assume that the more cells we have, the more precise we are going to be, but the kind of shape of this landscape around a minimum defined by a group of cells, I think, should not.
I mean, in the limit where we have a really low number of cells, then I guess we won't be precise; we probably need a few cells. But otherwise, I guess we can estimate a sort of variance from any number of cells.
Okay, thanks. Maybe just one more quick one that's kind of general. This one's from Frédéric Bastian. He asks: if I'm only interested in determining cell types, what is the added value of your method compared to simply using the expression of a few marker genes?
I mean, I guess you could also try to infer a landscape in the space of transcription activity, but I also believe that there would be other methods that work better to cluster the cells than this one. I guess the added value, the additional information at least, would be that those cell types are defined in a space of regulators. So you could ask which transcription factors or microRNAs are distinguishing the different cell types, rather than how those cell types differ in terms of gene expression.
Okay, thank you. And so with that, I will wrap up this session on genes and genomes. I want to say thank you again for the four really interesting talks from the speakers this morning, and also thanks to all the attendees for joining this session. Now that it's over, all of the audience members, if you want, can join the Meet the Speaker session. All of the questions that you put in the chat that we didn't have time to answer will be answered in that session. You can join by leaving the current session and then clicking on the camera symbol next to the Q&A session of genes and genomes.
And also, about the questions that remain unanswered, there is a Slack channel created called follow-up discussion. So followup, one word, then a dash, then discussion. In this channel, you can also post follow-up questions to the speakers. But for now, we'll be in the Meet the Speaker room.
So I think that's it, unless I missed something, but no, I think it's all good. We can actually go on [inaudible]. Okay, everybody. Thank you. Yep.