I think we should start. So hello, everyone. Welcome to the Computational Approaches session. I'm Luciano Cascione, head of the Bioinformatics Core Unit at the Institute of Oncology Research in Bellinzona. This afternoon we are having eight talks: five five-minute talks and three ten-minute talks. And I'm very glad to co-chair the session with Franziska.

Yeah, hi, everyone. My name is Franziska Singer, and I'm a bioinformatics scientist at the Clinical Bioinformatics Unit at NEXUS Personalized Health Technologies at ETH Zurich. Before we actually start with our first talk, I would like to remind every attendee to please use the Q&A button below your Zoom window to post questions, and try to make them concise in order to facilitate an easy discussion later on. And with that, I want to introduce our first speaker today, which is Charlotte Soneson from the group of Michael Stadler at the FMI in Basel. She will talk about preprocessing choices and their effects on RNA velocity analysis.

Thank you very much. I will share my screen. So thank you very much, and it's great to see so many people here, actually. I'm happy to start off this session by talking a little bit, just to give a few highlights from our recent study on preprocessing choices, and specifically quantification, for RNA velocity in single-cell RNA-seq data.

So I wanted to say just a few words first about RNA velocity. One of the reasons why it is becoming so popular, and it actually also got the SIB bioinformatics resource award last year, is that it allows us to infer things about the dynamics, for example related to a differentiation trajectory, based on just snapshot data from single-cell RNA-seq. And the way it does that is by first assuming that the snapshot actually contains cells from different parts of this trajectory, and then modeling together not just the mature mRNA but also the pre-mRNA, in order to figure out, for each gene in each cell, whether the expression level of that particular gene at that point in time is on its way up, on its way down, or staying stable. And the way that the RNA velocity is typically visualized or interpreted is by projecting it onto a low-dimensional embedding of the cells, like you see here to the right, where the dots are different cells and these flow lines indicate the velocity flow, the main flow in these data sets.

So what did we do? We were particularly interested in the abundance quantification, so how to get the mRNA and pre-mRNA counts that you need to do the velocity analysis. And the way we did that was that we looked at public data sets: we collected droplet single-cell RNA-seq data sets where we knew that there was some kind of known dynamics, some trajectory, a differentiation trajectory, for example. Then we applied twelve different quantification approaches to each of these data sets, and these twelve approaches represent four different software tools run in different ways: velocyto, STARsolo, kallisto|bustools, and alevin, run with different parameters. Each of these tools or approaches gives us a pair of matrices of pre-mRNA and mRNA counts. And for each of these pairs, we then apply the same pipeline of RNA velocity estimation, to keep that part consistent, and then we compare the results.

So I will not go too deep into the results here for lack of time, but I think one of the main messages that I really would like you to leave here with is that the counting really matters.
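To make the "on its way up or down" idea concrete, here is a minimal numpy sketch of the commonly used steady-state velocity model; this is a simplification for illustration, not the exact estimation pipeline used in the study. For each gene, the ratio of unspliced (pre-mRNA) to spliced (mRNA) abundance expected at steady state is estimated by regression, and cells with more unspliced signal than expected are inferred to be upregulating the gene.

```python
import numpy as np

def steady_state_velocity(spliced, unspliced):
    """Toy per-gene steady-state RNA velocity estimate.

    spliced, unspliced: arrays of shape (n_cells, n_genes) with mRNA and
    pre-mRNA counts (or normalized abundances). Returns an array of the same
    shape whose sign indicates whether each gene's expression is inferred to
    be increasing (+) or decreasing (-) in each cell.
    """
    n_cells, n_genes = spliced.shape
    velocity = np.zeros((n_cells, n_genes))
    for g in range(n_genes):
        s, u = spliced[:, g], unspliced[:, g]
        # Least-squares slope through the origin: at steady state we expect
        # u ~= gamma * s, where gamma plays the role of a relative degradation rate.
        denom = np.dot(s, s)
        gamma = np.dot(u, s) / denom if denom > 0 else 0.0
        # Positive residual: more pre-mRNA than expected, so expression is rising.
        velocity[:, g] = u - gamma * s
    return velocity

# Example with random placeholder data standing in for real count matrices.
rng = np.random.default_rng(0)
s = rng.poisson(5.0, size=(100, 3)).astype(float)
u = rng.poisson(1.0, size=(100, 3)).astype(float)
print(steady_state_velocity(s, u)[:3])
```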
The way you get these pre-mRNA and mRNA counts really makes a difference. And it doesn't just make a difference on the count matrix in itself; it also propagates to the velocity estimates and to the biological interpretation. So what we see here are two representations of the same data set, quantified in two different ways, and then RNA velocity estimated and overlaid onto this UMAP representation. The purple arrows indicate areas in these plots where the streamlines, the estimated flow from the velocity, actually point in completely opposite directions depending on how you do the abundance estimation. So that's kind of the first takeaway. In our study, we go into more detail on this: we try to explain why these differences occur, which methodological choices in the different approaches cause them, and we make some recommendations.

The other thing I wanted to point to is exactly that: we also propose a workflow based on our evaluations. This is specifically for droplet single-cell RNA-seq, for example from 10x Genomics. The workflow has two steps. The first thing we need to do is to prepare the reference: we need to extract transcripts and introns from the genome, and we implemented this in a straightforward way in a Bioconductor package called eisaR, which you can use easily. The second step is that we do the abundance estimation with alevin, which is a command-line tool provided within the salmon software suite. Those abundance estimates we then re-import back into R with a Bioconductor package called tximeta, which also reformats the data so that you can directly feed it into the velocity pipelines. If you want a detailed tutorial, you can go to this URL here. And with that, I would like to round up, just to say that there is a preprint if you want to know more. You can join me in my poster room later this afternoon or in the Meet the Speakers room. Thank you for your attention.

Thank you, Charlotte, for this great talk; questions will follow in the Meet the Speakers room for the sake of time. OK, so now it is the turn of Anna Koroleva from the Zurich University of Applied Sciences. She will talk about work towards creating a new knowledge base for literature-based discovery. So please, Anna.

Thank you for introducing me; let me start my screen sharing. Can you see my screen now? OK, I hope you can. So my name is Anna, and I'm a postdoc working with Maria Anisimova and Manuel Gil, and I'm going to present our ongoing work on creating a new knowledge base for literature-based discovery. Sorry for that.

So first of all, I want to explain what literature-based discovery, or LBD, is. It is a field of research aiming at discovering implicit knowledge by mining scientific literature. This area emerged in the 1980s, when the scientist Don Swanson was manually looking through titles of scientific papers and noticed connections between certain concepts. For example, Raynaud's disease is associated with parameters such as platelet aggregability and blood viscosity. On the other hand, dietary fish oil is also related to platelet aggregability and blood viscosity. But Raynaud's disease and dietary fish oil never occurred together in the literature. He supposed that dietary fish oil could be used to relieve the symptoms of Raynaud's disease, and this was later confirmed clinically. So Don Swanson formulated the first LBD paradigm based on this discovery and some other discoveries he made.
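As an aside on the preceding talk: the "same pipeline of RNA velocity estimation" applied to each pair of count matrices is not spelled out above. A typical downstream estimation with scVelo, one widely used implementation, looks roughly like the sketch below; the input file name is a placeholder, and the study's actual pipeline and parameters may differ.

```python
import scanpy as sc
import scvelo as scv

# Placeholder path: an AnnData file with "spliced" and "unspliced" layers,
# e.g. produced by one of the quantification approaches compared in the talk.
adata = sc.read("quantified_counts.h5ad")

# Standard scVelo preprocessing: gene filtering, normalization, and
# nearest-neighbour moments of the spliced/unspliced abundances.
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# Velocity estimation and the cell-to-cell velocity graph.
scv.tl.velocity(adata)
scv.tl.velocity_graph(adata)

# Project the velocities onto a low-dimensional embedding as streamlines,
# as in the UMAP figures described in the talk.
sc.tl.umap(adata)
scv.pl.velocity_embedding_stream(adata, basis="umap")
```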
This paradigm is called the ABC paradigm. It says that a term A and a term C may never occur together in the literature, but if they are connected via some B terms, intermediate terms, it can mean that there is a meaningful connection between them. Why does this matter? Discovering this kind of connection can reduce the time needed for drug discovery and drug repurposing. But of course, if this is done manually, it still takes lots of time and effort. So nowadays there are lots of automated systems that try to perform literature-based discovery.

I want to describe how a typical LBD system looks. Usually it starts, of course, from the literature and performs some syntactic parsing to represent each sentence as a tree. After that, semantic parsing is performed to produce triples. A triple is a set of objects, usually a subject, a predicate and an object, and it represents a relationship, for example "dietary fish oil improves blood viscosity". These triples are aligned to ontologies and knowledge bases to normalize the entities and to retrieve additional information. All this is represented as a graph, and this graph is mined using discovery algorithms to make discoveries. For us, the most interesting point is the triple store, because most LBD systems do not perform the first two steps themselves; they just use an existing triple store. The most commonly used knowledge base for LBD is called SemMedDB. But although it is widely used, it has some limitations, and this motivates our work. Our goal is to create a new database for LBD addressing the limitations of SemMedDB.

So what are the limitations, and in what way do we want to address them? SemMedDB uses methods such as rules and dictionary lookup, and we want to use more state-of-the-art methods of natural language processing, such as machine learning and distant supervision. SemMedDB uses only MEDLINE titles and abstracts, which of course is not a complete source of information on scientific publications, so we want to use full texts from the PubMed Central Open Access subset. SemMedDB has limited coverage of entity and relation types: it only has 30 types of relations, and the entities are based on the UMLS Metathesaurus, so no other types of entities are present if they are not in the thesaurus. So we want to add more entity and relation types, and we plan to use OBO ontologies, that is, Open Biological and Biomedical Ontologies, to retrieve additional information on our entities. And finally, we plan to use a different design for our database: we want to use the RDF format, following Semantic Web standards, which would allow us to provide a SPARQL endpoint for performing queries on the database.

The last thing I have to say is that the description of our future database and methods was accepted at the first international workshop on literature-based discovery, so we have the approval of the community, and we are working on it now. That's it from me today. If you have any questions or comments, please join the poster session today, or drop me an email, or we will meet in the Meet the Speakers session. Thank you.

Thank you very much, Anna. So now it is the turn of Yannis Nevers from the University of Lausanne, and he will talk about HOTs, hierarchical orthologous transcripts. Please, Yannis.

Yes, hello everyone. Can you see my screen? I guess you can. Yeah, so thank you for the introduction. I'm glad to be here to present my poster project, hierarchical orthologous transcripts.
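To make the ABC paradigm from Anna's talk concrete, here is a toy sketch in Python; the documents and terms are made up purely for illustration. Given which terms co-occur in which documents, it proposes C terms that never co-occur with a chosen A term but share at least one intermediate B term with it.

```python
from collections import defaultdict

# Toy co-occurrence data: document -> set of terms mentioned in it.
documents = {
    "doc1": {"raynaud disease", "platelet aggregability"},
    "doc2": {"raynaud disease", "blood viscosity"},
    "doc3": {"dietary fish oil", "platelet aggregability"},
    "doc4": {"dietary fish oil", "blood viscosity"},
    "doc5": {"aspirin", "headache"},
}

def cooccurring(documents):
    """Map each term to the set of terms it co-occurs with in any document."""
    neighbours = defaultdict(set)
    for terms in documents.values():
        for t in terms:
            neighbours[t] |= terms - {t}
    return neighbours

def abc_candidates(a_term, documents):
    """C terms never seen with A but linked to A through shared B terms."""
    neighbours = cooccurring(documents)
    candidates = {}
    for b in neighbours[a_term]:
        for c in neighbours[b]:
            if c != a_term and c not in neighbours[a_term]:
                candidates.setdefault(c, set()).add(b)
    return candidates

print(abc_candidates("raynaud disease", documents))
# e.g. {'dietary fish oil': {'platelet aggregability', 'blood viscosity'}}
```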
First, I'm going to talk about orthology inference. As you may know, there are many orthology inference methods; the idea is to infer homologous relationships between genes, orthologous and paralogous relationships. And to do that, they consider one canonical isoform per gene, often the longest, and ignore alternative transcripts. That's understandable, but it causes some issues. First, the longest isoforms are not always the best matching; actually, they are not in 34% of the cases, which can lead to errors in the inference. Then, orthologous genes are sometimes used for function transfer. But if you don't take into account alternative transcripts, you cannot know, for example, that one function is associated with transcript A2 in human and that there is no homologous transcript in mouse; so if you try to transfer the function, you make false guesses. And finally, if you don't take into account alternative transcripts, there is also no information about the evolutionary history of alternative transcripts, which is problematic if you are interested in that.

So my project's goal is to solve these issues by inferring hierarchical orthologous relationships between alternative transcripts, using hierarchical orthologous groups as a base. HOGs, the hierarchical orthologous groups that are available in OMA, are groups of genes descended from a common ancestral gene at a given taxonomic range. They are hierarchical because they are nested groups: they are grouped at the lowest level of the taxonomy and you can go up from there. From the HOGs you can derive gene trees and, using the isoform sequences, try to get the evolutionary history of the alternative transcripts of the genes in a HOG, again at a given taxonomic range. That's what I call HOTs.

To do that, we first need a way to compare alternative transcripts, so we plan to use a spliced alignment tool from collaborators at the University of Sherbrooke. The core concept of spliced alignment is to align exons together according to similarity, identity and conserved order. Originally it was designed to align CDS to the genome, but you can also use it for homology determination, so to align homologous transcripts together. If you have similar exons in the same order in homologous genes, we can hypothesize that this structure probably existed in the common ancestor.

So using that, we get back to the species tree, and we have the isoform sequences here. Going up the tree, we do the spliced alignments of whole isoforms and group the homologous isoforms together. For the human species we have three groups of isoforms, and we go up the tree. And when adding the mouse here, we see that two of the groups contain mouse sequences and this one does not. So what can this tell us? First, we can know where the gain and loss events of isoforms are. From this view we could infer, for example, that one gain of an isoform occurred in the common ancestor of human and mouse. You also know what the conserved isoforms between species are, for function transfer or other uses. So if one isoform is especially conserved, you could guess that it is a functional isoform of the gene. With that, we have the evolutionary history of alternative transcripts, and we can move toward new applications.
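As a toy illustration of the exon-based comparison idea (not the actual spliced-alignment tool mentioned in the talk), one can score two isoforms by the longest run of homologous exons that appear in the same order, and group isoforms whose structures match well. The isoforms and exon labels below are invented for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two exon-label sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def structure_similarity(iso1, iso2):
    """Fraction of exons shared in conserved order between two isoforms."""
    return lcs_length(iso1, iso2) / max(len(iso1), len(iso2))

# Made-up isoforms: each is a tuple of homologous-exon labels in genomic order.
human_A1 = ("e1", "e2", "e3", "e4")
human_A2 = ("e1", "e3", "e4")        # skips exon e2
mouse_a1 = ("e1", "e2", "e3", "e4")
mouse_a2 = ("e1", "e4")

for name, iso in [("mouse_a1", mouse_a1), ("mouse_a2", mouse_a2)]:
    print(name,
          "vs human_A1:", round(structure_similarity(iso, human_A1), 2),
          "vs human_A2:", round(structure_similarity(iso, human_A2), 2))
```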
So it could be used for genome annotation, because if you know what the conserved isoforms in a clade are, you could look into a new genome of a species in this clade, hypothesizing that the conserved isoform is also present: you look for the exons and can even make the hypothesis that the isoform exists. And you could also improve orthology inference, because this is not a trivial problem, with particularly difficult cases when there are duplications or losses of different isoforms. And if a HOG has a non-parsimonious HOT, with a lot of gains and losses of isoforms, it could be a warning that we need to test the solution and find one that could be more parsimonious. So these are two of the applications; there are a lot of others you can imagine. With that, I conclude my presentation. Thank you for your attention. I have a poster in the Computational Approaches session today.

Thank you very much, Yannis. Thank you very much for your presentation. Could you stop sharing your screen, please? Yes, okay. Thank you very much. So now it is the turn of Elena Montego-Borbolla from the Computational Intelligence for Computational Biology group. So please. Elena, we can't hear you. Okay, now. Is this okay?

Well, thank you all for joining this presentation. My name is Elena and I'm going to introduce you to my master's thesis project, which is about a computational approach to decision-making in the ICU. Why do we focus on ICUs, the intensive care units? Well, these are very costly hospital units, both in terms of monetary and personnel costs, and their use is expected to increase, not only in cases of a pandemic like the one we are going through, but also because of the aging population. And these units have the particularity that doctors must integrate a lot of complex data from a diverse range of patients, which can hamper fast decision-making. In order to improve this decision-making, many people have made use of clinical information systems to develop machine learning approaches, but not all of them have done it in a real-time manner, which would be an approach that could actually be applied. Some others have done it, for instance, for predicting mortality or sepsis risk. But we wanted to focus on pneumonia and influenza, because it is one of the most common diagnoses. So can we detect this diagnosis from signalomics data, that is, the data from the signals recorded from the patients that are connected to the monitors, and can we do it in a real-time manner?

Well, in order to do so, we constructed a data set from MIMIC-III, which is a database of ICU patients. We extracted the charted vitals as well as some laboratory tests to get a data set composed of over 9,000 samples from adults, with 35 features of three different categories: first, dynamic and continuous, such as the heart rate or oxygen level; second, dynamic and discrete, such as the Glasgow Coma Scale; and third, constant, such as sex or age. And we only consider the first 24 hours of the stay. As you can imagine, the raw data is quite messy: it is incomplete, it has missing values and outliers, and of course the measuring times are quite irregular. So we needed to pre-process this data to make it useful for a machine learning classification algorithm: we imputed the missing values and we binned the data into 30-minute intervals. From there, we derived two different data sets. Well, it's the same data, but organized differently.
One is the full version, and the other one is a sliding-window version, in order to test it for a real-time application, and this is used to binary-classify these samples.

These are the results so far. On the left side, you can see a figure where the accuracy of different supervised and ensemble methods is shown, for both the sliding-window data set and the whole data set, and on the right, the test accuracy of different neural network architectures. The neural networks do not happen to perform very well. We expected that recurrent networks would be of great use because of their tendency to catch temporal patterns, but they do not have very strong accuracy. However, two of the ensemble methods have very strong accuracy, almost one, reaching one in the sliding-window version of the data set: those are the random forest and the bagging classifier, as you can see here in the picture. The others did not perform that well for a real-case application.

So, to conclude, the research question got answered: we can detect the pneumonia and influenza diagnosis from signal-only data, but on a windowed version of the data set, and as this window can be made even shorter, we could do it in a real-time manner. Therefore, implementing this approach in the ICU could help doctors to rapidly detect patients with a pneumonia and influenza diagnosis. I wanted to thank all of you for your time listening to me, and also my team for their support. Here are my contact details. I'll also be presenting this poster later on in the poster session, and I'll be happy to answer whatever question or comment you have. Thank you very much.

Okay, thank you very much, Elena. There are a couple of questions for you, Elena, Yannis and Anna, but we will answer them in the Meet the Speakers session. So now we have another talk, and this talk will be given by Mariia Bilous from the Computational Cancer Biology Group at the University of Lausanne.

Hello, can you see my slides? Yes. Hello, thank you, and thanks to the organizers for giving me this opportunity to present. My name is Mariia and I'm a PhD student in the group of Dr. David Gfeller. Today I would like to talk to you about simplification of single-cell RNA-seq data. As you know, single-cell RNA-seq is a powerful technique for studying cell populations, and as bioinformaticians you might be familiar with its main feature, which is at the same time a challenge: high dimensionality, meaning that each cell is represented by thousands of genes. With the development of the technology, we have more and more cells sequenced per experiment, and you may expect that many of those cells carry redundant information. And of course it is more challenging to visualize and analyze such large-scale data sets.

So the objective of my project is to reduce the number of cells by merging very similar cells carrying redundant information into supercells, so basically to go from a large-scale data set to a smaller version. But what is more important is to demonstrate that this smaller data set can be used for the downstream analysis, which usually includes clustering, differential expression, gene-gene correlation, et cetera. Today I will focus on this part, and I will try to convince you that we can use this simplified data set, because the results of those two analyses are quite consistent. As an example, I will use this data set of CD8 T cells. CD8 T cells are known to be very heterogeneous, but without very distinct clusters.
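To sketch the kind of pipeline Elena described (irregular vitals resampled into 30-minute bins, imputation, then an ensemble classifier), here is a minimal example on synthetic data; the vital signs, window handling and model settings are illustrative assumptions, not the thesis code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
N_BINS = 48  # 24 hours in 30-minute intervals

def make_patient(label):
    """Synthetic, irregularly sampled vitals for one 24-hour ICU stay."""
    minutes = np.sort(rng.uniform(0, 24 * 60, size=40))
    hr = rng.normal(95 if label else 80, 10, size=40)    # heart rate
    spo2 = rng.normal(92 if label else 97, 2, size=40)   # oxygen saturation
    return pd.DataFrame({"minute": minutes, "hr": hr, "spo2": spo2}), label

def featurize(df):
    """Average each vital in 30-minute bins and impute empty bins."""
    binned = (df.assign(bin=(df["minute"] // 30).astype(int))
                .groupby("bin")[["hr", "spo2"]].mean()
                .reindex(range(N_BINS)).ffill().bfill())
    return binned.to_numpy().ravel()

frames, labels = zip(*(make_patient(i % 2) for i in range(200)))
X = np.vstack([featurize(df) for df in frames])
y = np.array(labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```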
As you can see here, this is a network-view representation of single-cell RNA-seq data. Probably you are more familiar with UMAP or t-SNE; this is a kind of equivalent. Here, as in t-SNE or UMAP, each dot represents a cell, and proximity represents transcriptional similarity. And here you can see a simplified version of this network, a network of supercells at a graining level of 50, which basically means that we have 50 times fewer supercells in this data set than single cells in the initial data set.

The first step of the downstream analysis is usually clustering, so let's just cluster those two data sets. From this plot, we can see that the clusterings are quite similar, but let's investigate this in a more systematic way. For this, we are using the adjusted Rand index, which is a measure of clustering similarity, and we are comparing the clustering of the supercell data to the clustering of the single-cell data. Here, in this red line, you can see that this index is quite consistent across graining levels. It is also better than our negative control, in which we randomly group cells into supercells, and it is quite consistent with subsampling as an alternative way of simplifying a large-scale data set. But what is more important is that the discrepancy between the clustering of the supercell data and the single-cell data is of the same level as if we had applied a different clustering algorithm to the single-cell data.

Once we have our clusters, the next question one usually asks is: what are the markers of those clusters, basically which genes are upregulated in one cluster compared to the others. So here I am comparing those two sets of genes, defined at the single-cell and at the supercell level. And here again, in red, you can see that we can recover up to 35% of the initial markers, even if we reduce the number of cells 200 times. And that is not the case anymore for subsampling.

Okay. On the last brief slide, I would also like to show that by merging very similar cells together, we can improve the signal of some gene-gene relationships. In this small example, you can see that these two genes, which we know are supposed to correlate since they are markers of the same subpopulation, have a better Pearson correlation in our simplified data set and also match the pattern of the initial expression. Of course, this is only one tiny example, but if you want to see more and to discuss, you are welcome to join me during the poster session this afternoon.

With this, I would like to conclude and say that supercells preserve clustering and marker recovery quite well, and also improve the gene-gene correlation. What still has to be assessed is how much this facilitates the analysis of large data sets, because the idea is to use this approach to simplify the largest data sets in order to make the analysis faster and easier. With this, I would like to thank David Gfeller and all the people involved, and thank you for your attention.

Thank you very much, Mariia, for the talk; questions, I think, we will address in the Meet the Speakers room. Thank you. Let me introduce the next speaker. Fabio Rinaldi is a group leader at the University of Zurich and at IDSIA in Lugano. He is going to talk about fast, efficient, accurate entity recognition for biomedical applications. So please, Fabio. Yes, thank you. So I'll try to share my slides, so it works.
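As a small illustration of the adjusted Rand index comparison used in the previous talk, here is a sketch with scikit-learn on made-up labels; in the real evaluation, the supercell cluster labels would be derived from the data rather than drawn at random, and are propagated back to their member cells before comparing with the single-cell clustering.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Made-up single-cell clustering of 1000 cells into 4 clusters.
single_cell_clusters = rng.integers(0, 4, size=1000)

# Made-up supercell membership (graining level ~50: 20 supercells)
# and a clustering of those supercells.
supercell_of_cell = rng.integers(0, 20, size=1000)
supercell_clusters = rng.integers(0, 4, size=20)

# Propagate each supercell's cluster label back to its member cells,
# so both partitions are defined on the same 1000 cells.
propagated = supercell_clusters[supercell_of_cell]

# With random toy labels the ARI is close to zero; consistent clusterings
# would give values close to one.
print("ARI:", adjusted_rand_score(single_cell_clusters, propagated))
```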
So I'm going to present work that my group at the University of Zurich has done in the past few years and that I also intend to continue with my new group in Lugano. I recently moved from the University of Zurich to Lugano, where I am heading a new research group in natural language processing. The reason I'm presenting today at this conference, and the reason I'm also a member of the SIB Swiss Institute of Bioinformatics, is that we have worked for many years on information extraction from biomedical literature. So in this talk I will present a few tools that we developed over the years, starting with the Bio Term Hub, which is an aggregator of biomedical terminologies, and OGER, which is an entity annotator, and then I will describe some of the methods that we use.

So what is the Bio Term Hub? The Bio Term Hub is an aggregator, as I said before, of biomedical terminologies. What does that mean? Basically, we collect terminological information about biomedical entities from a number of reference databases; some of them you see in this figure, for example the Cell Ontology, Cellosaurus, ChEBI for chemicals, Swiss-Prot, and so on and so forth. We fetch the terminology from those databases, we put them together, and we provide them for annotation services. And by fetching, I don't mean simply going there, downloading them and putting them together; what I mean is a dynamic service which automatically checks the original resource. Every time you want to regenerate your terminology list, this interface, which is accessible from our website, will check whether the original resource has been updated and will offer you an opportunity to update each resource. Then, when you are satisfied that you have the resources you need, you can select them by ticking these boxes and download the corresponding terminologies. In this way we keep our terminologies synchronized with the original databases. You see we have Swiss-Prot, we have Cellosaurus, and many of the resources of the SIB as well.

These resources can be used by any text mining system; let me show you how we make use of them in our system. This is an example. Basically, the dictionaries are stored together in our internal dictionary, and this dictionary can be used to annotate text, typically scientific biomedical literature. You can see the different categories that we annotate in this example of an article. This example is derived from a recent application that we did for COVID-19: we extended our dictionary with COVoc, a dictionary provided by the group of Patrick Ruch, which also includes the most recent terminology regarding COVID-19.

This is also accessible from our website. So it is not only a nice interface that you can use in the browser; it is also an annotation service. It means that we provide a fully fledged RESTful API, which you can use to query articles: basically, you can submit text to be annotated and get back from the system the annotations in many different formats. We also evaluated this service in a number of challenges, which I will mention in a moment. If you follow the URL that I provide on this slide, you will get the description of this RESTful service. The URLs will also be provided on the final slide. This is an example of how to use the web annotation system; we just provide a query to the service.
This is the URL that you have to query, and you submit text like this. You can submit your own text, but you can also submit a PubMed ID or a PubMed Central ID. In that case, the system will automatically download the corresponding article from PubMed or PubMed Central, the abstract in the case of PubMed, the full text in the case of PubMed Central, annotate it, and give the results back to you. As I mentioned, you can get the results in many different formats. Probably the most convenient format for further processing is the TSV format. You can also get an XML format, or other formats that allow visualization of the annotations within the article. But the TSV format, of which you see an example, is particularly practical for further processing of this data.

Basically, you have to understand this figure as one line split in two. We have four different annotations here within an article. In this case, it's the article I showed before, so it doesn't have a document ID; if I had submitted a PubMed ID instead, we would see a PubMed ID. Here we just submitted the text, so there is no document ID. You see the position in the text at which an annotation has been found, the term that has been matched, a preferred form for that term within the reference database, the entity identifier in the reference database, and the origin, that is, the actual reference database. So for example, the word pneumonia has been found in MeSH diseases and also in the CTD dictionary, and the other entity has been found in MeSH, and here you see the corresponding identifiers within the original databases. Additionally, when the original resource provides a link from the term, from the concept, to UMLS, we provide the UMLS identifier as well. In this way, you can resolve some cases in which the same term appears in multiple resources, like this one, where pneumonia appears both in MeSH and CTD. Obviously, they might have different identifiers (in this case they do not), but they should be mapped to the same identifier in UMLS, so this allows you to aggregate hits with different identifiers.

This web service was tested in a competition two years ago, where different text mining systems attempting to solve an online task were tested for their efficiency in solving annotation tasks. You can see the results: our system was the best, so this is a very efficient system. Recently, we have also used it to process literature related to COVID-19, and you can see a graph that shows the number of publications related to COVID-19 in recent years. I commented on this this morning in the COVID-19 session to which I was invited. Again, you can find more information at the link at the bottom of the picture.

Now, I have a few more slides describing the methods that we use, but I will go very fast because I don't have time. Basically, there are two steps in recognizing biomedical entities. One is to recognize the span of text that describes the entity, for example "skin tumor" or "beta-catenin". The next step is to find, in the original repository, what the identifier of that entity is, that is, to which concept that particular piece of text corresponds. This seems easy, but actually there are many problems of ambiguity that have to be solved. The typical approach is to do this sequentially, but we try to solve it in parallel with a new approach.
We use a joint training approach, for which we use two methods. One is bidirectional long short-term memory networks, and the other uses BERT, which is a tool that is now very popular in the computational linguistics community. Again, I don't have much time to describe this, but I'm happy to answer questions if anyone is interested. Basically, we combine the annotations produced at the different levels, we check how compatible the different levels are, and from this we get the best possible annotations.

We also evaluated this approach in a competition last year, where the goal was to evaluate the quality of the annotation. The previous competition was about the efficiency of the system; this one was about the quality of the annotation. Again, we had very good results. I don't have the time to describe them in detail, but basically we evaluated each of the dictionary categories that we have. For example, Uberon is one of them, one of the dictionaries or resources provided by the SIB. For each of them we have a separate evaluation, and we do very well on all of them.

So I come to the conclusions. What I presented is basically a solid, easy-to-use, efficient dictionary-based solution whose lexical resources are constantly up to date: we constantly synchronize our lexical resources with the original databases. This is done through the Bio Term Hub, which collects the resources, and OGER, which is the efficient annotator. The core of OGER is a dictionary-based system, but on top of that we implement a neural network solution to remove ambiguity and get the best possible annotations. We can apply this to several different types of text: typically the scientific literature, but we also have projects where we apply it to clinical records and to social media, and I would have many interesting things to say about social media as well, if I had the time, which I obviously don't have now. The links at the bottom of the slide provide additional information about our approach and system, which is freely accessible both via the API, as described, and through a web interface, and the code is freely available as well. If you want to give us feedback, feel free to contact me at the address you see at the bottom of the slide, which is my name, fabio, at idsia.ch. IDSIA is the institute in Lugano, the Dalle Molle Institute for Artificial Intelligence. Thank you very much for your attention.

Thank you very much, Fabio, for your presentation. We have a question for you from Ana Claudia Sima. She asks: how does the accuracy of the annotations generated by the system depend on the length of the text analyzed?

It doesn't depend on the length of the text. I mean, the text can be as long as you want; the quality of the annotation is always the same, because it is a local problem: the decision depends on the context in which a term appears. And the efficiency of the system also doesn't depend on the length of the text, in the sense that we have implemented it in a way whose cost per unit of text is constant, so the total time scales linearly. The text can be of any length, the terms can be of any length, and the speed of the system will always be the same per unit of text; obviously, if you double the size of the text, you get double the time to do the annotation. So I think that now...
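A sketch of how one might call an annotation service like the one Fabio described and post-process its TSV output in Python follows. The endpoint URL, parameter names and column names below are placeholders (the real ones are documented at the URLs he mentions), so treat this as the shape of the interaction rather than the actual API.

```python
import io
import pandas as pd
import requests

# Placeholder endpoint and parameters; the real RESTful API documents its own
# URL, input options (raw text, PubMed ID, PMC ID) and output formats.
SERVICE_URL = "https://example.org/annotate"

text = "Patients with pneumonia were treated ..."
response = requests.post(SERVICE_URL, data={"text": text, "format": "tsv"})
response.raise_for_status()

# Assumed TSV columns, mirroring the fields described in the talk:
# document id, offsets, matched term, preferred form, entity id,
# source database, and UMLS id when available.
cols = ["doc_id", "start", "end", "term", "preferred_form",
        "entity_id", "source_db", "umls_id"]
annotations = pd.read_csv(io.StringIO(response.text), sep="\t", names=cols)

# Aggregate hits of the same concept found in several source databases
# by grouping on the UMLS identifier, as suggested in the talk.
merged = (annotations.dropna(subset=["umls_id"])
                     .groupby("umls_id")
                     .agg(term=("term", "first"),
                          sources=("source_db", lambda s: sorted(set(s)))))
print(merged)
```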
This brings us to the last talk of the session, which is from Maciej Bak. Thanks. I should stop here. Okay, thanks a lot. Let me just share my screen now. Okay, can you hear me? Can you see my slide? Yes. Okay, so thanks for the introduction. I will now present my PhD project, which is a novel approach to detect the influence of RNA-binding proteins on the processing of mRNA.

So, essentially, without further ado: I work on alternative splicing. Please note the purple exon here, which is a so-called cassette exon, or skipped exon, which can be included or excluded in the mature mRNA. This is therefore reflected in the amino acid sequence, and it affects, of course, the structure and the properties of the protein. This is a simple case of a splicing event. And the main question that I address is: which RNA-binding proteins regulate the inclusion of such cassette exons? In normal conditions, of course, but also in diseases: is there any dysregulation?

My approach to address this question is that I quantify two values. On one side, you can see that I can define a region, in this case upstream of the three-prime splice site. If there is a regulator that controls the inclusion of an exon, it would bind there, so it would have a binding site there. So I can quantify the binding as raw k-mer counts or, more elaborately, if you have a weight matrix for an RBP, as the probability of binding of a motif to the sequence. On the other side, if I have RNA-seq data, for every such alternatively spliced exon, once I perform transcript quantification, I can calculate the total expression of transcripts that include a given exon and the total expression of transcripts that could include it. The ratio of these two is what we call the inclusion fraction, also known as the PSI score.

So we have these two quantities, and now we can put them together. But before we do that, let me tell you that the method I developed, the tool I developed, scans distinct regions in parallel, in a sliding-window approach, so that in the end, once I get results, we have a nice profile of regulation: I get position-wise information about the regulation.

We started modeling by looking at the distribution of the inclusion fractions. They are distributed as in this histogram, and it turns out this can be well fitted by a beta distribution. This beta distribution would naturally arise if the inclusion fraction had the form of a logistic function. Now, if I specify the parameters, I can obtain, most importantly, the activity of a given motif in a given RNA-seq sample, and this is what, together with the quantified binding, drives the exon inclusion and exclusion. Therefore I can rewrite the inclusion fraction explicitly with the parameters of the model. And having that model and knowing the quantified values, we can write a likelihood for the whole data set from the RNA-seq data under such a model. Conceptually, it's very easy: it's just a consecutive series of coin tosses, where exon inclusion and exclusion correspond to success and failure. So we have this model, we optimize it with an EM algorithm, and we find maximum likelihood estimates for the parameters of the model. And we are mostly interested in these activities, right?
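A schematic numpy sketch of the kind of model just described, under stated assumptions: the inclusion fraction of an exon in a sample is written as a logistic function of a baseline plus binding scores weighted by motif activities, and the data likelihood treats inclusion/exclusion evidence as coin tosses (binomial). The exact parameterization in the actual tool may differ; the data here are simulated.

```python
import numpy as np
from scipy.special import expit  # logistic function
from scipy.stats import binom

rng = np.random.default_rng(1)
n_exons, n_motifs = 200, 5

# Quantified binding of each motif near each exon (e.g. k-mer counts or
# weight-matrix scores), and made-up "true" activities for the simulation.
binding = rng.gamma(2.0, 1.0, size=(n_exons, n_motifs))
true_activity = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
baseline = rng.normal(0.0, 0.5, size=n_exons)

# Inclusion fraction (PSI) as a logistic function of baseline + binding * activity.
psi = expit(baseline + binding @ true_activity)

# Simulated inclusion evidence: n "coin tosses" per exon, k supporting inclusion.
n_reads = rng.integers(20, 100, size=n_exons)
k_incl = rng.binomial(n_reads, psi)

def log_likelihood(activity, baseline, binding, k_incl, n_reads):
    """Binomial log-likelihood of the observed inclusion counts."""
    p = expit(baseline + binding @ activity)
    return binom.logpmf(k_incl, n_reads, p).sum()

print("log-likelihood at true activities:",
      round(log_likelihood(true_activity, baseline, binding, k_incl, n_reads), 1))
print("log-likelihood at zero activities:",
      round(log_likelihood(np.zeros(n_motifs), baseline, binding, k_incl, n_reads), 1))
```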
So we have per-sample motif activities, which then get propagated into Z-scores and aggregated, in the end, into per-motif Z-scores, a sample-wise aggregation. And this last value, the per-motif Z-score, is what tells us how well a given motif explains the differential inclusion of cassette exons. When we plotted the distribution of the Z-scores, it turned out to look like this histogram, and we developed another model. So now we have a mixture model, essentially a mixture of a Gaussian and a uniform. The idea is just to detect outliers, but rigorously speaking it is a mixture model. We optimize it again, we find maximum likelihood estimates for the parameters of these components, and we say that the motifs that are statistically significant are those which, with high posterior probability, come from the uniform component, not from the Gaussian. That's how we assess statistical significance.

I'll just briefly mention that my colleagues in the group were working on differential cleavage and polyadenylation, which is one more layer of regulation of gene expression. Please notice here that if you have these distinct poly(A) sites, one can build similar models. We define a region in the proximity of the poly(A) site, we quantify the binding of distinct motifs in that region, and now, for every poly(A) site, we can have the usage of that poly(A) site in a given RNA-seq sample. This usage comes directly from the aligned reads, from the read coverage. Here we have these two quantities, and again they developed a model that calculates activities and Z-scores and, in the end, tells us by the Z-scores which motifs explain the usage of the poly(A) sites.

This is the last slide; I'll just show you example results. I took RNA-seq data of an HNRNPC knockdown, HNRNPC being a known regulator of alternative splicing. As we can see in the plot here, the motif which is most statistically significant, with the highest activity, turns out to be this poly(U) motif, and it ties in nicely with the literature, because this is the known binding site for HNRNPC. So we captured the correct motif; it's like a positive control. And from these heat maps of Z-scores, or per-sample Z-scores, we can learn that when HNRNPC binds in the proximity of the splice sites, there is a statistically significant regulation of the inclusion of all these exons. Similarly, if it binds a little downstream of the poly(A) sites, it has a role in the usage of these poly(A) sites. Also notice that since in the knockdown we actually have positive Z-scores, it would mean that in the absence of HNRNPC the exons are included and the poly(A) sites are used. So we also have this information about the directionality, the mode of action.

And that's it. Thank you very much. I'd like to acknowledge my group, which is a great environment to work in, very stimulating, most importantly Professor Mihaela Zavolan, on the right here in the picture, who is my PI. I'd also like to thank a lot Professor Erik van Nimwegen, who helped me a lot with the statistical modeling, and Andreas Gruber, a postdoc: I started this project with him and we are finishing it together, but he led me into it. Thanks a lot.
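A toy version of the Gaussian-plus-uniform mixture used to flag significant motifs might look like the following; the EM updates and the 0.9 posterior cutoff are illustrative choices, not necessarily those of the actual tool, and the Z-scores are simulated.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Simulated per-motif Z-scores: mostly background (Gaussian) plus a few outliers.
z = np.concatenate([rng.normal(0, 1, 490), rng.uniform(-12, 12, 10)])

lo, hi = z.min(), z.max()
uniform_density = 1.0 / (hi - lo)

# Initial guesses: mixture weight of the Gaussian, its mean and std.
pi_gauss, mu, sigma = 0.9, 0.0, 1.0

for _ in range(100):
    # E-step: posterior probability that each Z-score comes from the Gaussian.
    g = pi_gauss * norm.pdf(z, mu, sigma)
    u = (1.0 - pi_gauss) * uniform_density
    resp = g / (g + u)
    # M-step: update the Gaussian parameters and the mixture weight.
    pi_gauss = resp.mean()
    mu = np.sum(resp * z) / resp.sum()
    sigma = np.sqrt(np.sum(resp * (z - mu) ** 2) / resp.sum())

# Motifs assigned to the uniform (outlier) component with high posterior
# probability are the ones called statistically significant.
posterior_outlier = 1.0 - resp
significant = np.where(posterior_outlier > 0.9)[0]
print("significant motifs:", significant, "out of", len(z))
```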
Sorry for the problems at the beginning. No problem, you are right on time now. So thank you for the talk. And maybe just one question before we move to the other room: I was wondering, how is this applicable to different types of organisms? Does that somehow change the model or the weights?

No, no, the organism actually doesn't matter. When I run this, I usually test human and mouse RNA-seq data. But essentially you have to provide the RNA-seq data of a given organism and the annotation, so the genome sequence and the genomic regions in GTF format, right? And it would also be nice if you had a poly(A) site atlas annotation for the given organism. So given that you have these resources for a genome, you can plug it in and run the analysis on whatever RNA-seq data you want. Okay. Okay, thank you. All right, thanks.

With that, we would like to conclude the session and encourage people to join us in the Meet the Speakers room; the link you can find on the webpage, and we will transition directly to this other webinar. So thanks a lot to our speakers and see you in a minute. Thank you very much.