 All the way on the left you see. So the topic of today is our ISMARA web server. And ISMARA is a tool for automatically inferring gene regulatory circuitry from high throughput data that you can upload to the website. So the website, this is its address, ismara.unibus.ch. And if you could go there, then this is the page you will see. And basically what you can do is you can upload files here. Or you can upload IDs of data sets in the SRA database. And all you really then have to do is hit Go. You might need to say what the species is from which your data derives. You can give an email address and a project name. This is for your own reference. And the email address is so that you get automated notification when your data has been analyzed. I note also to the left here, there is a login. This is we have some users from industry that license ISMARA to be run in a way that protects their data and keeps their data secure. And so we also have this licensing scheme for commercial users. But for academic users, you don't need to log in. You can just upload your data. All right, so let me try to start motivating why we developed this tool and what is basically the main idea behind it. So one of the key things that my lab is very interested in is understanding how the regulatory code in the DNA is read out by cells to control cell fate and identity. So the kind of questions that we're interested in in my group is how do gene regulatory networks function as systems? What is a cell type? Is there even something like a cell type? Can you rigorously find what this mean? How is cell identity stabilized? What is the key information that specifies cell identity and what are all the details that don't matter? And we know that in the end, this is all essentially determined by constellations of regulatory sites in the genome because it's easy to forget if you look at these pictures of different human cell types that these are all created from the same underlying group with the same genome. So we would like to sort of understand at the system level how this works. And when I came into this field about sort of 20 years ago, maybe a tiny bit more, I was very excited because we had first become able to sequence whole genomes. And we had developed technology in those days. This was my career technology to measure gene expression of all genes in parallel in cells. And coming from a physics background, I thought, well, now that we can get all this data, we have the whole genome, we can measure the expression of all genes. It's just a matter of developing some good models. And then we're going to understand how all of this works. And so one of the things I wanted to tell you about is that sort of in the last five to seven years or so, I have developed a syndrome which is known in psychology research as the Dunning-Kruger effect. And this slide is illustrating what this Dunning-Kruger effect is. So it shows the confidence that a person has in their ability to solve a particular problem as a function of how much they have studied this problem. And people that have not studied the problem at all, they also start out having low confidence in their ability to solve a particular problem. But as they start studying a particular topic, people find that individuals tend to get much more confident very quickly about their ability in a particular field. And so their confidence starts quickly rising. However, there comes a point where they've studied the topic enough so that they start getting a sense of how many things there still are that they know nothing about and that are also very important to know about if they are going to solve this problem. And then their confidence may suddenly start dropping precipitously. And so I put a picture here to show you that I'm essentially still going down on this down slide of confidence. I have come to believe that the problem of understanding what is really going on on a genome-wide level in eukaryotic cells and how they regulate their cell identity is currently way beyond our abilities to solve. And why do I believe that? Well, we think we know how to measure a lot nowadays about the state of a cell. But there are, in fact, many, there are orders of magnitude more things going on inside a cell that we do know not how to measure and we do not know about. And so we're nowhere near the ability to write down a model for everything what's going on in the cell. And moreover, the high throughput measurements that we take are full of artifacts and biases that often are poorly understood. And third, data analysis in this field often involves dizzying arrays of normalizations and filters and transformations. So at the end, it's very, very hard to understand how the results of the analysis even relate to the raw data. That went in. So this is all worrying me a lot. And the approach that we developed came out of this worry. And basically means that we want to take a humble position and say, there is no way we're going to really understand everything that's going on in the regulation of cell identity and cell fate. All we want to do is develop methods that are robust and transparent that are relatively simple and that can use high throughput data to help guide experimental efforts. That is to basically provide experimentalists with results that are useful for them to learn where to focus their next experimental efforts. So this is really the main goal of the methods is to extract from the high throughput data hypotheses that can then be followed up on by further experimental work. All right. Was there a question? No? All right. So the kind of problem that we are focused on is a problem that we have collaborated with lots of experimental groups on in the past. And these groups are interested in knowing what high throughput transcriptome or epigenome data says about the regulation that is going on in their system of interest. So here I have a couple of pictures of some systems that we've worked with some experimental groups on. And the typical questions that people are interested in are questions are, what are the key regulators in my system? What are the roles of these regulators? Which are the pathways that are targeted by each of these regulators and so on? And the challenge for an experimental group is that it's impossible to do saturating genetics through it's impossible. Let's say human has about 2,000 transcripts of factors. It also has many hundreds of families of microRNAs. And so to investigate them all experimentally would be way too much experimental to check out each one of them. However, it's relatively easy to do high throughput measurements of gene expression, let's say in a time course or under certain perturbations or in different conditions. One can do RNA sec or chip sec or some people also still do microarrays. And the problem with many experimentalists phase is that they do not have the expertise to analyze such data. And so they have to collaborate. Somebody unmute their mic. So please remember if you don't have a question, keep your mic muted. All right, so we basically wanted to develop a tool that help experimentalists analyze their own data. So to just give you a contrast of our approach with typical approaches is that some of the most common approaches to the analysis of transcriptome data is you first do basic processing. So you map your reads to transcripts. You find all genes that are dispersed. And then, for example, you use a tool like the ESEC to find which genes are differentially expressed across the conditions. Another thing that is often done is to cluster genes into groups with similar expression profiles. So you use some sort of clustering algorithm. In this case, here is a hierarchical clustering done that groups genes together in groups that are expressed in a similar way. And then downstream of this, you can do things like look at these groups of genes and ask what kind of pathways or categories are overrepresented among these co-regulated genes. So here's an example for a data set of liver development that we will look at later on, where there are genes found that are expressed early and late in development. And then when you look at the overexpressed and rich categories of genes, among those sets, you find these various categories. Now, the limitations of such approaches is that these approaches tell you which genes are changing their expression. And you may even find what groups of genes are changing their expression together. But they don't tell you anything about what's going on with respect to the regulation. It doesn't tell you what the regulators are or how they are regulating these genes. And so it's often difficult to follow up on this experimentally. So you may find that there's a group of 100 genes that are a co-regulated module. But how do you go and now validate this experimentally? This is not so true. All right, so what our MARA approach does is that after you've uploaded data, it will identify for you what are the key regulators, transcripts of factors and micronext in your system, what are the activities of these regulators across the samples that you uploaded, what are the sets of target genes and pathways of each of these regulators, what are the regulatory sites on the genome through which these regulators act, and also what are hypothesized direct interactions between the different important regulators in your system. All right, so this slide here summarizes conceptually how this approach works. So in the top left is part of the analysis that is all pre-calculated before you submit your data. So before you submit your data, we've made a annotation of all the transcripts and promoters in the genome of your organism of interest. And we also have a collection of motifs for transcription factor and micronex that mathematically represent what kind of sites each of these transcripts factors and micronex bind. We've used this to make an annotation of binding sites in the promoters and micronex sites in the 3-prime UTR of all genes. And so this gives us pre-calculated this matrix, which has for each promoter and each motif binding sites per day. The way we've done this is so in the first step, we take collections of experimentally measured transcription start sites. So in the past, we've collaborated with the Rican Institute in Japan that have done such measurements at great depths in human and mouse. And so we use those cage data to define transcription start sites. And we associate them with transcripts by using collections of full length mRNAs, for example, from GenCoat or Ensembl. And so we cluster the known starts of mRNAs. So here are a couple of examples of isoforms of a gene. And so there is some of these isoforms start here. Some of these isoforms have starts over here. And some isoforms have start over here. And then we cluster these starts with promoters that we have annotated. So these here show promoters on the genome that they have derived from this cage data. And so we basically cluster these groups of cage TSSs with starts of mRNAs that are associated with it. All right. For predicting these transcription factor binding sites, we have curated a large collection of so-called position-specific weight matrices that represent the binding specificity of transcription factors. So here are four examples of transcription factors with sequence logos of their weight matrices. And as some of you may know, the weight matrix representation is basically a matrix of probabilities. So components Wi alpha give the probability of finding a base alpha at position i of the site. So here as an example, I've shown an alignment of binding sites of the transcription factor through R from E. coli. And so now you can look down in each column what are the letters that occur at that position of the site. And at this position of the site, for example, C is the most common letter to occur, all right? So if you look at the relative frequencies of the different letters at this column, you will see that A occurs in 6%, C 53%, G 27%, and T 13%. And so this weight matrix says if you have a binding site, the probability that at position 4 there will be an A is 6%, a C 53%, a G 27%, and a T 30%. And so we find these probabilities for all sites, for all positions in the transcription factor. And so given that one can now calculate the probability that the binding site for this weight matrix will be sequence S, which is just a product of these probabilities over all positions. All right, so we use this weight matrix representation together with a tool that we have developed previously in the lab. So my lab has worked for a long time on computational methods for predicting these transcription factor binding sites. And this tool is called Motivo. And the key thing that it does is that for each promoter region in the genome, so here is a promoter with the transcription start, we take 500 base pairs upstream to 500 base pairs downstream. So we take a one kilobase region. Then we find the orthologous kilobase regions in the related species. So when we're making an annotation from human, we collect orthologous regions from mouse, dog, cow, horse, racist, macaque, and opossum, we build an alignment of these orthologous promoter regions. We also know the phylogenetic tree that relates human and these other species. And basically this gives us a model of what kind of conservation patterns we expect for positions that are evolving neutrally and also what kind of patterns we expect to see when a segment of the alignment is a binding site for a particular transcription factor. All right, so this is a fairly sophisticated Bayesian model. I will not go into the details. I'm just telling you here that we use this previously developed tool to scan all the promoter regions. So typically there are 20, 30,000 sort of promoters in a mammalian genome. So we scan all of them with the weight matrices of a few hundred transcription factors to predict binding sites in each of these promoters. So actually these binding site annotations we also provide publicly in a database which is called Streets to Regulone. You can find here, so if you go to this database you can download these annotations but you can also browse them in a genome browser. So here's an example of one promoter with the start of a transcript and these little red squares that you see or rectangles they are predicted binding sites for transcription factors in this promoter. And so for each binding site the name of the transcription factor is indicated and the intensity of the color indicates the probability that we assign that is a functional binding site. And this probability is based both on how well it matches the motif and on the conservation statistics that we see across other related species. And so for MARA we now summarize all these transcription factor binding site prediction in all the promoters by a big matrix which we call the site count matrix where the components NPM corresponds to the total number of binding sites for motif M in promoter P. So it's a count of the number of functional sites for motif M in promoter P. And so you should imagine that if there are 30,000 promoters this matrix has 30,000 rows and it has we have about 600 transcription factors or motifs 600 columns. All right, so apart from transcription factors we also include regulation by microRNAs. I will not go into a lot of details now but microRNAs they're binding is mainly determined by matching to what's called a seed sequence at the five prime end of the microRNA and eight base bases of five prime end microRNA. And we also use a previously developed microRNA target prediction method here to predict binding sites for microRNAs in the three prime UTRs of transcripts. And so these are also summarized in additional columns of the site count matrix. So now each column mu correspond to one of 86 seed families that we have in and it predicts the number of sites for motif mu in transcripts associated with promoter P. All right, this is just to note that we've recently did a big overhaul of our regulatory site annotations and collection of motifs using data from the Fountain 5 collaboration in which we took part. And so what we did is we started with a very large collection of regulatory motifs from various collections. So our own collections with regular on Jasper, Bocomoc, Homer, Unibrow, ENCOPE and HTCELIX are all different databases of motifs. We combine them all together but when you combine them all together you get a very, very redundant set of regulatory motifs. You tend to get multiple motifs from different databases for the same transcription factor. And then basically what we did is we ran our ISMARA tool on this very large collection of samples from human and mouse that was gathered in the Fountain 5 project to basically pick one best motif for each transcription factor. And after we've picked one best motif for each transcription factor we then further reduced redundancy by collapsing together highly similar motifs that were statistically indistinguishable in this very large data set. So for example, as many of you know there are these families of transcription factors that bind to very, very similar binding sites and often it's impossible to tell which of the members of the family is active in a given sample. So when that was the case, we combined these groups, these families of motifs together into a single effective motif. Was there a question? I have a question. Yeah. Hello, Professor, nice presentation and we are listening quite nicely. My question is regarding to the species. Here, this whole tool of ISMARA and phantom they're basically based on the human and mouse. I was wondering, would it be possible to also go for the other species like animals or because they have a very less motif and transcription factor information? Yeah. So I will come to that at some point. So when the tool was originally developed it was developed for human and mouse. We have a version for rat. We have also versions that we can run for E. coli or yeasts or other species for which many things are known. We've recently made a version for zebrafish and we're in the development of a version for toosophila. But basically to make this effective you do need to have a collection of regulatory motifs representing transcription factors in your organism. So if really nothing is known about that and if your organism is too distal from species for which such motifs are known then it will be difficult to do. But if your organism is close enough that for example you can map orthologous transcription factors between a species for which many motifs are known and your species that in principle can prepare an adaptation and run is more on such a species too but it requires some work. All right. So this is just summarizing now after we've done this new curation of motifs for human and mouse that we have a motifs for almost 700 transcription factors in both human and mouse and about a hundred micro RNA families. And after redundancy removal we have about 500 motif groups. All right. So in the results of MAHRA on human and mouse you will get predictions for about 500 regulatory motifs. All right. So this was the pre calculated part. Now the second part that goes into his MAHRA is the actual data that is being uploaded. All right. So people either provide RNA seek data or micro RNA data. Actually I realize now this slide is a little bit out of date. We nowadays take either raw RNA seek data raw chipset data in the form of FASQ files or even links to the SRA database as I said. All right. So this data is taken. And then what happens is so let's take the example of RNA seek data. We map the RNA seek reads to the known transcripts in the organism. And from this we derive another big matrix which we call the expression matrix which gives for each promoter P and each sample S the expression of this promoter in sample S. So we get an expression profile for each promoter across the samples. So how do we do this more specifically? So we actually make use of the Calisto tool from the group of Leo Pachter. So we map each RNA sequence read to the transcript using Calisto and then we make a slight change from the way Calisto estimates gene expression in that we when a read maps to multiple transcripts we uniformly distributed over all transcripts that it's consistently. So let me try to explain you how this expression calculation is done. So here I've shown you one, two, three, four, five different RNA seek reads. So, and in the light blue is the transcriptome annotation. These are the various isoforms that are known in this part of the genome. All right, so you see that the red transcriptor factor here will map to one, two, three, four, five, six, seven, eight different transcripts. And what we will do is we will assign this read to each of these eight and we will sign a weight of one eighth of a read to each of these transcripts. Now the blue read here only overlaps two reads. So we give both of these transcripts now a weight one half. Here the green one hits just one transcript. So we give a weight of one to this transcript and so on. And so now afterwards, I can for each transcript calculate a total weight which is just the sum of the weights of all the reads that have been assigned with. So for example, this transcript here at the top gets a weight one third from the orange read, a weight one sixth from the purple one and a weight one eighth fingerprint. All right, so we then divide this weight for each transcript by the length of the transcript. This gives us a new weight capital W of the transcript. And then for a promoter, okay. So if I take now this promoter here, this promoter is associated with multiple transcripts. So we sum all these weights of the transcripts that are transcribed from the same promoter. We then finally add a pseudo count of one half and then we rescale all these weights to represent transcripts per million and log transforming, all right? So for each promoter, the expression that we now calculate is basically the logarithm of the transcripts per million that we've estimated for this promoter in this sample. All right, so we have this pre calculated side counts and we have this, from the data, we calculate this expression distribution of each promoter across all samples. And now we're going to apply this simple model to calculate how active each regulator is in each sample. All right, so the linear model that Mara uses basically writes the expression of each promoter P in each sample S as a sample dependent constant plus a promoter dependent constant. So this you can think of as sort of an average expression level of this promoter. Plus a sum over all motifs, the number of binding sites that this promoter has for motif M times some unknown activity of each motif M in sample S, all right? And so what we want to infer is these motif activities. All right, so this goes in a couple of steps. So first each sample of the matrix is normalized by subtracting its mean expression. So this is a mean expression in this sample, it's just the average of all promoters. Then we get, we subtract this from each column, we subtract this average and we do the same for the side counts. So for each motif M, we look at the average number of binding sites across promoters and we extract that from the side counts. So this until the side counts are now really the side count in promoter P for motif M minus the average number of sites for motif M, all right? So promoters without sites will now have a negative number of promoters with side of positive number, all right? So now we model the expression for each promoter in each sample as this linear model of motif activities and we separate this model into two parts. So we first, we look at the average expression of each promoter. So this is just averaging the expression across all samples. And we model that in terms of average activities of each of the motifs. So we fit the following model. We say the average expression of each promoter is given by this linear combination of the average activities of each motif times the side count in each promoter for each motif. And there is some noise added to it. And second, we subtract these averages from the expression levels to basically get the deviations from the average expression of each promoter in each sample. And we do the same for the motif activities. So these motifs, these A till the motif activities are the difference between the motif activity of each motif in sample S and its average activity of this motif. And so we also fit this model and this model basically fits the changes in expression in terms of the changes in motif activity across the samples, right? So these normalized expression values, they're averaged to zero across all samples and similar these motif activities also average to zero. All right, so now for the more technical listeners among you, I will first tell you how we technically solve this and then I will give a more conceptual summary for the less technical of you. So technically we want to, so these are the expression changes of each promoter across each of the samples. So we write this as a linear function of these motif counts and motif activities across the samples. And we will assume that the noise that is the deviation between the observed expression levels and the predicted one from the model is Gaussian distributed. So the probability of all the expression data given the motif activities is given by such a Gaussian form. In addition to avoid overfitting, we will put a Gaussian prior on these motif activities that is and this prior is controlled by one parameter lambda, which basically says how strongly you suppress fluctuations in motif activities. Now, because both of these are Gaussian, you can now easily solve this model and solve the posterior probability of the motif activities in each sample in terms of the observed expression values and this parameter lambda of the prior. And it's given by a form like this, where, so it's a multivariate Gaussian where the precision matrix of the multivariate Gaussian is in fact given by the covariance matrix of the side counts plus a term coming from the prior. And the variance here in the denominator is just the chi squared. It's the difference between the observed expression values and the predicted expression values at the optimal motif activities. So it's a measure of how far the data deviates from the optimal fit model. And the parameter lambda of the prior is optimized through cross-validation. So what we do is we randomly select 80% of the promoters to fit the motif activities and then use the remaining 20% of the promoters as a training set and we then tweak lambda until the quality of the prediction on the training set is maximized. All right, now for the less technical of you, I will tell you what is conceptually the model. So conceptually the model is that we are going to model the expression of each promoter in each sample as a linear function of, so it's a sum across all motifs how many binding sites does this promoter have for each motif M times how active is each motif M in sample S? All right, and so we basically fit these motif activities so as to optimally predict the observed expression values. So these are the ones that come from the data that were submitted. These binding sites counts we pre-calculated and so these motif activity ones are the one. Motif activities are now the ones we're going to infer and we're referring for each sample both a fitted activity what is our best guess? It was the activity of each motif and what is an error bar on this guess? All right, so we also give for each motif an example, an uncertainty of this motif activity and we can use that to assign an overall significance by basically dividing the motif activity by its error bar squaring that and averaging that. And so basically the Z value is the Z value now summarizes how important the motif is basically is a measure of on average how many standard deviations away is this motif from having no activity. And if you want to know what is the real what is the actual meaning of a motif activity AMS? The actual meaning is if I were to take a promoter and I would to add a single binding site for motif M to this promoter, the motif activity AMS is the amount by which the log expression of the promoter would go up if I add this one binding site for the motif. All right, so it basically a measure of how much is the predicted effect of adding one binding site for a motif to promote. Okay, so I'm now ready to show you what this analysis gives you on an example dataset. So as we showed you in the, I think Patricia sent you links. Sorry, Eric. Yeah. In your model, the activity of each transcription factor is independent from the others in your linear model or you also model the dependency. Among different binding sites motifs. No. No, you assume that they act independently. Right? Right. So I mean, okay. So all these things, there's a lot to be said. So we've experimented various times with adding interaction terms. Cause both of us. And please mute the mic if you're not happy with it. I know. Okay. And so far we found that, all right, when you don't have an a priori knowledge of which motifs may be interacting, then we have basically 500 squared possible interactions. And it's very hard to avoid overfitting once you start going to such a large model space. Yeah. So far we have not found an effective way of taking interactions into account without starting to overfit a model. Yes. So right now it's just one motif. We assume this linear model where each motif is independently either add or remove it from the expression. Okay. Thank you. All right. So Patricia sent you links to some data sets. You can also find these links on all the website if I go all the way back to the beginning. Have at it again the link to the Google Doc in the chat. Okay. So the data set that I'm going to look at is the data set of mouse liver development. And so this was an RNA. So this is just a public data set that we took from the web. Here is the publication. It's RNA seek done at 12 time points in triplicate at each time point starting two days before birth until 60 days afterwards. And as I showed this is some of the results from this paper. And so they've done this clustering of gene expression profiles. There are these sort of three phases, a prebirth phase, a suckling phase and a weaning phase. And then they show you the expression patterns of groups of co-regulated genes across the time course and they found that certain groups were enriched for having genes from particular pathways like glycolysis, ketogenesis, glycogenesis in the, so these are these expression profiles of these different clusters of genes. All right, so this is from the original paper and now we will look what the predictions of Mara are if you just upload this RNA seek data to the server. All right, so if you go under example results it's all the way at the bottom. So here's the dynamics of mouse liver and then the dynamics of mouse liver where we've averaged now the results over replicates. I will come to that in a second. All right, so if you just upload this and run it right so there's no parameters to set you just upload the data and say go. The result space that you will get will basically be a list of motifs sorted by the Z value that tells you how significant each of these motifs is. So the most significant motif is E2F. It has a Z value of 5.25. And then the next column will tell you what are the transcription factors that are associated with this motif. So in this case that is the transcription E2F1. It will show you a thumbnail of the activity of this motif across the sample. So in this case across the time course and it will show you a sequence logo. All right, now in this particular case as I said, each time point was measured in triplicate and as Mikhail will show you later today you can tell Mara that you have replicate measurements and then Mara tool can automatically incorporate this information and average motif activities over replicates. So we've done this for this data set. And so now if you go to the replicate average data what you see is of course now there are less points in the thumbnail because now the replicates are now all mapped to one time point. But it reorders the significance of the motifs. So once we've done the replicate averaging actually the most important motif is now H and F4A with the Z value of 7.61. So you'll also see that the Z values have actually gone up relative to no averaging. And this is of course because you get more information if you have replicates. All right, so now what you can do is each of these names here of a motif is actually a link to more information about what this motif is doing, right? Because this is just summarizing how important are the motifs and what is a thumbnail of how they change their activity across the time course. All right, so the top motif H and F4 alpha is actually well known to be a key transcription factor in hypothesized development. So one of the reasons I took this example data set there is quite some information in the literature about the liver development. And so we can basically confirm that the top motif that Mara predicts is in fact known to be a very important developmental transcription factor in liver development. And so here I'm showing you some pictures from a paper on H and F4 alpha is an essential for specification of hepatic progenitors from human pluripotent stem cells. And they show in this paper, for example, or discuss that H and F4 alpha controls the initiation and maintenance of several key downstream transcription factors. All right, now if you go to the page for H and F4A, so and I encourage each of you to actually in a browser for try to follow along and look at the results for this mouse liver data set and see what are the results that I'm actually showing you because the things I'm not showing on the slides you actually can see them within the interface. So the first thing that you will see is you will get a list of the transcription factors that are associated with this motif. And in this case, there is only one and it's H and F4 alpha and you get some links to further information about this transcription factor. So this is just links in other websites. All right, so now second, you will see an analysis of the correlation between the expression of the H and F4 alpha gene and the motif activity. Let me just check. So in this particular case, what we report is that the correlation between the expression of the NF4 alpha gene, which is transcribed from this promoter here is 0.53 and it has a p value of about 10 to the minus three. Now, if you click here on this link here which shows a plot, this plot will appear. And so now what this shows is the activity of H and F4 alpha in each of the samples against the actual mRNA expression of H and F4 alpha in each of the sample. So each dot is a sample. Here's the motif activity that we infer here the expression of H and F4 alpha. Now this is an important point because the motif activities are entirely inferred from the expression level, not of the transcription factor itself, but only of its targets. So we infer how active H and F4 alpha is from the behavior of its targets. And so by now comparing these motif activities with the actual expression of H and F4 alpha, you can basically check if it indeed looks like the expression level of H and F4 alpha is indeed upregulated when it becomes more active. And so you see that here there is indeed a moderate correlation that H and F4 alpha tends to be higher expressed in those samples where its activity is higher with a correlation of a Pearson correlation 0.53. Note also here that the units correspond to log 2 TPM. So we see that the absolute expression level of H and F4 alpha runs from two to the five, right? So from five to seven here is from two to the five is 32 to 128 transcripts per million. Okay, so this plot also tells you whether this thing is expressed at all and how high it is expressed. All right, so now let's for comparison Oh, so the one thing that I should have told you also is that it also gives you a more detailed look at the motif activity profile. So you see that this H and F4 alpha, its motif activity is sort of continuously increasing across development. And it starts out at minus 0.4 and it goes up to about 0.3, all right? And these error bars here show you the error bar on the motif activity in each of the sample. So to now again think what are these numbers mean minus 0.4 to 0.3, well, it basically means that the effect of a single H and F4 alpha side goes from 40% reduction of expression relative to mean expression of the gene to 30% increase of expression relative to average expression, okay? So that's what they mean with this motif activity. All right, so now the next motif I'm gonna show you is E2F2, E2F5. This is one motif that is actually associated with two transcription factors. So these two transcription factors, their binding preferences are so similar that we can't tell apart whether E2F2 or E2F5 is the transcription factor binding to this motif. They can both bind with this motif, okay? So you have these two, these are the two promoters of these two transcription factors. And here you now see the Pearson correlation coefficient of the expression profiles of E2F2 and E2F5 with the motif activities of E2F2 and E2F5. And so what you see in this particular case is that E2F2 shows an almost perfect correlation with the motif activity. Whereas E2F5 shows some correlation by the much weaker one. In addition, the absolute expression level of E2F2 goes up to almost eight. It is like 250 transcripts per million. Whereas E2F5 is much lower expressed, the highest is sort of one and a half, which is sort of three transcripts per million. So the fact that E2F2 is both much higher expressed and correlates much better with the motif activity suggests that in this data set, in this system, the key transcription factor driving this motif activity is E2F2, okay? So you can use this correlation and absolute expression levels to form motifs that can be bound by multiple transcription factors to basically make a hypothesis of which transcription factor is actually responsible. All right, now, in some cases, you can actually find that there is a negative correlation between the activity of a motif and its mRNA expression value. So here's an example from further down the list. This is a motif CBPE. This motif's activity is also going up across the development, which means that the targets of this motif are upregulated across the time course. However, we find that the activity, as the activity goes up, the actual expression level of CBPE goes down. So what this indicates is that in this system, CBPE is acting as a repressor because the higher the expression level of the transcription factor, the lower the motif activity. So you can use these correlations also to find out which transcription factors are acting as activators and which transcription factors are acting as repressors. All right, okay, so far, I've just told you about motif activities and what they mean and how to tell which transcription factor is responsible for given motif activity and whether it's acting as a repressor or an activator. So the next thing we want to know about this, of course, what are the targets of each transcription factor? So Mara calculates lists of target genes in the following way. So for each motif, and I indicate here the green motif, we first find all the promoters that have predicted binding sites for the motif. Okay, because if you have no binding sites, then you cannot be targeting that promoter. But so here there is a promoter where there were binding sites for the green transcription factor. And now we basically, in silico, remove the binding sites for this green motif from this promoter, okay? So we change the site count matrix by basically going into one row, then with the row of this particular promoter and one column, then with the column of this particular green motif and we set the entry to zero. Okay, we remove the sites for this one motif in one promoter. And then basically we redo the entire fit of Mara and we calculate how much burst is the fit to the dataset with this mutated site count matrix where one motif has been removed from one promoter relative to what it was with the original site count matrix. And we calculate the log likelihood of the data using the original site count matrix in this mutated site count matrix. And this then gives a score, a log likelihood score for how likely it is that this motif is regulating this particular promoter. So conceptually, the way you can think about it is that, okay, so there is some observed expression profile of this promoter and with our original dataset when we have all motifs present, the model predicted this expression profile across the samples. But now when we remove the binding sites for the green motif from the promoter, the predicted expression turns into this red curve. And we basically quantify how much worse is the fit of the red curve than the green curve. And this log likelihood ratio SPM, so this is the target score, is basically quantifying how much worse is the fit without this green binding site in this promoter than with the binding site. For the technical among you, it turns out that this log likelihood ratio can be quite easily calculated as follows. You basically just for each promoter and each sample calculate the chi-square deviation between the fit and the observed expression. And then you calculate the chi-square deviation with the new site count matrix from which this one motif has been removed in one promoter. And then this log likelihood ratio is basically the difference of these chi-squares divided by the average chi-square average across all samples and all promoters. Okay, if you wanna see the derivation of why this is the log likelihood ratio, then I refer you to Ismar paper where this is derived in detail. Okay, so the target score measures how much the square deviation between fit and model increases when sites for the motif M in promoter P are removed relative to the average square deviation between predictions and measurements across all promoters and samples. All right, and so one thing to notice is that this quantity is extensive in the number of samples. The more samples you have, the higher these targets scores tend to be. Okay, so it's like the more samples you have, the more accurately can we identify the targets of each model. Notice that the target score can also be negative for some promoters and some of the, it may actually, the fit may actually be better when you move the site than when you keep. All right, so the next thing that you will see on the results for this HNF for alpha, if you go to the page of the HNF for alpha results is simply a list of target promoters sorted by this log likelihood score. All right, so at the top of this list, here is a gene called CYP2C29. This is a cytochrome P450 family gene and it has a log likelihood ratio of 95.7, okay. So notice that we also provide here a link to where is the promoter of this target gene on the genome. And if you click on this link, you will be taken to the Swiss Regulon Genome Browser view of this promoter. So this is a zoom in of the promoter of this gene. Here is the promoter, so this is in mouse. And with all the prediction binding sites in this promoter and so you see here is the binding site for NHF for alpha on the genome. All right, so it tells you exactly where in the genome the site is that we predict is responsible of this targeting of HNF for alpha of this gene. So in principle, if you want it, you could go and now use CRISPR or something to remove this binding site and validate that HNF for alpha is indeed targeting this promoter in this system. All right, so we have such a list of targets and then now of course you can scan to this list but often there can be lots of targets given transcription factor. So we'd like to sort of also make summaries of what are the kind of pathways and the categories of genes is each of these transcription factors targeted. So what we also have further down on the page we have gene anthology categories that are overrepresented among the targets of the transcription factors. Okay, so here are examples of the most overrepresented gene anthology categories in the biological process category for HNF for alpha and there are two statistics that are given for each category. So here's a go category. This is epoxygenase P450 pathway. It's mostly given the total log likelihood that is the sum of all the log likelihood target scores for genes in this pathway and the average log likelihood per target. Okay, so you see that there are something like 30 genes in this pathway that each on average have a target score of about 12, okay? And so you can sort these lists in various ways and so you can get some idea of what are the pathways that this particular transcription factor is targeting in this system. Now in some cases the genes that are targeted by a transcription factor may not have been annotated so well in terms of their gene anthology categories and another way that you can learn about what kind of pathways a particular transcription factor is targeting is by using information from the string database. So here we use information from the string database that is from Kristian van Meren's group in Zurich. So basically what happens is for each data set we have predicted targets for each motif and then we send the list of these target genes for each motif to the string database and string returns as a picture of the known connections between these genes that are targeted by this particular transcription factor. So what we see here is a picture that is coming from string that is showing so each little ball here is one of the genes that H and F for alpha is predicted to target in this system and the links that you see between them are coming from the string database. These are known connections between these genes and you can go on this picture so you can have a link to the string database you can go on this picture you can see what is each of these genes and what is each of these links between the genes. So if you were to do this for this particular set of targets of H and F for alpha you'd see that there are a couple of clusters of genes with lots of connections between them. This cluster here are all cytochrome P450 genes. This cluster here is genes involved in the complement cascade. These are genes involved in the rare cycle and these are genes involved in human transport and coagulation. So with this kind of analysis it allows you not only to see what is H and F for alpha targeting but to sort of take it apart into various functions that are performed by the genes that H and F for alpha is target. All right, and then finally we also predict direct interactions between each transcription factor. And other transcription factors. So basically what we've gone is we basically take H and F for alpha and looked at all of its targets and asked which of these targets are themselves transcription factors with multiple. So in this, and so there is a little interactive picture that you get where H and F for alpha is in the middle and around it are other transcription factors that are directly targeted by H and F for alpha. And if you mouse over on one of these links you will actually see the name of the transcription factor in question and what the target score is. So for example, this link here says that H and F for alpha targets directly the promoter of NR1I3 with a target score of about 17.2, okay? If you look up what is this gene NR1I3 it's also known as CAR, Constitutive Andro-Stame Receptor. And this is well-known important transcription factor in liver as I'm showing here from this citation that I pulled out that says that CAR promotes differentiation and maturation of apathic like cells. So we predict that this H and F for alpha is directly targeting CAR and up regulating it in the system. And here are other examples. For example, second one NR5A2 target score of 11.2 this is called the liver receptor homologue LRH1, okay? So basically, Mara also predicts for you the sort of key regulatory circuitry around each transcription factor. What are the other transcription factors that are regulated by this transcription factor or that are regulated this current transcription factor? So for example, if you go mouse over this link here, you see that there are actually two red link going from H and F for alpha to H and F1 alpha and one link blue going from H and F1 alpha to H and F4 alpha and the target scores associated with these are 3.6 and 4.6. And so basically the prediction is that H and F4 alpha and H and F1 alpha regulate each other and in fact, if you go into the literature this is known that H and F1 alpha and H and F4 alpha target each other, okay? So this is another prediction that was coming directly out of this Mara analysis. All right, so just to take one other example of a motif. So we see these two F motifs that are both downregulated. So let's see what might be the role of these motifs. If you look at the motif activity across time you see that they're extremely well correlated with each other, these motif activity profiles of these two factors look almost the same. And this suggests that we might be looking at a single pathway and if you now try to find out what this pathway may be, G, if you go to the string picture of the targets for this transcripts of factor E to F2 in this system, you see this incredibly connected ball. So the extremely high density of links in the targets of E to F2 tells you that this pathway that E to F2 targets is extremely well studied, okay? That's typically what it means. If you see extremely connected string picture it means this is a very well studied biological process. And if you just inspect what these genes are you will see that these are all genes involved in the cell cycle. And in particular genes involved in replication and initiation of replications. If you, this is confirmed when you look at the Go categories that are most overrepresented among the predicted targets of these E to F2 and E to F1 and you see that the most overrepresented categories are DNA replication, initiation, DNA unwinding and being involved in DNA replication. And so what you find is that these E to F1 and 2 are responsible for regulating the G1S transition of the cell cycle, okay? So the fact that this is going down across the time course basically means that the amount of cell division replication is decreasing with time, across the time, across the liver maturation. Okay, so this concludes my summary of the use of MAHRA on transcriptome data. And I want to now say a little bit about that you can also use MAHRA with ChIPSEG data to learn about transcription factors that may be responsible for driving chromatin modifications, okay? So we first developed this in the context of a collaboration with the lab with Dirk Schubler in Basel. And we were particularly interested in testing this idea that sometimes the binding of transcription factors to particular loci in the genome can recruit chromatin modifying enzymes to these loci and cause chromatin modifications at these loci. Okay, so this is the kind of conceptual idea that MAHRA can also implement. And the way we do this is that if we get CHIP-CHIP or CHIPSEG measurements of nucleosome modifications, we map these two promoters, genome-wide, and we basically quantify in each promoter, in each sample the strength of this particular nucleosome modification in each of the promoters in each of the samples. Okay, so it's basically the number of reads from sample S mapping to a 2KB region centered at the promoter. So this is now from CHIPSEG data. So we add a pseudo-count and we divide by the length of the region and then we take a log. And so it's basically the log of the density of reads for this particular chromatin modification in each promoter across the samples. The particular, so as an example of an application of this approach, we were looking at chromatin modification known as H3K27 trimethylation. So this is the trimethylation of lysine number 27 on histone three. And this is well-known modification that is put by the polycomb pathway. So the polycomb repressive complex two is somehow recruited to a locus and then trimethylates histone three on lysine 27 and afterwards this methylated histones are recognized by PCR1 complexes that make further epigenetic modifications that transcriptionally silence the locus. Okay, so this is a silencing pathway and the key question that we were interested in is trying to learn more about how PRC2 is recruited to certain loci and not to other loci. And we were hypothesizing that maybe some transcription factors are responsible of recruiting PRC2 to particular loci and leading to this trimethylation of lysine 27 loci. All right, so we get data of embryonic stem cells that are differentiating in vitro into neurons. And we got data from three time points, stem cells, neuroprogenitors and terminal neurons. And we measured H3K27 trimethylation genome-wide at each of these time points. So basically we just apply mana to this histone modifications across time and then these are the predictions of the top transcription factors. So for example, we predict that the transcription factors snail transiently recruits repressive mark H3K27 trimethyl during the neuroprogenitor stage. These are the targets in the string database of snail. And what we see in the enrichment of particular pathways among these targets is that the targets of snail are actually highly enriched for transcription factors. That is the genes that snail binds to and causes to be silenced by a polycomb are highly enriched to be transcription factors themselves. And so maybe this polycomb pathway here in this development during the neural progenitor stage the polycomb is used to silence a whole bunch of transcription factors that you don't want to be activated. We can hypothesize that that is maybe what snail is involved in this system. So this is just to give you one example of how you can use mana with Chipsack data from chromatic modifications. You can also use mana from Chipsack data from transcription factor binding. Here I will show you an example of some analysis that is actually never published where we use data from the lab of Eileen Furlough in Drosophila mesoderm development where she measured the binding of five different transcription factors, genome-wide across different stages in development. And so basically what one notices is that the same region in the genome may be bound by a transcription factor in one stage but not at another stage. So for example, this transcription factor, sorry, this region in the genome is bound by Math 2 at the late time points, but not at the earlier time points. Whereas for example, this region is bound by Tinman, this red transcription factor at the early time points, but not at the later time points. So the same transcription factor will bind to different subsets of promoters at different stages. And we can try to use mana as well to understand why the same transcription factor will bind to the same region in some stages but not other stages. And the hypothesis is that other transcription factors can basically passively cooperate with each other to determine the binding, right? So the conceptual model is that let's say in the early stage, this green transcription factor is expressed and the similar red, sorry, okay, let me start over. I didn't do very well. Good, there's a region here on the genome covered by a nuclear zone in general, right? Because the DNA is covered in nuclear zones that make it inaccessible to transcription factors. And under this nuclear zone are two binding sites for the red transcription factor and one binding site for the green transcription factor. Now in this early stage, the red transcription factor is not expressed and only the green transcription factor is expressed. But by itself, the green transcription factor cannot really compete and displace the nuclear zone and it will not bind to its site. However, in stage two, the red transcription factor is now upregulated. And so both the red and the green transcription factor are both trying to bind to their sites and together they can now out-complete the nuclear zone and bind to their sites. And in this way, the binding of the red transcription factor helps the binding of the green transcription factor. So you can write this down also in a simpler linear model where the binding to a particular region of motif M in sample S, so of this green transcription factor, is basically driven not only by the binding and the number of binding sites for this green transcription factor and the activity of this green transcription factor, but also facilitated by the binding of other transcription factors to the same region. Right, so when you write down this model, you will see that you'll get the same kind of linear model. And so Mara can also be applied to basically find out which transcription factors are helping each other binding to sites as a function of stage. So when we apply this approach to this data, so we, for example, find that for, if you look at MF2, it's binding across the five stages. The most important motif, not surprisingly, is MF2 itself. Okay, so the most important predictor of MF2 binding is MF2 binding sites. However, other important predictors of MF2 binding that are stage specific are binding sites for Tinman, for TTK, and for Zelda. And so, for example, the prediction is that Tinman helps MF2 binding specifically here at the second stage. Whereas Zelda is helping MF2 bind early in development and is not helping MF2 to bind late in development. So it seems Zelda helps binding early in development and not late, this is actually something that is now known in the literature. Whereas this TTK seems to act to repress the binding of MF2, okay? So it seems to oppose binding of MF2. So this is showing you that you could also upload to Mara Chipset data for a particular transcription factor and learn which other transcription factors are helping your transcription factor to bind in a stage specific way. All right, so the final thing that I want to tell you about is how the replicate averaging of motif activities works. So in this example here, these are the estimated motif activities of H and F4 alpha across all samples. And as I told you, there were always measurements done in triplicates. So these were the first triplet, the second triplet, the third triplet. These are all the same time points. So now when you ask Mara to calculate average motif activities, replicate average, basically what you do is you tell Mara which samples are replicates of each other. And so you basically divide the samples into groups where each, where all the samples in one group are considered to be replicates and different groups are considered to be different conditions. And then for each group, we sample the activity in a sample S of this group is the average activity at the group plus some sample specific deviation. And we assume that the sample specific deviations are Gaussian distributed with some standard deviation sigma G. And so what we do is we now for each group optimize this standard deviation sigma G using a Bayesian procedure. And so at the end of the day, the mean activity of a group of samples is given by the weighted average of the motif activities of the samples in the group. And you also get an updated error bar that is given by this expression, all right? So basically what this says is that the smaller the error bars on the motif activities of the samples, the higher the weight they're going to have in calculating the average. And the less the variation of the motif activities of a group in a sample, the smaller the error bar will be for the motif activity of the group. All right, so this is the calculation that Mara does to calculate replicate average motif activities. Okay, and the final thing is I already told you so, so far we've all been looking at how changes in gene expression of genes across the samples are explained by changes in motif activity. But is Mara also fits the average expression in terms of the mean activities of the motif? So I told you about this early on in the presentation. So we model the average expression of each gene averaged over all samples in terms of the mean activities of each motif also averaged over all motifs. All right, so these mean activities when a motif has a high mean activity, it means that targets of this motif tend to have high absolute expression when you average over all samples. And with motifs with low activity, it says those targets of this motif tend to have low absolute expression across all samples. And so in the case of this liver maturation data set, the highest activities apart from H and F for alpha are these transcription factors like NRF1, TAL1. And so the targets of these motifs tend to have high absolute expression across the entire time course. Whereas the targets of these transcription factors tend to have low average expression across the entire time course. So these mean activities tells you something about the motifs that are basically highly active or inactive in all samples. All right, so one thing I wanted to tell you about is that we've been working very hard for the last years to try and adapt all these methods to single cells as you all know there is now a lot of interest in using single cell RNA-seq to measure gene expression in single cells. And basically what we've been working on is adapting our approach so that we can apply MARA to single cells and find the activities of motifs in single cells. Now, because the single cell RNA-seq data is very noisy and has very high sampling noise, there are lots of technical challenges. And so Jeremy Breda, who is one of the tutors today, has been working on overcoming these challenges. And so the first sort of half of the problem we think we've now solved, we've now found a new way of taking care of these very high sampling fluctuations that you have in single cell RNA-seq. And so I want to point you to a nice preprint we have about this on the bio archive and also to a tool which is called sanity that is used to normalize single cell RNA-seq data. And we're currently working on how the output of this sanity normalization of single cell RNA-seq data can be used to apply MARA to single cells. And we sort of have a beta version of this that is already working, but we're still tweaking the details. But if you have single cell RNA-seq data and you're very interested to try this out, you can contact us and we can look whether we can do some tests and run some of the data and see what comes out. All right, so we hope that the official version of single cell MARA will be available not too long now. And the other thing, so somebody asked us about other organisms. So currently we have human mouse rat E. coli. We have implemented zebrafish. So we're interested in users that have test data sets from zebrafish that we can run to confirm and validate that there are no more bugs in our annotations. And second, Drosophila, we're also in the testing stages. And so we're also interested to hear from users that have Drosophila RNA-seq data sets that would be interested to test. And then further, any general feedback, what parts of the analysis you find most helpful or most unhelpful, what are key features that are missing. You can contact Mikhail Pachkov, who is also one of the tutors. And we always love to hear from you with feedback on how useful or unuseful you found various aspects of the tune. All right, so with that, I come to the end. Just the acknowledgments. So today's tutors I already mentioned. So Jeremy has been working on the single-cell MARA. Mikhail is really responsible for the entire web interface and for support. And so Mikhail is also supported directly by the three since the food buy informatics. And I also still want to mention Piotr Balviets, who was the main developer of the ISMARA pipeline when you were still in my lab. All right, with that, I've come to the end. So I thank you all for your attention and I hope it was useful for you. And I'd be happy to take any questions you might have.