Thank you very much, Diana. Thank you very much, Ioannis, for inviting me. It's the first time I'm doing one of these virtual seminars, so I hope I'll do this correctly. I want to talk about two types of data that we're working with a lot: bisulfite sequencing to measure DNA methylation, and RNA sequencing. So there will be two parts, but before we go into that, I've been asked to quickly introduce myself and the group.

On the top you see the people who have been involved in the projects I'm going to present today: Dimos, Lukas, and Maria, who was an undergraduate student and is now in the Netherlands. The DNA methylation projects were all in collaboration with Dirk Schübeler, who is also a group leader at the Friedrich Miescher Institute, and his former postdoc Rabih Murr, who is now a professor at the University of Geneva. I'd like to thank those people for their great contributions to these projects.

This is a summary of the kinds of things we are interested in in my group. We're interested in gene expression, and with that we are mostly thinking up to the level of RNA, not so much the level of proteins, just due to a lack of data: there's not so much proteomics data at our institute. But there is a lot of data coming from all the other layers in this whole process. Just naming a few here: starting from the DNA sequence itself, polymorphisms that affect transcription; then the state of the chromatin, the proteins that are bound to the DNA, the accessibility of the chromatin, the positioning of the nucleosomes, the higher-order structure of the chromatin; then the transcriptional process itself, the translational process, proteins bound to the RNA. So you can see next-generation sequencing is really a huge tool set, and you can look at many different layers in this whole process of gene expression.

The data sets that we're looking at come mostly from our experimental collaborators within the institute, and our questions are as
diverse as the biology that they are studying. So we don't have a strong focus in terms of a biological question or a biological system. We're more focused on the data, on the ways to process it, and on the ways to interpret it biologically.

These data sets have turned out to be extremely rich. They usually measure one thing, but they give you additional information on top. To give you an example, bisulfite sequencing is used to measure cytosine methylation, but you can, for example, also look at the mutual information between neighboring cytosines and whether they're maybe co-regulated by the same underlying process. Or, to give you another example, RNA sequencing is usually done to get just gene expression levels, or steady-state RNA concentrations, but of course you can also use it to find new genes, and you can study promoter or poly(A) site usage and of course transcript structure, just to name a few things. So that's the theme of today's talk: I'd like to give you two examples where we're using these NGS data sets to find additional things beyond what they have actually been generated for.

I will start with DNA methylation. So let's look at DNA methylation patterns. Just as a very quick introduction to DNA methylation.
We're using bisulfite sequencing to measure that. It's a biochemical trick where you treat your DNA with bisulfite, which leads to the deamination of unmethylated cytosines, turning them into uracils, but the methylated cytosines will not be converted. Then, if you do PCR and sequencing, you basically see Ts in the places where you had the unmethylated cytosines. And of course you can't just align those reads back to the genome like normal reads, because you would have too many mismatches in that case. So the method we developed a few years back, and actually a lot of other people also developed similar methods, is to remove all the Cs, by converting them to Ts, both from the genome and the reads, basically doing the alignment in a three-letter alphabet space. Then, once you have the alignment positions, you can put back the Cs and read out the unmethylated or methylated states, or sequencing errors for that matter. If you're interested in that kind of thing, you might want to have a look at our QuasR package, which implements these alignment strategies among other things.

So what does methylation look like? I'm only talking about mammalian systems here, because that's what the Schübeler lab is studying. In mammalian systems you have genome-wide DNA methylation occurring only, or almost only, in the context of CpG dinucleotides. The genome is also depleted for those dinucleotides, except for the so-called CpG island regions. And if you look at the default methylation state of the CpGs, typically they're unmethylated in the CpG islands, indicated here by this yellow color, and they're methylated everywhere else. What has been known for a long time is that these CpG islands tend to occur close to promoter regions or transcription start sites, and if you methylate those regions they usually repress the nearby genes. So they're thought of as regulatory regions in the genome. About two-thirds of the genes actually have such a CpG island
very close to the transcription start site, and initial knowledge about the role of DNA methylation in mammals has been really focused on these CpG islands. But before the era of NGS, we didn't really know what methylation would look like genome-wide. So that's why, a few years back, together with the Schübeler lab, we were generating genome-wide data sets using this bisulfite sequencing method.

Here you see what the CpG methylation patterns look like for a short piece of, in this case, mouse chromosome 10. Each dot that you see here, along a stretch of about six or seven megabases, represents one given CpG, and on the y-axis you see the methylation level, which is just the fraction of methylated over total reads that we get for that CpG. You can see that the default state is really to be fully methylated; most of the CpGs can be found up here. That's in embryonic stem cells. Sorry, it's not mouse, it's human actually. You see the average level is maybe around 90 percent methylation, and in between you can see these stripes, which are nothing else than groups of nearby CpGs that show lower levels of methylation. If you plot the location of the CpG islands that I just mentioned, you see the stripes often correspond to these CpG islands. There are, however, in addition to these two states, these regions here, which are also stripes, so also regions of local hypomethylation, which do not correspond to CpG islands. That was kind of a new observation that wasn't known at the time, and we were of course wondering what they would correspond to.

If I systematically identify those regions genome-wide and plot the number of CpGs per region versus the average methylation in the region, you can see that you get really two populations of regions. This is one of the few examples that I've ever seen in biological data where you truly get a bimodal distribution, so that you can really say there are two different things
here. One population is the well-known CpG islands, or completely unmethylated regions, where you have lots of CpGs and really no residual methylation. And then there are these other stripes, these new things that we discovered, the low-methylated regions or LMRs, which have on average a bit more methylation and only few CpGs. Actually, they have the low CpG content that you find in most of the genome outside of CpG islands, and they also tend to be far away from genes. They don't enrich in promoters like these CpG-island-like regions do.

So what are they? We started comparing them to different other types of data. Here, for example, you see a zoom into a smaller region of the genome that has one CpG island and four of these LMRs. If you compare that to DNase I sequencing data, which measures accessibility, you can see that each of these regions basically corresponds to a peak in the accessibility profile. So, to put it very simply, the CpG islands plus these additional regions pretty much correspond to any accessible region in the genome. And when you look at histone modifications and also transcription factor binding, you find a lot of these things also enriched in these regions. So you have these LMRs here; in the middle would be the CpG islands. Especially the histone modifications told us that these regions could actually be enhancers. In particular, you see they have a lot of K4 monomethylation, which is typical for enhancers and usually not found at promoters, and the K4 trimethylation, which is a typical CpG island or promoter mark, is usually low in enhancers. So this profile fit very well with what people knew at the time about enhancer regions.

So we were basically working on the hypothesis that what we had discovered there were actually active enhancers. To put this to the test, Rabih Murr used a reporter construct where a luciferase reporter gene is driven by a weak promoter, and he cloned 12 of these elements that we discovered into that
construct, and indeed 10 or 11 of these show a boost in the luciferase activity, kind of confirming that these really have promoter or enhancer activity in the ES cells.

This is another way to test this, because of course we were wondering why these regions are hypomethylated. Is it the binding of the transcription factor that causes it, or is it maybe the other way around, that they are somehow demethylated, giving access and allowing the transcription factor to bind? Experiments like this one here convinced us that it's really the transcription factor that causes it. We took a piece of E. coli DNA, and if you bring that into the genome of an ES cell, by default the CpGs that are contained in it will be fully methylated, indicated here by this reddish color. If you plant a transcription factor binding site motif, in this case for CTCF, you see we can detect binding of CTCF to this element by ChIP, and we see that the CpGs are now low in their methylation level. But if you plant the mutated motif, so just swapping two base pairs, the binding is almost abolished, and so is the hypomethylation. So we think the binding of the transcription factor is necessary and sufficient to create this. How it actually works mechanistically, we're still looking into that; we still don't know. But we looked at several transcription factors.
We also used the polymorphisms that are contained in our cell line. We have a few binding sites, here again with the CTCF motif, that have heterozygous single-nucleotide variants, and those allow us to classify both the bisulfite sequencing and the ChIP sequencing reads, to measure them allele-specifically. What we do there is basically compare the difference in CTCF binding between the alleles with the difference in methylation: if you get increased binding, you get decreased methylation, and vice versa. So you get this expected anti-correlation between binding and methylation.

And of course, if you accept that it's the binding of the transcription factors that creates these elements, it's not surprising to see that if you compare different cell types that express different transcription factors, you will find some elements that are cell-type specific, and if you look into the sequences of these elements, you find motifs for transcription factors that are also expressed in a cell-type-specific manner. So basically you can take those elements, look for motif enrichments, and you can identify, for example, in ES-cell-specific elements that there is an over-representation of motifs for pluripotency transcription factors like Sox2 or Oct4. Or if you look in hematopoietic stem cells, you find some of the factors that have already been known to play an important role for these cells. So in a way it's also a possible way to find out the transcription factors that are important for cell identity.

That's an older story, so I don't want to spend too much time on it, unless you have any questions. I'd just like to mention very quickly that, in addition to the three types of elements that I just described, in some methylomes we find a fourth type of pattern, which looks very different.
So here on the top you see again the human ES cell methylome, which is what we call a nice methylome. Below that you have IMR90 fibroblasts, and you can easily see they look very different. They have these regions that were first described by Lister in 2009 and have been termed partially methylated domains, because if you calculate an average methylation level, it's around 50% in these regions, compared to the 90% you find outside. So I guess that's why he called them partial. We don't think it's a good name, because if you look at them, the individual CpGs are actually not partially methylated: some of them are fully methylated and others are not methylated at all. Here you see the distribution of the methylation levels in these IMR90 PMDs. It's almost a uniform distribution of methylation levels.

That struck us as strange, because while people were talking about chaotic and unregulated methylation, this doesn't look random to us. For something that is random, you would expect more scatter around an average methylation level, because the cells wouldn't agree with each other. So when you look at all cells and calculate the population average, you should get something that is maybe more like a Gaussian, maybe somewhere around 50%. That's not at all what we're seeing. We're seeing CpGs that are a hundred percent methylated, which means all the cells agree that this CpG should be methylated, and just the neighboring CpG can be at 0% methylation, again meaning all the cells agree on the methylation state.
That's not random at all. So we started comparing these PMD regions across different cell types, and the surprising finding was that the methylation levels are highly correlated. The correlation coefficients are as high as they can get given the error bars in our methylation measurements. Here you see the comparison between IMR90 and foreskin fibroblasts, with a correlation of about 0.8.

So we were wondering what drives this kind of pattern. You could say the LMRs I described in the first part were driven by the binding of transcription factors, so in the end it was the underlying DNA sequence that recruits the factors. I think also in this case it's actually the underlying sequence. This has also been published, so if you're interested in the details you can read up on it. You can predict these methylation levels with sequence features, for example with the local content of dinucleotides, and we think it probably reflects the preference of the methylation machinery, which surfaces in these regions that are usually thought of as being inactive and not very well maintained, probably because they're heterochromatic. So the methylation is not maintained as actively as it is in the intervening regions. This part we also implemented in MethylSeekR, our package that allows you to find these regions genome-wide if you have bisulfite sequencing data.

With that I'd like to summarize the part about methylation. Basically, I showed you two kinds of patterns that are in the end caused by the DNA sequence: the first one by the binding of the transcription factors; the second one was maybe more fun to find, but biologically maybe not so relevant, though it probably shows the preference of the methylation machinery. So these would be first examples of what you can pull out from NGS data sets in addition to why people generate them in the first place.
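The argument about partially methylated domains, that unregulated methylation would give population averages scattered around 50% while the data show individual CpGs at 0% or 100% that agree between cell types, can be illustrated with a small simulation. This is only a sketch under invented assumptions: the numbers of CpGs and cells, the uniform distribution of per-CpG preferences, and all names are hypothetical and not taken from the published analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cpgs, n_cells = 2000, 200  # hypothetical sizes, chosen only for illustration


def population_levels(p_per_cpg):
    """Fraction of methylated cells per CpG when each cell methylates
    CpG i independently with probability p_per_cpg[i]."""
    return rng.binomial(n_cells, p_per_cpg) / n_cells


def extreme_fraction(levels):
    """Share of CpGs whose population level is below 10% or above 90%."""
    return float(np.mean((levels < 0.1) | (levels > 0.9)))


# Scenario A: unregulated methylation, every CpG is a coin flip in every cell.
p_random = np.full(n_cpgs, 0.5)
random_a = population_levels(p_random)
random_b = population_levels(p_random)  # a second, independent "cell type"

# Scenario B: each CpG has its own preferred level (assumed uniform here),
# set by something shared between cell types, such as the local sequence.
p_sequence = rng.random(n_cpgs)
seq_a = population_levels(p_sequence)
seq_b = population_levels(p_sequence)

corr_random = float(np.corrcoef(random_a, random_b)[0, 1])
corr_sequence = float(np.corrcoef(seq_a, seq_b)[0, 1])

print("A: spread", random_a.std(), "extreme", extreme_fraction(random_a), "corr", corr_random)
print("B: spread", seq_a.std(), "extreme", extreme_fraction(seq_a), "corr", corr_sequence)
```

In scenario A the levels cluster tightly around 50% and do not correlate between the two populations; in scenario B they spread over the whole range and are reproducible between populations, which is the pattern described for the PMDs.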
Yes? Good question. I should repeat the question for the people who are listening remotely. The question was whether the IMR90 cells have been looked at using Hi-C, or basically any technique that allows you to see the higher-order chromatin structure. That's a good question. I actually don't know, but that's something we should check. We know that the PMDs tend to correspond to the TAD structure, but I don't know if that's the case in IMR90. I would expect so, actually. Yes.

Okay, so with that I'd like to go now to the second part, which is more recent and which deals with RNA-seq data. If you think of the RNAs in a cell, part of them you will find in the nucleus, and part of them are actually in the process of being matured: maybe partial transcripts still containing the introns, maybe the lariat that has been spliced out, or partially spliced forms. Then they will be exported, and probably most of the molecules you will find in the cytoplasm. If we do RNA-seq, we grind up the whole cell and we get reads, or counts, from all of those species.

A lot of people are studying the different steps in this process, and of course they have even developed specific methods to measure particular steps, like global run-on sequencing, or nascent-seq, which precipitates just the RNA that still sticks to the chromatin and is supposed to be very fresh, newly made RNA, or cell fractionation, where you separate nuclei from cytoplasm and sequence them separately. That's experimentally challenging, or let's say at least more difficult than just doing the standard RNA-seq that you might usually do. So if you think about measuring transcription: I just mentioned global run-on sequencing, where you pulse the cells with a nucleotide analog that you can then use to pull out the freshly made RNAs; or nascent-seq, where, as I said, you precipitate the transcripts still sticking to the chromatin; or the cellular fractionation
techniques. They are experimentally definitely more involved than standard RNA-seq. But there were indications that the information you get with these kinds of experiments could, in part at least, also be learned from standard RNA-seq data. There was an older paper by Amit Zeisel where he used probes on Affymetrix exon arrays that locate to intronic regions to get an estimate for the transcription. Later on there was a paper by Ameur et al. where they used RNA sequencing data and looked at the read coverage within the introns, giving evidence of co-transcriptional splicing because it tended to increase along the intron. There was even a very sophisticated dynamic model fit to RNA-seq data, where people were trying to estimate transcription rates, splicing rates, and all of these things. That turned out to be maybe going too far, because there wasn't enough data, so in that paper they suggest making groups of hundreds of genes to get the power to fit these parameters. But it was obvious that if you don't go that far, if you take a step back, you might be able to get it from standard RNA-seq data for single genes. And that was Dimos Gaidatzis in my group, who discovered oscillating genes and could prove that it's a transcriptional oscillation by looking at intronic reads.

So what do we mean by taking a step back? Basically, we're doing this in a contrast setting. We're comparing two conditions, like wild type and knockout, or healthy and diseased, or treated and untreated, and we're splitting the RNA-seq reads that we're getting into two groups: the ones that map to exons and the ones that map to introns. Then we're calculating a fold change between the two conditions separately at the level of introns and at the level of exons, and in the end we compare these two fold changes. So what would this look like in an experiment where we know what's going on?
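The exon versus intron comparison just described can be sketched in a few lines. This is a minimal illustration, not the published implementation: the toy counts, the gene labels, the pseudocount of 8, and the assumption that counts are already normalized for sequencing depth are all made up for the example. Obtaining the exonic and intronic counts from an alignment and an annotation is not shown.

```python
import numpy as np


def log2_fold_change(counts_a, counts_b, pseudocount=8.0):
    """log2 fold change of condition b over condition a.

    Counts are assumed to be already normalized for sequencing depth.
    The pseudocount (an arbitrary choice here) damps fold changes for
    genes with few reads.
    """
    a = np.asarray(counts_a, dtype=float)
    b = np.asarray(counts_b, dtype=float)
    return np.log2(b + pseudocount) - np.log2(a + pseudocount)


# Toy counts for three hypothetical genes:
#   gene 0: unchanged
#   gene 1: transcriptionally upregulated (introns and exons both go up)
#   gene 2: post-transcriptionally downregulated (only exons go down)
exon_control = [1000, 1000, 1000]
exon_treated = [1000, 4000, 250]
intron_control = [200, 200, 200]
intron_treated = [200, 800, 200]

delta_exon = log2_fold_change(exon_control, exon_treated)
delta_intron = log2_fold_change(intron_control, intron_treated)

# Genes on the diagonal (difference near 0) look transcriptionally regulated;
# genes off the diagonal suggest a post-transcriptional effect.
difference = delta_exon - delta_intron

for gene, (de, di, d) in enumerate(zip(delta_exon, delta_intron, difference)):
    print(f"gene {gene}: delta exon {de:+.2f}, delta intron {di:+.2f}, difference {d:+.2f}")
```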
Let's take this one here. It's an experiment where people have stimulated cells with dexamethasone, and we're looking here at five genes that are known targets of the glucocorticoid receptor, which are known to be transcriptionally upregulated when you stimulate the cells. You can see that for all of these five genes you see that upregulation at both levels: you see the intron going up, and you also see the exon going up, which is what you would expect for a transcriptional process. If you look at experiments where there is no transcriptional change but there is a post-transcriptional effect, like transfection of siRNAs, we don't expect the transcription of the gene to change, because the siRNA would presumably only hit the mature RNA in the cytoplasm, knocking down the steady-state level without affecting the transcription. And that's exactly what we see: a decrease of the exonic levels, but hardly any change at the level of the introns. So this is kind of the overall summary of the second part of the talk. If you remember that, then you've got basically everything.

So let me show you that in a bit more detail. We're looking at many different systems to see if this generalizes. This one is from a paper where they studied macrophage activation using lipid A in vitro, and where they separated the chromatin-associated RNA from the cytoplasmic RNA and sequenced them separately. In that paper, in Cell, they discovered different groups of genes that have different response profiles after the stimulation. For example, these would react a bit earlier.
These would react a bit later. And you can see that usually the reaction is seen first at the level of chromatin in the nucleus and only later in the cytoplasm. Now, luckily, they also did total RNA sequencing of the unfractionated cells, so we took that data and split it in silico into exonic and intronic. And you can see that the exonic fraction follows more the cytoplasmic profile, while the intronic fraction follows more the nuclear profile. So we get that without doing the experimental separation.

Let's look at yet another example: mouse circadian rhythm. In the liver there are known to be about a thousand genes that actually cycle with a daily rhythm. We took a data set where they did nascent RNA-seq, meaning sequencing the chromatin-associated RNA, and we detected oscillating genes in that nascent RNA-seq data set. We found about 800 using our cutoffs, and we're sorting them here by their phase, so basically by when during the day they peak in their expression, from early to late. If you look at the levels of RNA-seq reads in introns and exons for the same genes in the same order, then you get these heatmaps.
You see the different samples that they've taken, with a sampling interval of about four hours, and you can see those cycling genes. You can see two waves of expression, because the overall data set spans two days. And if you look very closely, you see even more than that. Look, for example, at the genes down here: they have their peak expression at the second time point if you look at the level of introns, but if you look at the level of exons, the peak is actually a bit later, maybe between the second and the third sample. So you can basically see a delay between the signal in the introns and the signal in the exons, indicating that there is a time in between. Probably we could explain this by the time it takes for an mRNA, once it's upregulated in the nucleus, to be matured and exported, until we see the rise of its cytoplasmic level. If you quantify this delay versus the nascent-seq, which directly measures nascent transcription, you see that the intronic level is basically not shifted, or if anything even earlier than what they get with nascent-seq, while the exonic level is almost two hours later. We were a bit surprised that it's such a long time, so we looked at additional circadian rhythm data sets, and we found similar estimates. So in most cases the exonic level was lagging one and a half to two hours behind the intronic level.

That's a third example I'd like to show you, again supporting the notion that if you look at intronic changes, what you're seeing is transcription rate changes. It's a data set of fibroblasts stimulated with TNF-alpha, and they performed GRO-seq in this case. We compared the change in GRO-seq, which should directly measure transcription rate changes, with the change in the introns, and you see you get a very nice correlation. The correlation is less linear, and it's also less prominent, if you compare to exonic changes.
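One way to put a number on the delay between intronic and exonic signal in the circadian data is a harmonic regression: fit a 24-hour cosine to each time series and compare the fitted peak times. The talk does not say how the delay was actually quantified, so the sketch below is an assumed approach; the function names, the fixed 24-hour period, and the synthetic noise-free profiles are all hypothetical.

```python
import numpy as np


def peak_time(times_h, values, period_h=24.0):
    """Least-squares fit of values ~ m + a*cos(w*t) + b*sin(w*t).

    Returns the time of the fitted peak in hours, within [0, period_h).
    """
    t = np.asarray(times_h, dtype=float)
    w = 2.0 * np.pi / period_h
    design = np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])
    _, a, b = np.linalg.lstsq(design, np.asarray(values, dtype=float), rcond=None)[0]
    # a*cos(w*t) + b*sin(w*t) equals R*cos(w*(t - t_peak)) with w*t_peak = atan2(b, a)
    return float((np.arctan2(b, a) / w) % period_h)


def delay(first_peak_h, second_peak_h, period_h=24.0):
    """Signed delay of the second peak after the first, wrapped to [-period/2, period/2)."""
    return (second_peak_h - first_peak_h + period_h / 2.0) % period_h - period_h / 2.0


# Synthetic example: samples every 4 hours over two days, as in the talk.
times = np.arange(0, 48, 4)
w = 2.0 * np.pi / 24.0
intron_profile = 5.0 + 2.0 * np.cos(w * (times - 6.0))  # peaks at hour 6
exon_profile = 7.0 + 1.5 * np.cos(w * (times - 8.0))    # peaks at hour 8

intron_peak = peak_time(times, intron_profile)
exon_peak = peak_time(times, exon_profile)
exon_lag = delay(intron_peak, exon_peak)
print(f"intron peak {intron_peak:.2f} h, exon peak {exon_peak:.2f} h, lag {exon_lag:.2f} h")
```

With a four-hour sampling interval, a fit like this can resolve delays shorter than the interval because it uses all time points of a gene at once rather than just the sample with the highest value.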
So taking these things all together, I think they support the idea that you can really see transcriptional changes, transcription rate changes, by doing this exon versus intron analysis.

So let's think about post-transcriptional changes for a moment. The simplest model you could assume about how a gene is regulated would be this, right? It doesn't get simpler than that. The change of the mRNA level is determined by the transcription rate, what's newly made, minus what's degraded, and let's assume there is just one rate for each of these two processes. At steady state, so when the amount that is made equals the amount that is degraded, you get this relationship: the steady-state level should be the ratio of these two rates. And since we're looking at this in log space, at log fold changes, that basically means that the log fold change of the steady-state level is the log fold change of the transcription rate minus the log fold change of the degradation rate. Now, I've just shown you that the transcription rate change we can measure by getting the log fold change in the introns, and that the overall exonic level corresponds to the steady state you have in the cytoplasm. So basically, if we subtract these two, we should get the change in half-life, or the change in, sorry, post-transcriptional regulation of that gene. Half-life is of course inversely proportional to the degradation rate.

So let's see if that actually holds up if you look at data. We didn't find many data sets where we had both RNA-seq and measurements of half-lives, but we had one such data set in house. It's again from the Schübeler lab, who differentiate embryonic stem cells to neurons in vitro, and here you see the intronic versus exonic changes across this differentiation.
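The steady-state argument from a moment ago can be written out as follows. The symbols are assumptions introduced here, since the slide itself is not part of the transcript: $M$ is the mRNA level, $\alpha$ the transcription rate, $\beta$ the degradation rate, and $t_{1/2} = \ln 2 / \beta$ the half-life.

```latex
\begin{align*}
\frac{dM}{dt} &= \alpha - \beta M
  && \text{(newly made minus degraded)} \\
M_{\mathrm{ss}} &= \frac{\alpha}{\beta}
  && \text{(steady state, } dM/dt = 0\text{)} \\
\Delta \log_2 M_{\mathrm{ss}} &= \Delta \log_2 \alpha - \Delta \log_2 \beta
  && \text{(log fold change between two conditions)} \\
\Delta\mathrm{exon} - \Delta\mathrm{intron}
  &\approx \Delta \log_2 M_{\mathrm{ss}} - \Delta \log_2 \alpha
   = -\Delta \log_2 \beta
   = \Delta \log_2 t_{1/2}
\end{align*}
```

The last line uses the two approximations stated in the talk, namely that the exonic change reflects the steady-state level and the intronic change reflects the transcription rate.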
You see that it's really highly correlated, indicating that most of the things that are going on in this differentiation are actually regulated at the transcriptional level. Nevertheless, you see there is some spread around the diagonal, so maybe these small residuals could be indicative of post-transcriptional regulation. So what we do is subtract the two and correlate the difference with the changes in half-life that were experimentally measured using actinomycin D inhibition of transcription. It looks pretty bad, I admit, but nevertheless it's a significant correlation. What kind of convinced me that it's not trivial to see this is the fact that we get this correlation coefficient of about 0.3 when we compare half-lives to delta exon minus delta intron, we get a much lower correlation when we just compare to delta exon, and we get no correlation at all with delta intron, which of course you wouldn't expect, because it's post-transcriptional regulation in the cytoplasm versus transcription in the nucleus. That's not trivial at all, because overall delta exon and delta intron are very highly correlated with each other. We think it's a rather low correlation coefficient because it's technically difficult to measure half-lives.

Taking this one step further: where can we use that kind of information? For example, if you're after the identification of microRNA targets, this is usually done by transfecting a microRNA into a cell and then measuring the RNA. You would expect that the targets of that microRNA are downregulated, and that's exactly what you see here. Again, I'm showing you the change at the intronic level versus the change at the exonic level. You see most genes don't react 12 hours after transfection of the microRNA. Some genes get downregulated, but only in the cytoplasm.
So those would be your microRNA targets. In this data set, they also looked at the RNA a bit later, namely 32 hours after transfection, and you can see that the picture is quite different here. We still have the microRNA targets down here, and it's actually the same genes. But now, in addition, we start to see the appearance of a correlation here; maybe you can see that some genes are starting to migrate out along the diagonal, so correlated changes at intronic and exonic levels. What does that mean? We think it's actually secondary effects, which tend to be transcriptional. What supports this assumption is that if you look at the local density of seeds, so genes that have a seed match for miR-1, you see that they tend to be down here, also here, but not at all along the diagonal.

If that's true, that would mean we should be able to discriminate primary from secondary targets by playing this exon-intron game, and that's kind of what you see here. This is a larger data set where they transfected two microRNAs in four different cell types, and we ordered them here kind of from no secondary effects to more and more secondary effects. We also checked how well we would enrich for seed-containing transcripts if we select them based on delta exon minus delta intron, which is our proposed method to measure post-transcriptional effects, or if we just select them based on delta exon, which is the black line and which is what people usually do in the literature. You see that in the absence of secondary effects, you actually don't do better by doing delta exon minus delta intron. In this case you do worse, significantly worse, probably because you're just adding noise to your signal. There are no secondary effects.
So you don't need to correct for anything. But as soon as you do have secondary effects, you see you do as well as, or even significantly better than, just using delta exon to select your microRNA targets. And this is kind of an extreme case: the genes that change most on the exonic level are actually not microRNA targets in this experiment.

The question was if there could maybe be a gradient of ploidies in these cells. That's a good question. I don't know. We were wondering why, if we order them this way, basically by the correlation coefficient between delta exon and delta intron, we get the two microRNAs from the same cell type together. I don't think it's by chance, but what actually drives this? One thing we also assumed was that it could be the speed of metabolization, how quickly the microRNAs are taken up and integrated into RISC, that differs between these cells and then kind of creates an asymmetry in how fast the targets react. But yeah, we don't really know.

Since we were talking about microRNAs, let me show you a last piece of data about those. People have argued that you should be able to learn microRNA expression just from RNA-seq or RNA expression, because in the tissues that have a high level of a microRNA, the targets should be low. So if you do the analysis across many tissues, you should kind of learn which microRNAs are high in a specific tissue. That kind of worked for some people, and other people claimed it doesn't work. So we repeated that very simply by fitting a linear model, using the number of predicted microRNA target sites per gene as our regressors, and as a response variable either delta intron, or delta exon, or, what we think should work best, delta exon minus delta intron. So the parameters that we're fitting are essentially the predicted expression levels, or activities, of a microRNA in a given tissue. I'm showing you just an example for three microRNAs, or families of microRNAs that share the same seed: miR-124
here, which is known to be highly and specifically expressed in brain. You see that we really get the most significant coefficient for delta exon minus delta intron. You kind of also see it in delta exon, but you don't see anything at all in delta intron. Or we have miR-1, which is expressed in heart and skeletal muscle; in this case both delta exon and delta exon minus delta intron work. Or here miR-122, which is a liver-specific microRNA, which we can also detect using this approach.

So with that I'm almost at the end. If you want to calculate p-values, and some people really like p-values, we were suggesting in the paper, actually forced by the reviewer, that one could maybe use the GLM framework to do this, as it's implemented in edgeR. Without going into details, what we're basically doing is fitting two models: one that explains the observed expression data just in terms of the conditions, in this case ES cells versus terminal neurons, plus whether the counts are coming from exons or introns of any given gene. And we're comparing that to a model that has in addition an interaction term, which makes the model able to account for changes that are different in exons than in introns. So geometrically, what we're basically modeling with this term is that we're giving the genes the freedom to move away from the diagonal, to decouple the exonic changes from the intronic changes. And, as I showed you in the last plots, that's how you see a post-transcriptional effect, right? So with that you can calculate p-values, and if I color them in this scatter plot, you see these would be the ones that this model says are significantly post-transcriptionally regulated. I think there are a few drawbacks. For example, this model fits a mean-variance relationship and assumes that it's the same in introns and in exons, which, if you look at it, is not too far off, but it's probably not true. So maybe there is room for improvement here.
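To make the role of the interaction term concrete, here is a deliberately simplified sketch for a single hypothetical gene. It uses ordinary least squares on log2 values instead of the negative binomial GLM on raw counts that edgeR fits, and the numbers, replicate layout, and variable names are invented. The point is only to show that the interaction coefficient equals delta exon minus delta intron, and that the model with the interaction fits better when the exonic and intronic changes differ.

```python
import numpy as np

# One hypothetical gene, log2 of normalized counts, two replicates per group.
condition = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = ES cells, 1 = neurons
region = np.array([0, 0, 1, 1, 0, 0, 1, 1])     # 0 = intron, 1 = exon
log_counts = np.array([5.0, 5.2, 8.0, 8.1, 6.0, 6.1, 7.9, 8.1])


def fit(design, y):
    """Ordinary least squares; returns coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return coef, rss


ones = np.ones(len(log_counts))
# Reduced model: condition effect plus exon/intron effect, no interaction.
reduced = np.column_stack([ones, condition, region])
# Full model: adds condition x region, which lets exons change differently from introns.
full = np.column_stack([ones, condition, region, condition * region])

coef_reduced, rss_reduced = fit(reduced, log_counts)
coef_full, rss_full = fit(full, log_counts)


def group_mean(cond, reg):
    return log_counts[(condition == cond) & (region == reg)].mean()


delta_intron = group_mean(1, 0) - group_mean(0, 0)
delta_exon = group_mean(1, 1) - group_mean(0, 1)

print("interaction coefficient:", coef_full[3])
print("delta exon - delta intron:", delta_exon - delta_intron)
print("residual sum of squares, reduced vs full:", rss_reduced, rss_full)
```

A proper analysis would compare the two models with a statistical test on count data, for example a likelihood-ratio test in the GLM framework mentioned in the talk, rather than just comparing residuals.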
That's just a suggestion.

Something else that we were surprised by: you would think that in poly(A)-selected RNA-seq data sets you shouldn't find so many reads mapping to introns. In fact, we still do. As you can see here, I'm plotting the number of reads you have in a data set versus the number of genes you can quantify using some arbitrary cutoff, comparing poly(A) data sets with total RNA-seq data sets. They're actually not that different, for reasons that we don't really understand. But it's basically just to encourage you: if you only have poly(A)-selected data, this approach may still be applicable to your data.

And with that, I'm at the end. I hope I could convince you that it's worth looking at both exons and introns, especially if you're interested in transcriptional or post-transcriptional regulation, and I would like to thank you for your attention.