Good morning. I'd like to thank the organizers for this opportunity to speak with you and to convey our vision for the integration and application of ENCODE data in differentiating blood cells. VISION is actually a project; the name comes from Validated Systematic Integration of epigenomics data in hematopoiesis. Our motivation is that there are large numbers of data sets from ENCODE, other consortia, and lots of individual labs that are available to researchers, but it's difficult for them to effectively incorporate that information into their research. Our VISION project aims to integrate these heterogeneous data sets systematically and produce useful resources for the broader community. I'll be focusing mostly on the candidate regulatory elements. The work I'll be describing is led by Cheryl Keller and Guanjue Xiang in my group, and part of this work was published earlier this year in Genome Research.
This project, which is funded by NIDDK, consists of multiple laboratories around the world. We're working to acquire epigenetic information, either through our own experiments or by mining the work of other laboratories and consortia, including ENCODE; to integrate that information to predict regulatory elements and give you a painting of the regulatory landscape; to turn that integrative information into predictive models for gene regulation, predicting how each regulatory element is impacting target genes; and then to test those predictions by genome editing, evaluate the results, and try to improve our modeling. We're doing this in both human and mouse.
Here's an example of the systems we're working with: this is mouse hematopoiesis. Hematopoiesis is the production of all the blood cells that you have circulating in your blood. These are all very different from each other, with very different abundances and very different functions.
They all come from a common hematopoietic stem cell that, during differentiation, progressively decreases its potentiality through various progenitors, and then along each lineage you have an expansion and maturation going on. So we're looking at both the mature cells and these progenitors, using data from our own laboratories, from Ido Amit's lab, from ENCODE, and from others, and trying to fill in this matrix of histone modifications, nuclease accessibility, the structural protein CTCF, and RNA-seq. We don't have a full matrix, but the missing data can be handled reasonably well by our integrative system, which comes from Dr. Yu Zhang, who developed it when he was on our statistics faculty at Penn State.
This is a two-dimensional segmentation done by a system called the Integrative and Discriminative Epigenome Annotation System (IDEAS). The object is to assign each genomic interval to an epigenetic state, which is itself a unique and commonly occurring combination of epigenetic signals such as histone modifications, nuclease accessibility, and so forth. What is powerful about IDEAS is that it does the segmentation jointly, learning both along the genome and across cell types. This preserves position-specific information and gives you greater precision in finding cell type-specific events.
It treats the data as a quantitative variable rather than a binary one, and it can handle missing data in an elegant way; this is explored in this paper from Zhang and Mahony. And this is our most recent package, which you can access via that bioRxiv article.
Now, when you take all of that epigenetic data that I showed in the data matrix and put it through IDEAS, IDEAS learns that there are about 27 states. Those states are characterized by particular combinations of features (again, nuclease accessibility and the various histone modifications): not only the combination, but also the amount of contribution that each of these makes to the output. You can see several flavors of promoters, flavors of enhancers, different repressive regions, and different transcribed regions.
When you then assign those states to each DNA interval across all of these blood cell types (the progenitors, the erythroid cells, the lymphoid cells), you get a very informative painting of the epigenetic landscape. You can easily see broadly expressed genes with their common epigenetic profiles across cell types. You can see more lineage-specific genes, with patterns of active regions that we associate with promoters or enhancers (these orange colors for enhancers) restricted to particular cell types. In fact, you can see that these epigenetic state assignments are supported by orthogonal data such as coactivator binding, which was not part of our training, but you see it binding around these predicted enhancers. And, independently, these regions have been examined for activity in transgenic mice, and they actually are active.
So that's looking pretty good, and we're utilizing this epigenetic painting, the segmentation, in many ways. The thing I want to focus on now is not the entire genome, but rather making some discrete calls for likely regulatory elements: candidate cis-regulatory elements (cCREs). With this integrative modeling, it's a fairly straightforward procedure. We find all of the peaks called by chromatin accessibility assays, ATAC-seq and DNase-seq, and then we just ensure that those peaks are not solely in quiescent regions of the genome. Then we've got our cCRE calls, and we also know what the epigenetic state of each of these elements is across the cell types. Excuse me.
In both mouse and human blood cell types we get slightly over 200,000 of these elements. You can see similar patterns in mouse and in human for the same locus that we were just looking at: a substantial accumulation of these elements both within and between the genes.
Looking more broadly, we ask how our collection of VISION cCREs compares with earlier catalogs of blood cell enhancers, and how it compares with the larger collection of cCREs from phase 3 of ENCODE, which of course has many more elements because it's looking at many more cell types. The overlaps that you can see in this UpSet plot are actually pretty supportive.
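The calling rule described in this passage (take accessibility peaks, then drop those that sit solely in quiescent regions) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Peak` type, the `"quiescent"` label, the state names, and the idea that each peak carries one state label per cell type are all hypothetical simplifications, and "solely quiescent" is read here as "quiescent in every cell type".

```python
from typing import Dict, List, NamedTuple


class Peak(NamedTuple):
    """An accessibility peak (hypothetical minimal representation)."""
    chrom: str
    start: int
    end: int


# Assumed label for the quiescent state; real state names may differ.
QUIESCENT = "quiescent"


def call_ccres(peaks: List[Peak],
               states_by_peak: Dict[Peak, Dict[str, str]]) -> List[Peak]:
    """Keep peaks that are non-quiescent in at least one cell type."""
    ccres = []
    for peak in peaks:
        states = states_by_peak.get(peak, {})
        # A peak with no state information is treated as not callable.
        if any(state != QUIESCENT for state in states.values()):
            ccres.append(peak)
    return ccres


# Toy input: two peaks with state labels in two cell types.
peaks = [Peak("chrX", 100, 300), Peak("chrX", 900, 1100)]
states_by_peak = {
    peaks[0]: {"HSC": "quiescent", "erythroblast": "enhancer"},
    peaks[1]: {"HSC": "quiescent", "erythroblast": "quiescent"},
}
ccres = call_ccres(peaks, states_by_peak)
```

In this toy input only the first peak survives, because the second is quiescent in both cell types. Keeping the per-cell-type state labels alongside each retained peak mirrors the point that the state of every element is known across the cell types.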
I think there are about 60,000 cCREs that are found in almost all of these collections, so you should feel that these are very strongly supported. Of the other elements in VISION, a whole lot of them also match up with data that are in the larger ENCODE collection. So, as I said, almost 60,000 of them there. I apologize for my voice here. And as I mentioned before, you expect there to be a lot of elements that are found by ENCODE alone, because they're examining cell types way beyond the blood cell types that we're looking at.
One of the several things that we've been doing with these epigenetic states is to use them as input to estimate regulatory potential based on the epigenetic signatures, where that potential is the impact of each element on a target gene. Our first approach to this was a classic multivariate regression analysis. What we wanted to do is to explain the different expression levels of genes across cell types. Here's a gene A and a gene B. We've already made calls for where the regulatory elements are, and we know what the profile of epigenetic states is in each of these cell types. So we ask how well these states explain the expression, and we look separately at proximal and distal elements. We convert this categorical variable into a continuous one just by letting our X be the proportion of the pooled cCREs that is covered by each of the states, a fairly simple methodology. The multivariate regression then learns estimated values for these betas, the coefficients.
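The regression setup just described can be sketched in a few lines. This is a minimal sketch, not the project's actual model: it uses ordinary least squares from NumPy, made-up toy data, three hypothetical states, and a single set of elements rather than separate proximal and distal fits. Each row of `X` holds the proportion of pooled cCREs covered by each state for one gene in one cell type, and `y` is the corresponding expression level.

```python
import numpy as np


def fit_state_coefficients(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares estimate of one coefficient (beta) per epigenetic state.

    X: (n_observations, n_states) state coverage proportions.
    y: (n_observations,) expression levels.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Toy data (entirely made up): 50 gene/cell-type observations, 3 states.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(3), size=50)   # each row sums to 1, like proportions
true_beta = np.array([2.0, 0.5, -1.0])   # e.g. activating, weak, repressive
y = X @ true_beta                        # noise-free expression for clarity

beta = fit_state_coefficients(X, y)
```

Because the proportions in each toy row sum to one, an intercept column would be redundant with the state columns, so the sketch leaves it out. A positive fitted coefficient corresponds to a state associated with higher expression and a negative one to a state associated with lower expression.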
So now, instead of just having a color assignment for each state, we can assign it a number that's based on whether it's having a negative or positive impact on the expression of particular genes. That's what we call our eRP score, which is actually the weighted sum of the coefficients for each state for each of these cCREs. This is calculated for each gene and in each cell type, and it is available at our website.
We're starting to dig into this, so let me just show you one example. One of the genes that we like to focus on, ALAS2, is a clinically important erythroid-induced gene. It encodes the enzyme that catalyzes the rate-limiting step in heme biosynthesis. Erythroid cells need to make lots of heme to bind up with lots of globin, to make lots of hemoglobin and be a nice red blood cell. You can see from the RNA-seq results across this differentiation and maturation series that you go from almost no RNA-seq signal to a really high signal, and you can see within these introns very strong chromatin accessibility, which has been noticed before. Our lab has studied them, and many other labs have studied these, showing for instance that reporter gene expression is greatly enhanced by these intervals. And importantly, there are some families with X-linked sideroblastic anemia who show mutations in this element, so it's clinically important.
What if we look a little more broadly, though? We were just focusing in on ALAS2 in our previous studies. Looking within the TAD (these are the TAD boundaries for the region including ALAS2), you see a lot of places where there are epigenetic states that seem to be associated with gene activation.
So maybe some of these are distal regulators. We were particularly interested in the ones that had binding by known erythroid transcription factors like GATA1, TAL1, and so forth. Here are the eRP scores that we were getting across this region. You see there are a lot of positives. These are the known ones, at around seven, and some of these others have much larger eRP scores, so we wanted to test these out. We utilized R2 and R3, the intronic ones, as positive controls, and we wanted to look at R4, which is a more distal element.
April Coburn in my lab has been conducting directed mutations of these elements using CRISPR-Cas9, trying to target the guide RNAs to hit around these conserved GATA1 motifs. Sure enough, when the deletion goes into those motifs, you can see a substantial reduction. So that's a good positive control, and it confirms that this element has an effect on ALAS2. Similarly for R3, which has not been studied as much: her mutagenesis is hitting these GATA sites pretty frequently, and we see reductions. So this is all looking good. For the more distal element we have more limited data; this is very recent and we're still working on it, but you can see mutations in this element generating substantial losses in activity. We're following up on that, but it's very promising.
So I hope I've convinced you that integration of these epigenetic signals to get a regulatory landscape is useful. You can combine that with accessibility to come up with a good set of regulatory elements. Converting states to numbers by these regression models has a lot of potential and is very promising.
We're very happy with our initial investigations in these experimental evaluations. Our resources are at this website. I want to thank the members of my own laboratory, other labs at Penn State, and other members of the VISION Consortium for working with us. Of course, I'm very grateful for our funding, and I'm grateful for your attention. I'll be happy to take any questions.