Hi, I'm Jeff Vierstra, and today I'm going to share with you a little bit of our work to build nucleotide-resolution maps of transcription factor occupancy across the human genome. The eukaryotic genome is hierarchically packaged into chromatin, the fundamental unit being the nucleosome, which is 150 base pairs of DNA wrapped around a histone octamer. Long arrays of nucleosomes form chromatin fibers and then higher-order structures on top of that. Chromatin is punctuated by cis-regulatory elements such as promoters and enhancers, which control the expression of their cognate genes. Within cis-regulatory DNA, transcription factors cooperatively bind in the place of the canonical nucleosome. These TFs recruit complexes that directly control transcription. Thus, transcription factors form the basic building blocks of regulatory DNA, and I would argue that a full mechanistic understanding of gene regulation requires detailed maps of transcription factor occupancy across the whole genome. So how do we do this? To do this, we employ a strategy called DNase I mapping. Now, over 30 years ago, it was discovered that active regulatory DNA is exquisitely sensitive to cleavage and digestion by the non-specific endonuclease DNase I. The digestion of chromatin with DNase I generates small DNA fragments that can be purified and end-sequenced using massively parallel sequencing. Mapping these small fragments back to the genome enables the genome-wide detection of active regulatory DNA. Now, with sufficient sequencing depth, one can begin to visualize the activity of DNase I at single-nucleotide resolution. So if we sequence these CD8+ T cells to 200 million tags, you can see that DNase I cleavage is not uniform within one of these DHSs, but the signal is attenuated by the binding and occupancy of individual transcription factors. For example, here I'm showing NRF1.
Now, DNase I footprinting capitalizes on the fact that 50% of the total cleavages occur within accessible elements representing 1% of the human genome, highlighting the incredible signal-to-noise ratio inherent to DNase I mapping. Now, intuitively, DNase I footprinting reflects the outcome of a competition between transcription factors and DNase I for access to DNA. Footprinting works because the affinity of sequence-specific transcription factors for their specific binding sites is much greater than the affinity of DNase I for those particular sites. So highly occupied sites will result in marked protection, or a DNase I footprint, while lowly occupied sites reflect the intrinsic sequence preference of DNase I itself. So in order to truly fulfill the potential of DNase I footprinting as a method for the de novo genome-wide detection of transcription factor occupancy, we want to do this in a manner that doesn't require any prior knowledge of putative binding sites, only examining the cleavage profiles of DNase I themselves. And the trick here is to use an algorithm to identify footprints in the data itself. To do this is conceptually very simple. We're just going to find contiguous regions of the genome with cleavage imbalances. So one way you can intuitively think about this is running a window across the genome consisting of a footprint core and its flanking regions. Any time there are fewer cleavages in the core than in the flanks, you could call those putative footprints. However, this approach has a number of challenges that confound footprint detection, namely the variability in cleavage rates at adjacent bases, as well as the variation in accessibility amongst different DHSs in general. A major contributor to the variation in DNase I cleavage is that nucleases in general have intrinsic sequence specificity. In other words, DNase I, for example, doesn't cut naked DNA uniformly. What we know now is that it senses the minor groove width.
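The naive windowed scan described above (a footprint core compared against its two flanks) can be sketched in a few lines of Python. This is only an illustration of the intuition, not the method used in the study: the function names, the window sizes (`core`, `flank`) and the `ratio` cutoff are all assumptions made up for the example.

```python
from typing import List, Tuple


def naive_footprints(cleavage: List[int], core: int = 10, flank: int = 5,
                     ratio: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) core windows whose mean cleavage is below
    `ratio` times the mean cleavage of BOTH flanking windows.
    All thresholds are hypothetical illustration values."""
    hits = []
    n = len(cleavage)
    for start in range(flank, n - core - flank + 1):
        end = start + core
        left_mean = sum(cleavage[start - flank:start]) / flank
        core_mean = sum(cleavage[start:end]) / core
        right_mean = sum(cleavage[end:end + flank]) / flank
        # Require signal in both flanks so empty regions are not called.
        if left_mean > 0 and right_mean > 0 \
                and core_mean < ratio * min(left_mean, right_mean):
            hits.append((start, end))
    return hits


def merge_windows(hits: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Collapse overlapping windows into contiguous regions."""
    merged: List[Tuple[int, int]] = []
    for s, e in hits:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


# Toy per-nucleotide cleavage counts: high flanks, protected core.
counts = [8, 9, 10, 9, 8] + [1] * 10 + [9, 8, 10, 9, 8]
flat = [5] * 20
print(merge_windows(naive_footprints(counts)))
print(merge_windows(naive_footprints(flat)))
```

A scan like this ignores exactly the two confounders mentioned in the talk: base-to-base variability in cleavage and differences in overall accessibility between DHSs.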
In a collaboration with Harmen Bussemaker and Remo Rohs, we showed that the cleavage preference can be effectively explained by a six-mer sequence model. Now, we took advantage of this and developed a computational approach that incorporates both chromatin architecture and the empirical DNase I sequence preference to determine the expected per-nucleotide cleavage rate across the genome. And then for each dataset, we derived a statistical model for testing whether the observed cleavage rate at individual nucleotides deviated significantly from the expectation. So what you can do here is perform that statistical test, choose an appropriate cutoff, and call footprints. Now, we performed de novo footprint discovery independently on all of the datasets in this particular study, which is 243, and we detected on average about 650,000 footprints per dataset. Now, it should be noted that the number of footprints that you can detect is very dependent on sequencing depth. So the deeper you sequence, the more footprints you're going to find. Now, this was great, but we wanted to see if we could do even better. Specifically, we wanted to leverage the fact that there are now hundreds of regulatory DNA maps encompassing diverse human tissues and cell types. Comparative footprinting across all these cell types has the potential to really illuminate both the structure and function of regulatory DNA. However, a systematic approach for the joint analysis of genomic footprinting data has really been lacking. Now, as I mentioned, given the scale and the diversity of these cell types, we really wanted to develop a framework that could integrate hundreds, even thousands, of available datasets to increase the precision and resolution of footprint detection.
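To make the "expected versus observed" idea concrete, here is a minimal Python sketch. It redistributes the total cleavage in a window according to hypothetical six-mer bias weights and then asks how surprising the observed count is at each nucleotide. The bias values, the function names and the use of a Poisson lower-tail test are all assumptions for illustration; the talk does not spell out the actual statistical model, which is fit separately for each dataset.

```python
import math
from typing import Dict, List


def expected_cleavage(seq: str, observed: List[int],
                      bias: Dict[str, float], k: int = 6) -> List[float]:
    """Spread the window's total cleavage across positions in proportion
    to the bias weight of the k-mer centred on each position.
    Assumes len(seq) == len(observed); unknown k-mers get weight 1.0."""
    half = k // 2
    weights = []
    for i in range(len(observed)):
        if half <= i <= len(seq) - half:
            kmer = seq[i - half:i + half]
        else:
            kmer = ""  # too close to the window edge for a full k-mer
        weights.append(bias.get(kmer, 1.0))
    total_w = sum(weights)
    total_obs = sum(observed)
    return [total_obs * w / total_w for w in weights]


def poisson_lower_tail(obs: int, lam: float) -> float:
    """P(X <= obs) for X ~ Poisson(lam): the chance of seeing this few
    cleavages or fewer if the site were unprotected."""
    term = math.exp(-lam)
    total = term
    for x in range(1, obs + 1):
        term *= lam / x
        total += term
    return min(total, 1.0)


seq = "ACGTACGTACGT"
observed = [10, 10, 10, 10, 0, 0, 0, 0, 10, 10, 10, 10]
bias = {"ACGTAC": 2.0, "GTACGT": 0.5}  # hypothetical six-mer preferences

expected = expected_cleavage(seq, observed, bias)
pvals = [poisson_lower_tail(o, e) for o, e in zip(observed, expected)]
for i, (o, e, p) in enumerate(zip(observed, expected, pvals)):
    print(i, o, round(e, 2), round(p, 4))
```

Note that a position with zero cleavages under a disfavoured six-mer is less significant than one with zero cleavages under a neutral six-mer, which is the point of correcting for sequence preference before calling footprints.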
And we also wanted to build a scaffold, a common reference index of transcription factor occupancy across the genome, essentially to build a composite view of what individual regulatory DNA looks like. So to do this, we implemented an empirical Bayes framework that estimates the posterior probability that a given nucleotide is footprinted. We did this by incorporating a prior on the presence of a footprint and a likelihood model of cleavage at both occupied and unoccupied sites. Now, this worked amazingly well. As you can see here, in the heat map I'm showing you the posterior value from that empirical Bayes approach, and if we plot the individual nucleotides scaled by their footprint prevalence across all the samples, doing so precisely resolves the core recognition sequences for all these diverse TFs at the bottom. Now, what we wanted to do is build this reference set of TF-occupied DNA. So what we did here is apply the same consensus approach described by Wouter, which he used to build the DHS index. We collated the overlapping footprint regions across individual datasets into a nucleotide-resolution consensus footprint map. We applied this approach to all DHSs detected in one or more of the 243 datasets in this study, and this collectively delineated approximately 4.6 million distinct consensus footprints present in one or more cell types. And these footprints were found within 1.6 million of the 3.3 million DHSs in the index. So we discovered footprints in about 50% of the DHSs. Footprint occupancy across all datasets showed marked enrichment for the recognition sequences of the master regulatory TFs of their corresponding lineages. For example, you see GATA1 footprints in erythroid cells, HNFL footprints in fetal intestine, and PAX6 footprints in fetal eye.
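The posterior calculation mentioned above can be illustrated with a small Python sketch of Bayes' rule at a single nucleotide. The talk only says that a prior and likelihoods for occupied and unoccupied sites are combined, so the specific choices here are assumptions: Poisson likelihoods, a fixed `depletion` factor for occupied sites, and a `prior` supplied by the caller.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of k events at rate lam (lam must be > 0)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def footprint_posterior(obs: int, expected: float, prior: float,
                        depletion: float = 0.2) -> float:
    """Posterior probability that a nucleotide is footprinted.

    Hypothetical model: unoccupied sites are cleaved at the expected
    rate, occupied sites at `depletion` times that rate."""
    like_occupied = poisson_pmf(obs, expected * depletion)
    like_unoccupied = poisson_pmf(obs, expected)
    numerator = prior * like_occupied
    denominator = numerator + (1.0 - prior) * like_unoccupied
    return numerator / denominator if denominator > 0 else prior


# One cleavage where ten were expected: strong evidence of protection.
protected = footprint_posterior(obs=1, expected=10.0, prior=0.1)
# Ten cleavages where ten were expected: no evidence of protection.
exposed = footprint_posterior(obs=10, expected=10.0, prior=0.1)
print(round(protected, 3), round(exposed, 6))
```

In a joint analysis, the prior could be informed by how often the same nucleotide looks protected in other datasets, which is one way that combining hundreds of samples can sharpen the call in any single one.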
Now, we find an enrichment in a cell type for virtually all major classes of DNA-binding domain families, suggesting that very few families, if any at all, are refractory to DNase I footprinting. For degenerate motifs, where the same sequences are recognized by many distinct TFs, we observed highly cell-specific occupancy patterns that could be further decomposed into coherent groups that corresponded to cell type and function, as you can see here with two different E-box families. Now, because TF engagement creates alterations in DNA shape and protects the underlying phosphate bonds from nuclease attack, we wondered to what extent fluctuations in DNase I cleavage rates reflected the topology of the transcription factor-DNA interface en masse. Here we focused on CTCF. CTCF is a well-known poly-zinc-finger protein, which has two clusters of zinc fingers that bind DNA separated by a hinge, and that hinge region is thought to mediate DNA bending. So what we did is we transposed the per-nucleotide cleavage propensity, or the average cleavage propensity, onto the co-crystal structure of CTCF with DNA. And what you can see here is that this accurately traced all the known features of the protein-DNA interaction interface, including the focal hypersensitivity limited to a couple of nucleotides within the hinge region, which is thought to mediate bending and likely modifies the minor groove width, accentuating DNase I cleavage. Furthermore, we can plot the corrected cleavage counts for all CTCF sites genome-wide in regulatory T cells, and this reveals that these topological features are immediately evident even at the level of individual footprints on the genome, so we're looking at the structure of individual binding sites. We examined this further for a number of TFs and found that the average footprint width for diverse TFs tightly tracked the width of their respective recognition sequences.
So taken together, I think that demonstrates that the extent and profile of per-nucleotide DNase I cleavage is really reflective of the 3D structure of the regulatory DNA element, and we're definitely interested in looking at this in the future. Another extremely powerful feature of genomic footprinting is its ability to reveal the logic of individual cis-regulatory elements. For example, in non-nervous cell types, occupancy of the repressor NRSF (also known as REST) in the SCAMP5 promoter silences transcription. However, in nervous tissues like bipolar neurons, NRSF is not expressed and not bound, and transcription of SCAMP5 occurs. To detect differential occupancy and regulation in an unbiased manner, we devised a per-nucleotide differential occupancy test similar to those employed for differential gene expression analysis. Here in the SCAMP5 promoter, our test identifies the precise nucleotides that are differentially occupied in nervous tissues and cell types. This revealed the fundamental logic of the SCAMP5 promoter. In nervous tissues, two activators, ZFX and TFAP2, replace REST and drive expression, whereas strong REST occupancy in non-nervous tissues silences transcription. While the vast majority of disease- and trait-associated variation is non-coding, identifying the specific genetic variants that are likely to affect regulatory function remains a significant challenge. Deep sequence coverage of individual DHSs enables the de novo genotyping of regulatory variation and the simultaneous characterization of its functional effect. We do this by quantifying and comparing the cleavage on each allele at heterozygous sites. Our data collectively encompass 243 individual datasets, and these datasets are derived from 147 individuals. De novo genotyping across all these individuals revealed four million variants. Of these four million, 1.65 million were heterozygous and had the power to accurately quantify allelic imbalance.
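Comparing cleavage on the two alleles of a heterozygous site can be illustrated with a simple exact binomial test, sketched below in Python. This is a generic stand-in and not necessarily the test used in the study; the function names and the `min_reads` power threshold are assumptions made up for the example.

```python
import math
from typing import Optional


def binomial_two_sided(ref: int, alt: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for the null hypothesis that
    reads fall on the reference allele with probability p."""
    n = ref + alt
    pmf = [math.comb(n, k) * p ** k * (1 - p) ** (n - k)
           for k in range(n + 1)]
    observed = pmf[ref]
    # Sum every outcome that is at least as unlikely as the observed one.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


def allelic_imbalance(ref: int, alt: int,
                      min_reads: int = 20) -> Optional[dict]:
    """Return the reference-allele fraction and p-value, or None when
    coverage is too low to test (min_reads is a hypothetical cutoff)."""
    n = ref + alt
    if n < min_reads:
        return None
    return {"ref_fraction": ref / n, "p_value": binomial_two_sided(ref, alt)}


balanced = allelic_imbalance(10, 10)
skewed = allelic_imbalance(18, 2)
shallow = allelic_imbalance(3, 1)
print(balanced, skewed, shallow)
```

The `None` result for shallow sites mirrors the point made in the talk that only a subset of heterozygous variants has enough coverage to quantify imbalance.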
Across individuals, we conservatively identified 117,000 variants that altered DNA accessibility on an individual allele. Notably, within DHSs, single nucleotide polymorphisms that were allelically imbalanced were markedly enriched in the core of consensus footprints, highlighting that footprints are pinpointing the functional sites on the genome. Genomic footprinting data provides a unique nucleotide-resolution view and interpretation of the impact of individual DNA variants on chromatin structure. As an example, I'm showing you a variant within a DHS identified within the intron of the gene EGHD1. Here the variant is a C to a G. Allelic resolution of the DNase I cleavages in heterozygous individuals reveals that the C allele results in no footprint, whereas the G allele results in a strong footprint. We can confirm this in homozygous individuals by looking at individuals that contain either the C or the G, and you can see here that in these particular samples, retina and HT29 cells, the C allele also has no footprint while the G allele contains a strong footprint. We can also confirm this by performing our differential footprint test across the two alleles. This reveals the precise nucleotides that are affected by this variant and shows that the G allele is creating a strong NFIX binding site. So this means that the derived allele is creating a gain-of-function binding site on the genome. While this might seem surprising, this actually occurs about half of the time. You can find many more examples of this across the genome, where variants affecting NFIX occupancy result in either the loss of binding or the gain of binding. Here in this plot, I'm showing you the relative DNase I protection in homozygotes on the Y axis versus the proportion of cleavages on the reference allele, or allelic imbalance, in heterozygotes on the X axis. And you can see here there's strong consistency in the variant's effect between these two different configurations.
Consistent with this, we can also look at the variant's effect on the recognition sequence, that is, the predicted energetic effect on the sequence alone. And what you can see here is that for variants that result in the loss of binding as measured by footprinting, the alternative allele creates a weaker motif than the reference allele. In contrast, for variants that result in the gain of binding, the alternative allele creates a stronger motif than the reference allele. So taken all together, this shows that genomic footprinting provides an ultra-high-resolution view of regulatory variation and its impact on transcription factor occupancy. Given that genetic variation affecting chromatin accessibility is enriched within footprints and that trait-associated genetic variation localizes within DHSs in general, we wondered whether disease- and trait-associated genetic variation would preferentially localize to genomic footprints versus non-footprinted regulatory DNA. To do this, we performed an enrichment analysis of GWAS catalog SNPs, after LD expansion, in DHSs but outside footprints and then at increasing stringencies of consensus footprints. What we find here is a strong increase in enrichment in the strongest footprints, while there is no enrichment within DHSs outside footprints, suggesting that the vast majority of trait-associated variation is mediated by variation within footprints. To gain a more accurate view of the enrichment of trait-associated variants in footprints, we compared the SNP-based heritability of individual traits. We did this using summary statistics from individual GWAS studies from the UK Biobank. We applied partitioned LD score regression to compute the relative heritability of these variants within all DHSs versus footprints.
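The comparison of motif strength between the reference and alternative alleles, mentioned earlier in this part of the talk, can be sketched as a difference in log-odds scores under a position weight matrix. The Python below is a toy illustration only: the four-position matrix, the sequences and the uniform background are invented for the example and do not represent any real TF motif or the scoring used in the study.

```python
import math
from typing import Dict, List

# Hypothetical 4-bp motif whose consensus is "TGGC" (not a real TF model).
TOY_PWM: List[Dict[str, float]] = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]


def log_odds(site: str, pwm: List[Dict[str, float]],
             background: float = 0.25) -> float:
    """Sum of log2(probability / background) over motif positions."""
    if len(site) != len(pwm):
        raise ValueError("site length must match motif length")
    return sum(math.log2(col[base] / background)
               for base, col in zip(site, pwm))


def motif_delta(ref_site: str, alt_site: str,
                pwm: List[Dict[str, float]]) -> float:
    """Positive values mean the alternative allele is the stronger motif."""
    return log_odds(alt_site, pwm) - log_odds(ref_site, pwm)


# A C-to-G change at the third position creates the consensus site.
gain = motif_delta("TGCC", "TGGC", TOY_PWM)
# The reverse change destroys it.
loss = motif_delta("TGGC", "TGCC", TOY_PWM)
print(round(gain, 3), round(loss, 3))
```

Under this convention, variants that gain a footprint would be expected to show positive deltas and variants that lose one negative deltas, which is the pattern described in the talk.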
What we found after doing this analysis was a striking enrichment of variants that account for trait heritability in footprints versus DHSs, and most prominently in footprints from the corresponding cell types. For example, if you look at red blood cell counts in erythroid footprints, you see an enrichment of approximately 45-fold. So taken together, we conclude that the genetic signals from disease- and trait-associated variants are emanating primarily from TF footprints within DHSs and that the variants within footprints seem to be the major contributors to trait heritability in regulatory DNA. To wrap up, digital genomic footprinting is a structural readout of regulatory DNA topology and an incredibly powerful approach to map transcription factor occupancy. We've made some computational advances to integrate hundreds to thousands of different cell types and to build nucleotide-resolution consensus maps of transcription factor occupancy within millions of cis-regulatory elements. Digital genomic footprinting also enables the unbiased, de novo discovery of regulatory logic and structure at individual loci. Using this approach, we can measure the effects of genetic variation on TF occupancy at nucleotide resolution, and using this we're able to show that disease- and trait-associated variation is specifically and preferentially enriched within footprints versus non-footprinted regulatory DNA. Now, there are many more findings in the paper, and I hope you give it a read and take a look.
I would also like to say that the digital genomic footprinting data is now available in the UCSC browser through a track hub. So you can go to the UCSC browser and load up a public track hub; it's called "Digital genomic footprinting from 243 cell and tissue types". When you load that up, you'll get something that looks like this, where you can see footprinting data from individual samples. This includes the per-nucleotide cleavage counts, both the observed and the expected tracks, footprint calls at different FDR thresholds in individual samples, a consensus footprint track, and motif matches from the overlapping footprints. So give it a look. Finally, I'd like to make some acknowledgments to the people that supported and helped in this project. First, I'd like to acknowledge three talented computational biologists that aided in the data analysis. John Lazar was critical in designing some of the statistical frameworks used for footprint detection. Shane Neph and Eric Haugen helped in data processing. I'd also like to specifically acknowledge the ENCODE data production team at the Institute for generating such fabulous DNase I data, and my two colleagues at the Institute, John Stam and Wouter Meuleman. I'd also like to reiterate that the data is available at both Zenodo and my personal website, and there is extensive code and documentation on GitHub. Finally, I'd like to acknowledge our funding sources, the NHGRI ENCODE project and a charitable donation from GlaxoSmithKline. Thank you.