Hello everyone. Thank you for joining the ENCODE users meeting, which is unfortunately virtual this year. My name is Annika, and I'm the ENCODE project manager in Professor Mike Snyder's lab at Stanford in California. He already gave a great scientific talk yesterday, and today I will give you an overview of our production center here at Stanford: what we are doing, which data sets we are making available to the research community, and what our goals are.
Our production center is led by Professor Mike Snyder together with Professor Greenleaf and Professor Chang here at Stanford. We are working very closely with our collaborators at Northwestern University in Chicago and with two surgeons at Washington University and the University of Washington.
Although determining the genome sequence of individuals is becoming easier, our understanding of the function of most of the human genome is still limited. Mapping of regulatory information is particularly crucial, since most common variants associated with human diseases lie outside of coding regions. We are mainly performing three major assays in our lab. We map transcription factor binding sites using chromatin immunoprecipitation of tagged transcription factors. We expand the catalog of regulatory elements by analyzing open chromatin regions. Finally, we also map open chromatin regions in single cells from these types of biosamples. We believe that these studies greatly expand the catalog of regulatory regions in the human genome.
During my rather short presentation today, I will give you an overview of how we are mapping transcription factor binding sites and what is available to users for further analysis on the ENCODE portal. I will give you a short overview of our open chromatin assay efforts. And I will end by giving you some information about the biosamples that we chose and collected, and how those are processed within the ENCODE Consortium in an integrative effort to maximize assay space.
So let's start with the transcription factor ChIP. The transcription factor working group consists mainly of two ENCODE production centers: ours at Stanford, and the lab of Rick Myers and Eric Mendenhall at HudsonAlpha in Alabama. The objectives of our production center are to create a global, comprehensive transcription factor binding site catalog and to identify enriched DNA-binding motifs. We try to expand the catalog of regulatory DNA currently obscured by cellular heterogeneity and to reveal accessibility variation. We try to uncover regulatory elements by comparing a large number of different normal and diseased human samples. And we develop novel methods and standards for best practices and make them available to the research community.
We have a transcription factor list curated from various annotation databases using a variety of criteria, such as a known motif or DNA-binding domain. This list contains about 1,600 transcription factors, and about 1,300 of those have a TPM over one.
So those are the ones that we are primarily focusing on, obviously for experimental and technical reasons. We analyzed transcription factor expression from RNA-seq data sets that were previously generated by ENCODE, and we determined the cell types that have the broadest aggregate set of transcription factor expression, to maximize element discovery. So currently we are expressing transcription factors in the following cell lines: K562, A549, SK-N-SH, MCF-7, WTC11, PGP1, GM12878, and HepG2. This of course doesn't necessarily mean that the experiments work best in the cell line where the transcription factor is expressed highest, so we try alternatives in that case.
As of now, our production center has submitted about 1,800 ChIP-seq data sets that are available from this round and previous rounds of ENCODE, and there are about 4,600 ChIP-seq data sets in total. The limiting factor in a ChIP-seq experiment is usually the availability of highly specific antibodies. It's time-consuming.
It's expensive to create novel antibodies and to screen commercially available antibodies for use in ChIP-seq. For this reason, in our production center we tag transcription factors with GFP using CRISPR-Cas, and our colleagues at HudsonAlpha use a FLAG tag instead of the GFP tag. Transcription factors, along with their associated DNA, are immunoprecipitated using antibodies against GFP, and the transcription factor-bound DNA is sequenced. We have successfully demonstrated in the past that antibodies against GFP, HA, and FLAG tags can all be used successfully in ChIP-seq. Our data sets usually consist of two independent cultures of cells, that is, two biological replicates of 10 to 20 million cells, and peaks of tagged transcription factors are scored for enrichment relative to input controls. I'll come back to this later.
For epitope tagging of transcription factors, we perform two characterization methods, and both must pass to validate our ChIP-seq results before they are released. First, we validate the insertion of our tag by PCR. Here's an example for the transcription factor FOXS1. This characterization is called genetic modification characterization. I'm using those terms here because this is how you can find it on the portal, so I think it makes sense to use them throughout the presentation. In addition to that, we add the biosample characterization, which is a Western blot with a GFP antibody in our case, or a FLAG tag antibody for our colleagues at HudsonAlpha, to confirm that the transcription factor is expressed and that the full-size protein, not a truncated version, is present in our cells. For details, please refer to the ENCODE experimental guidelines on the portal, or ask me or my colleagues.
I'd like to mention here that we deposit our constructs at Addgene.
So if they're properly validated and the ChIP-seq experiment is a success and released, we send all our constructs to Addgene.
Once we have all the validations and sequencing results, the metadata has been submitted to our data coordination center, and everything has been reviewed, the data sets are released to users and can easily be queried on the website, which looks like this. An amazing feature of ENCODE is that the raw data is processed by uniform pipelines, which not only makes the data available to more researchers but also makes data sets easier to compare. You can see here that you see the transcription factor, you can choose the experiment, and you see which lab produced it, whether there is a flag or an audit, and which cell line was used.
Data quality is assessed by a variety of means that have been built into our standard analysis pipelines. Duplicate reads are usually removed and mapped reads are scored; typically only unique, non-repetitive reads are subjected to peak calling. QC scores include the fraction of non-redundant reads in an experiment, like NRF, and the enrichment of signal in peaks, like FRiP scores, for high-quality experiments. Then there is cross-correlation analysis for strand distribution, and also the number of reproducible peaks, which is scored using IDR.
So this is more of a general overview of how the ChIP-seq pipeline looks. If you click on an example on the portal, this is what you see. It is shown only for replicate one, but of course it exists for replicate two and for the wild-type controls as well. On the portal itself, you can click on these little arrows to get additional information on a step in the pipeline, and you can also click on these little green dots to get information on, for instance, which software was used to do the peak calling, certain metadata, or quality metrics such as IDR or the number of reproducible peaks.
I mentioned that we are using
GFP, and our colleagues at HudsonAlpha use a FLAG tag for tagging transcription factors. In previous rounds of ENCODE, our production center used native antibodies. Using a common epitope to tag different transcription factors has many advantages. It requires only one antibody to be validated, which can then be used to study a lot of transcription factors in the human genome. Furthermore, since this epitope is not encoded within the genome, potential antibodies can quickly be screened for non-specific interactions on unmodified cells. Comparing these data sets, native antibody versus FLAG versus GFP, native versus GFP, and all kinds of other comparisons, showed that, regardless, the overlap is pretty high in both cases. So we think it's a valid approach for us.
Currently, our production center and the HudsonAlpha production group use different methods for generating and applying background controls for our experiments. So sometimes when you click on a transcription factor data set on the portal, you see that it has two controls. One of those controls is the pooled input, with rep one and rep two pooled.
The other one is the untagged wild-type cell line, which is used as a wild-type control here. So the question for us, of course, was: is there a significant difference in peak-calling performance based on the type of background control that we use? As hypothesized, most of the variance in peak calling among the different background controls resides within the weakest peaks. I'm not showing all the data here, but there is actually no better-performing control, which is why you sometimes find both and sometimes only one on the portal.
ChIP-seq has numerous advantages over general mapping of open chromatin regions, which are potential regulatory regions, of course, and it enables proper assignment of the transcription factor to specific genomic regions. In contrast, high-resolution mapping of open regions to identify footprints can identify motifs and can be suggestive of classes of transcription factor binding regions. Open chromatin mapping is a single assay that can readily be performed on many cell types, enabling a broad survey of regulatory information in many different tissues and cell types. We are performing bulk ATAC-seq and also single-cell ATAC-seq in our production center to complement our transcription factor ChIP-seq data sets.
ATAC-seq is a very sensitive and accurate probe of open chromatin. The data sets that are available from our production center cover all the cell lines that we use for ChIP, plus other cell lines, with both bulk and single-cell ATAC. But we also do a lot of bulk and single-cell ATAC on human primary cells and tissues.
So I would like to transition quickly into presenting what kind of tissues we were collecting for ENCODE 4 and what we did with those tissues. Our collaborators at Washington University and the University of Washington collected a variety of human tissues, mainly from 16 donors.
The donors are healthy or have cardiac diseases. We have a variety of tissues and organs, with diverse ages and disease states. The tissues are high quality and suitable for single-cell resolution, which is very exciting for us. They are explicitly consented for genomic data sharing and for sharing with other researchers, which of course is important for a project like ENCODE, and they are explicitly consented for immortalization and reprogramming. Really important for us as well is that they are high quality, as you can see here, for instance, from the RIN score, which is very good. So we get high-quality data from high-quality tissues.
There is a variety of different tissues, as you can see here. This is only a short summary; we do have more than that. High quality means we can do a lot of different assays on them, and I'd like to mention that there's a huge collaborative effort within ENCODE, by the so-called biosample working group, to collect many different assays on these tissues and also on other biosamples that were collected by other production centers that are part of the ENCODE 4 project. I'm just listing some of the labs and mapping centers here on the right; many people are involved in this. We do have data sets on the same tissue for different assays, such as histone ChIP-seq, assays for DNA accessibility like ATAC, single-cell ATAC, and DNase, a lot of assays concerning RNA, and also 3D chromatin structure.
So, to sum up: what we think is the most exciting feature of our data is the combination of transcription factor epitope ChIP-seq and ATAC-seq on these cell types. I think those two provide complementary data that is very exciting for the research community.
So it is the depth of the transcription factor data, the high resolution, but then also the breadth of bulk ATAC and single-cell ATAC on many tissues, in concert with many different assays from many different groups, that are all available on the ENCODE portal for analysis.
So, many data sets and many people involved. There are probably people missing here on this slide, too. There are many people at Stanford, then our collaborators at Northwestern, our two surgeons at Washington University and the University of Washington, who collect really high-quality, open-consented tissues for us, our colleagues at HudsonAlpha, the ENCODE biosample working group, and of course NIH, for their great support and funding. Thank you for your attention, and I'm happy to answer any questions.
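As an illustration of two of the QC scores mentioned in the talk, here is a minimal sketch; it is not the ENCODE pipeline itself. It assumes NRF is the number of distinct read positions divided by the total number of mapped reads, and that FRiP is the fraction of mapped reads whose position falls inside a called peak. The function names, the tuple layouts, and the toy reads and peaks are all hypothetical.

```python
from bisect import bisect_right


def non_redundant_fraction(reads):
    """NRF sketch: distinct (chrom, position, strand) tuples / total mapped reads."""
    if not reads:
        return 0.0
    return len(set(reads)) / len(reads)


def fraction_of_reads_in_peaks(reads, peaks):
    """FRiP sketch: share of reads whose position lies inside a peak.

    Assumes peaks are half-open (chrom, start, end) intervals that do not overlap.
    """
    if not reads:
        return 0.0
    # Group peak intervals by chromosome and sort them by start coordinate.
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    in_peaks = 0
    for chrom, pos, _strand in reads:
        intervals = by_chrom.get(chrom, [])
        # Index of the last interval whose start is <= pos (or -1 if none).
        i = bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
            in_peaks += 1
    return in_peaks / len(reads)


# Toy data (hypothetical): five mapped reads, two of them exact duplicates.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 150, "-"),
    ("chr1", 900, "+"),
    ("chr2", 50, "+"),
]
peaks = [("chr1", 90, 200), ("chr2", 0, 40)]

print(non_redundant_fraction(reads))             # 4 distinct / 5 total
print(fraction_of_reads_in_peaks(reads, peaks))  # 3 of 5 reads fall in a peak
```

In this toy example the duplicate read lowers NRF, and only the three reads inside the chr1 peak count toward FRiP. Real pipelines work on alignment files and apply their own filtering rules, so the numbers on the portal come from the uniform pipeline, not from a calculation like this one.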