Hello everyone. My name is Saumya Agrawal. I'm a research scientist in the Laboratory for Applied Computational Genomics at RIKEN, Japan. Today I'm going to talk about my project where, as part of FANTOM6, we are exploring the biological role of long non-coding RNAs (lncRNAs) using Hi-C data. Thousands of lncRNAs are known to be transcribed across different cell types. However, if we look into the literature, 96% of them are still functionally unannotated, demonstrating a major gap in our knowledge. Individual studies have shown that lncRNAs are involved in different functions, including transcriptional and translational regulation, chromatin remodelling, and so on. Therefore, to explore the role of lncRNAs in different cell types, we have used Hi-C data to identify the protein-coding genes that are in close proximity to a lncRNA and used this information to infer its biological role. If you are not aware of the Hi-C method, it is basically a technique to identify all the genomic regions that are interacting with each other in a selected cell type. To start the analysis, we uniformly processed deeply sequenced Hi-C data for different cell types, including in-house data generated for iPSCs and dermal fibroblasts together with data from ENCODE and other individual studies. Once we identified the significant interactions, we mapped the expressed promoters, identified using matched CAGE data, onto them to obtain annotated interactions for downstream analysis. One such analysis showed us that mRNA-coding and intergenic lncRNA gene pairs are non-randomly present compared to same-type promoter pairs. This is not restricted to one cell type but is true for all the selected cell types, showing that it is a general feature of the interactions. Overall, this suggests that mRNA-coding and intergenic lncRNA gene pairs are non-randomly spatially co-localized, possibly due to a regulatory role.
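The annotation step described above, mapping expressed promoters onto significant Hi-C contacts, can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the bin size, field layout, and all names are assumptions.

```python
# Hypothetical sketch: map promoters (e.g. CAGE-defined TSSs) onto fixed-size
# Hi-C bins, then label each significant bin-bin contact with the promoter
# pairs it connects. Bin size and record layouts are illustrative assumptions.

BIN_SIZE = 10_000  # assumed Hi-C resolution

def bin_of(pos):
    """Return the index of the Hi-C bin containing a genomic position."""
    return pos // BIN_SIZE

def annotate_interactions(interactions, promoters):
    """interactions: list of (chrom, bin_a, bin_b) significant Hi-C contacts.
    promoters: list of (chrom, tss_pos, gene_id, gene_class).
    Returns one (gene, class, gene, class) record per connected promoter pair."""
    # Index promoters by (chrom, bin) for constant-time lookup per contact end.
    by_bin = {}
    for chrom, tss, gene, cls in promoters:
        by_bin.setdefault((chrom, bin_of(tss)), []).append((gene, cls))
    annotated = []
    for chrom, a, b in interactions:
        for ga, ca in by_bin.get((chrom, a), []):
            for gb, cb in by_bin.get((chrom, b), []):
                annotated.append((ga, ca, gb, cb))
    return annotated

promoters = [("chr1", 15_000, "lncRNA-1", "lncRNA"),
             ("chr1", 85_000, "GENE-A", "mRNA")]
interactions = [("chr1", 1, 8)]  # bins containing the two promoters above
pairs = annotate_interactions(interactions, promoters)
```

Counting how often each pair of promoter classes (mRNA/mRNA, mRNA/lncRNA, lncRNA/lncRNA) appears in `pairs`, versus a randomized control, is then enough to test the non-random co-occurrence the talk describes.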
Next, we generated gene clusters by selecting the genomic regions that are up to three Hi-C interactions away from the lncRNA window. These gene clusters were used for various downstream analyses. One analysis we performed was gene ontology (GO) analysis for each cluster. The results show that up to 58 percent of lncRNAs have at least one enriched GO term, depending on the cell type. Further comparison between the Hi-C clusters and linear clusters shows that Hi-C clusters can annotate many more lncRNAs than linear clusters. Overall, this shows that the information contained in the Hi-C clusters is distinctive and cannot be recapitulated by a linear cluster. Further, we looked into the association between lncRNA lineage and Hi-C interactions. We found that the number of interacting mRNA-coding promoters increases from human-specific to older lncRNAs. Furthermore, GO terms associated with human-specific lncRNA Hi-C clusters are more cell-type specific than those of older lncRNAs. As shown here for one cell type, human-specific lncRNAs show enrichment for terms like cell adhesion and platelet activation, while older lncRNAs are enriched in terms related to the response to metal ions. Overall, this shows that human-specific lncRNAs have more cell-type-specific functions, while older lncRNAs have more ubiquitous functions. Altogether, we can use this information to annotate unannotated lncRNAs. Here is an example of an unannotated lncRNA which is specifically expressed in HepG2. Its cluster is enriched in different blood-associated GO terms. The lncRNA has a positive correlation with genes in its cluster belonging to a gene family known to be involved in blood coagulation. Further, GWAS analysis shows an enrichment of traits associated with blood regulation in the compartment that overlaps with the Hi-C cluster of the lncRNA. Overall, this suggests that this lncRNA may have a role in blood coagulation.
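Selecting everything "up to three Hi-C interactions away" amounts to a breadth-first search of depth three on the interaction graph. A minimal sketch, assuming the interactions are already encoded as an edge list of genomic windows:

```python
# Sketch of building a gene cluster as all genomic windows reachable within
# three Hi-C interactions of a lncRNA window: a depth-limited breadth-first
# search. The graph encoding and window names are illustrative assumptions.

from collections import deque

def hic_cluster(edges, start, max_hops=3):
    """edges: iterable of (window_a, window_b) significant interactions.
    Returns the set of windows within max_hops interactions of start."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)  # interactions are symmetric
    seen = {start: 0}  # window -> hop distance from the lncRNA window
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return set(seen)

edges = [("lnc", "w1"), ("w1", "w2"), ("w2", "w3"), ("w3", "w4")]
cluster = hic_cluster(edges, "lnc")  # w4 is four hops away, so excluded
```

The genes falling in the returned windows would then feed the GO enrichment test for that lncRNA's cluster.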
Further, we are developing a Hi-C visualization website which will provide a platform to explore the role of lncRNAs by combining all the different analyses generated in this project. Overall, this project provides a platform to explore the function of hundreds of intergenic lncRNAs expressed across the 18 selected cell types, and it is not restricted to intergenic lncRNAs but can also be extended to other lncRNAs, showing the strength of the project. In the end, I would like to thank my supervisor, Dr. Michiel de Hoon, together with my lab members and all the collaborators for their support and their analyses for this project. Last but not least, I'd like to thank you all for your attention, and I'm looking forward to your comments and questions. Thank you.

I would like to start by thanking the organizers for this wonderful meeting and the opportunity to present our work on a harmonized, scalable functional genomic data repository. As we all know, typical genomic workflows require querying multiple and often large-scale genomic data sources, where the results of the queries are then linked, filtered, and summarized across these data sources. However, the functional genomic data required for such analyses is distributed across many sources. Individual data sets are generated for various biological conditions and for various tissues and cell types. The data sets themselves are stored in a variety of formats and described in often data-source-specific metadata schemas. All of this together makes it difficult to query, compare, and aggregate functional genomic data across data sources, and in general it makes it challenging to integrate this data into new or existing workflows.
FILER is our work aimed at addressing some of these needs by providing a harmonized collection of functional genomics data integrated across more than 20 data sources, providing a total genomic coverage of over 2,000-fold and capturing a variety of biological conditions with over 1,000 tissue and cell types. Importantly, all of this data is consistently annotated with extensive metadata. All of the data sets are uniformly processed into BED-based formats, and the data is organized in an ontology-driven way into smaller data collections such as FANTOM5 enhancers, Roadmap enhancers, ENCODE histone ChIP-seq data, or ENCODE DNase-seq data. Importantly, all of these data are coupled with Apache Spark and Giggle genomic-indexing-based APIs to allow for scalable access to this large-scale data. This pie chart shows the data composition of the first version of FILER, and as we can see here, FILER includes data from ENCODE, including the latest ENCODE phase 3, from GTEx, the NIH Roadmap Epigenomics Project, the FANTOM project, and so forth. These data cover over 30 different experimental data types, including ChIP-seq, DNase-seq, and ATAC-seq, as well as expression and splicing QTL data. And as I said, all of this data can be queried scalably: shown here is an illustration of using FILER data and overlapping it with genetic or structural variants, with variant sets of various sizes. As we can see, FILER allows for highly parallelizable access that scales linearly with the number of cores: as indicated on the x-axis, as we scale from 16 to 96 cores, the search speed on the y-axis increases linearly. So to sum up, FILER provides a harmonized functional genomic data collection across more than 20 data sources, including ENCODE. FILER can be accessed through the web interface or deployed on your local server or in a cloud environment.
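The reason this kind of access parallelizes so well is that each track can be scanned independently. The sketch below illustrates that structure with a thread pool; it is not FILER's actual API (which uses Spark and Giggle indexing), and all names and data layouts here are assumptions.

```python
# Illustrative sketch of per-track parallelism: each BED-like track is
# queried for overlaps independently, so tracks can be distributed across
# workers. This is a toy stand-in for an indexed engine, not FILER's API.

from concurrent.futures import ThreadPoolExecutor

def count_overlaps(track, variants):
    """Count variants falling inside any interval of one BED-like track."""
    return sum(
        1 for chrom, pos in variants
        if any(c == chrom and s <= pos < e for c, s, e in track)
    )

def query_tracks(tracks, variants, workers=4):
    """Scan each track in its own worker; results come back in track order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: count_overlaps(t, variants), tracks))

tracks = [
    [("chr1", 100, 200), ("chr2", 50, 80)],  # e.g. an enhancer track
    [("chr1", 500, 600)],                    # e.g. a histone ChIP-seq track
]
variants = [("chr1", 150), ("chr1", 550), ("chr2", 90)]
hits = query_tracks(tracks, variants)  # overlap count per track
```

Because the per-track work is independent, throughput grows with the number of workers, which is the linear core-scaling behavior shown on the slide.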
And we hope this flexibility will make FILER a go-to place for functional genomics data. I would like to wrap up by thanking all the people who contributed to this work as well as the funding agencies. Thank you for your attention.

I'm Lance Hentges, and today I'm going to talk to you about the LanceOtron peak caller. Chromatin profiling assays like ATAC-seq, ChIP-seq, and DNase-seq work by selectively pulling down DNA fragments, which are then aligned to a corresponding reference genome. Areas of interest are identified by their increased fragment density through the use of a peak caller, which typically works by applying some statistical test. Generally, real peaks are enriched to the point of statistical significance, but the occurrence of false positives in peak calls is a well-known problem. There are techniques that mitigate this, such as FDR and IDR thresholds or running multiple replicates for an experiment, and they can remove some false positives. But these solutions aren't perfect and work better in some situations than others. So how can you tell noise from a biological event if they're both statistically significant? Well, you have to visualize your data. The two peaks shown here come from the same track, are shown at the same scale, and have the exact same p-value, but to my eye, the peak on the right looks much better. In fact, several studies have shown that humans are good at calling peaks by sight and can do so reliably and repeatably. Obviously, this is not feasible genome-wide, but we wanted to create a peak caller to address the false-positive problem, and this idea was really critical to our design. So when we were planning our new peak caller, we decided to use peaks labeled by humans to teach a deep learning algorithm what a peak looks like. We got these data from ENCODE, which as you know has a varied and extensive collection of experiments to work with.
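The "statistical test" behind most conventional peak callers boils down to asking whether a window's fragment count exceeds a local background rate. A minimal sketch of that idea, using a Poisson model; this is a generic textbook formulation for illustration, not LanceOtron's or MACS2's actual implementation, and the parameters are assumptions.

```python
# Toy enrichment test: compare the fragment count in a candidate window
# against a local background rate using a Poisson model. A generic sketch
# of what statistical peak callers do, not any specific tool's code.

import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complementary lower tail."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def peak_pvalue(window_count, background_rate, window_size):
    """background_rate: expected fragments per bp outside of peaks."""
    lam = background_rate * window_size  # expected fragments in the window
    return poisson_sf(window_count, lam)

# 30 fragments where ~10 are expected: strongly enriched.
p_enriched = peak_pvalue(30, background_rate=0.01, window_size=1_000)
# 10 fragments where ~10 are expected: indistinguishable from background.
p_background = peak_pvalue(10, background_rate=0.01, window_size=1_000)
```

The point the talk makes is that two regions can both pass such a test with identical p-values while one is clearly noise by shape, which is exactly the gap the learned shape score is meant to close.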
Deep learning has been shown to outperform humans when it comes to things like image classification and pattern recognition. So we applied it to peak calling, and we did so in concert with traditional statistical testing. In this way, LanceOtron assesses more than just the height of a peak: it considers the actual shape of the region. Benchmarking has shown it to be extremely accurate at identifying known peaks, with over 99% accuracy. Visualization is really important to understanding a dataset, so we wanted to have that built in. After uploading your data, your results are mapped to a genome, and histograms of peak stats as well as fully customizable and interactive charts are loaded automatically. We also wanted to have a web interface. This means users have no software to install: when you upload your file, the actual computations happen on Oxford's machine learning cluster. So I'm actually going to do a quick demo showing a track from ENCODE. Down here, this is the link; I'll put it up on Slack when this goes live. This is a track for the histone mark H3K27 on the cell line 22Rv1, which I think is a prostate cancer cell line. There are a number of ways to interact with your data. Down here is basically a genome browser, which functions as you'd expect it to. Typically, the output of a peak caller is a BED file, and the table on the right here functions as an interactive BED file: clicking on a row takes you to that location in the genome browser, and you can filter and sort columns on the fly. You can look at the table with thumbnails, or you can actually just look at the images themselves. On the left here are interactive charts. And all these pieces are connected, so the charts will update the table or images down here, and they also update the genome browser. Everything is interactive and connected. And you may have spotted this already.
But if we take a look at the peak score, which is the score generated by the deep learning algorithm, you can see there are actually quite a few low-scoring regions that you'd probably want to filter out. We can see from the thumbnails that these things are probably not what you want. And maybe you think this is low-resolution data or something like that, but actually it's the exact opposite: this is an extremely good experiment. The peaks shown here were even called from two biological replicates; this is actually an intersection of two peak calls made with MACS2. So I think this just goes to show that peak calling is not a solved problem yet, and I think LanceOtron can be a useful tool for improving your data analysis. There are many more features to explore, but I have to leave it here. Do take a look if you're curious, and be on the lookout for our bioRxiv paper in the coming weeks. There were so many people that have been critical to this project; I want to highlight the Vision grant for funding, my supervisors Steve Taylor and Jim Hughes, as well as my collaborator, Martin Sergeant. Thank you for your time.

Hello, everyone. My name is Christophe Thave, and I'm a PhD student in mathematics in Dr. Steve Bilodeau's lab, in collaboration with Dr. Arnaud Droit, at Université Laval in Quebec City, Canada. Today I present my research project, which deals with the regionalized transcriptional response to glucocorticoid stimulation. Transcription involves the interplay of multiple factors: transcription factors bind to regulatory elements, recruit transcriptional co-regulators, and RNA polymerase II is recruited at core promoters in order to enable transcription. Transcription and chromosome architecture are closely related. The folding of the genome is not random, and we can distinguish several layers of folding. Each chromosome occupies a specific area within the nucleus, called a chromosome territory.
At a smaller scale, we can distinguish compartments A and B, and then we have TADs, which stands for topologically associating domains: domains of preferential chromatin interactions between regulatory elements. Through this project, we explore transcriptional regulation in the context of the 3D genome following a hormonal stimulation. The stimulation is induced by dexamethasone, which diffuses through the membrane and binds to the glucocorticoid receptor (GR); GR then translocates to the nucleus and binds to the DNA in order to activate or repress its targets. We use several types of publicly available sequencing data, including ENCODE ChIP-seq data to identify the binding sites of GR and transcriptional co-regulators, RNA-seq at several time points, and Hi-C data to obtain the positions of TADs in the genome. Our preliminary analysis showed that among differentially expressed genes, only about half of dex-responsive genes have a GR binding site. So we were wondering how dex-responsive genes without any evidence of regulation by GR are regulated. This led us to the following hypothesis: in addition to direct regulation by transcription factors, the position of a gene within the 3D genome is an important determinant of its response. We will first be interested in the ANGPTL4 gene, which is a known activated target of dex stimulation. The tracks represent the binding sites of GR and several cofactors, MED1, BRD4, and CDK9, before and after stimulation. Here, we can observe a gain of GR and cofactors after stimulation in the neighborhood of the ANGPTL4 gene. Further analysis showed that gains and losses of cofactors are correlated across the genome. We then counted the number of gains and losses per TAD and attributed a score to each TAD using the formula at the top right corner. Here, we can observe that TADs are biased towards gains or losses. We are here in a particular TAD that contains several genes.
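The per-TAD score is computed from the counted gains and losses of cofactor binding. The exact formula from the slide is not reproduced in this transcript; a natural choice with the described behavior (scores biased toward one extreme for gain-dominated TADs and the other for loss-dominated TADs) is the normalized difference below, which is an assumption on my part, not necessarily the authors' definition.

```python
# Hypothetical per-TAD activity score from counted cofactor binding changes.
# The normalized-difference form is an assumed stand-in for the formula
# shown on the slide, which this transcript does not reproduce.

def tad_score(gains, losses):
    """Score in [-1, 1]: +1 for all gains, -1 for all losses."""
    total = gains + losses
    if total == 0:
        return 0.0  # no cofactor binding changes in this TAD
    return (gains - losses) / total

activated = tad_score(8, 2)   # gain-dominated TAD
repressed = tad_score(1, 9)   # loss-dominated TAD
```

Under any score of this shape, a TAD's sign summarizes whether the hormonal stimulation regionally recruited or evicted cofactors, which is what the gene-level expression changes are then compared against.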
We can observe a gain of GR and cofactors at the promoter of the CHD16 gene, resulting in an activation of its transcription. However, at the promoter of the RRD gene, we noticed an absence of GR and cofactors, yet RRD expression follows the transcriptional kinetics of CHD16 without any direct regulation by GR. This suggests that RRD expression is induced because of its position in a TAD with only gains of cofactors. This heatmap shows the fold change of differentially expressed genes at one and two hours. This set of genes has evidence of regulation by the transcription factor GR. When annotating those genes with gains and losses of cofactors, we observe a correlation between activation and gains, and between repression and losses of cofactors. When evaluating the cofactor activity of the TAD a gene belongs to, we observed an enrichment of induced genes within activated TADs and of repressed genes within repressed TADs. So there is a correlation between the transcriptional response and the TAD activity for GR-bound genes. We can do the same for genes not bound by GR. Here, we evaluated the changes in expression of genes depending on the TAD activity. We observed that in repressed TADs genes are repressed, and in the same way, induced genes are in activated TADs. Looking at genes not bound by GR, we make the same observation. So the distribution of changes in RNA expression is similar in biased TADs regardless of GR binding. In summary, we saw that GR elicits a regionalized distribution of cofactors, that TADs are biased towards gains or losses of cofactors, and that the position of a gene within biased TADs correlates with the transcriptional response following glucocorticoid stimulation. These results suggest that the position of a gene within responsive TADs is an important determinant of the transcriptional response. Finally, the model we suggest is as follows.
We are in a TAD that contains several genes expressed at a basal level. After stimulation, GR binds in the neighborhood of a gene and recruits cofactors and RNA polymerase II in order to activate the expression of that gene. What we are suggesting is that it is a local accumulation of cofactors, a transcriptional condensate, that allows genes in the same chromatin environment to be activated. I would like to thank all the people involved in this project. I would also like to thank the ENCODE Consortium for making genomic data available to the scientific community, and thank you for your attention.

Hello, my name is Ben Ketladi, and I'll talk about TRNCO, our efficient topologically associating domain-aware regulatory network construction tool. I'll talk a little bit about the goal of the model, the algorithm, and the model that we generate, and then about how we implemented this and used ENCODE data for it. We all know that regulatory networks are powerful tools, and many people use different genomic data to generate these regulatory networks to find differences between developmental time points or between normal tissue and a disease state, and to find driver genes and novel regulators. The common model that many people use is a very linear model: they find enhancers, look at the transcription factor motifs in them, and then build the network using a distance metric on a linear basis. So the closer an enhancer is to a gene promoter, the higher the weight of the transcription factors that drive that promoter. However, we all know that the genome is quite complex. We have nucleosomes, chromatin loops, individual loops that are very variable, and TADs and chromatin compartments, as well as LADs. So what we decided to do here is take TADs and incorporate them into our model.
So what we did is we added TADs, and we looked at gene expression data and enhancer data, choosing H3K27ac as our enhancer mark. We chose these two data types because, one, they're very abundant, and two, they're easily generated and used by many people, and our thought was that we can always add more data to the model. The model, as you can see, looks very similar to the previous slide; however, we have a TAD boundary here in green, and this constrains the model so that gene-enhancer pairs that cross this TAD boundary do not contribute to the weights. That means, as a prime example, this delta transcription factor cannot control itself and create this red loop. The core of the algorithm is taking enhancer locations and calling enhancer expression by log TPM; then we call gene expression and we call motifs on the promoters as well as on the enhancers, and using those transcription factors and the distances within a TAD, we create a transcription-factor-motif-by-gene matrix, and that becomes the whole network. For our test data, we took the heart ENCODE data because it has a wealth of time points and it is well studied in terms of what the potential drivers of cardiac development are. When taking all the heart data across developmental time points, we find that transcription factors show time-point-specific activation; we highlight here FoxS1. For FoxS1, in our model, we're looking at the weights here in the heatmap between each of these nodes, and FoxS1 is driving the expression of these genes with higher weight over time. We can really see this in the inflection time-point graph here, where if we look at each time point compared to the previous time point, these weights really increase around E15.5 and E16.5, and then at zero and eight weeks the weights really drop and go the opposite way, becoming less important as we go into the adult stage.
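The TAD-aware weighting described above can be sketched as a distance-decayed enhancer-to-promoter weight that is zeroed whenever the pair crosses a TAD boundary. The exponential decay form and its length scale are illustrative assumptions; the talk specifies only that weight falls with linear distance and that cross-boundary pairs contribute nothing.

```python
# Sketch of a TAD-aware TF->gene weight: distance decay within a TAD,
# zero across TAD boundaries. The decay function and its 50 kb scale are
# illustrative assumptions, not TRNCO's published parameterization.

import math

def same_tad(pos_a, pos_b, boundaries):
    """TADs are the intervals between sorted boundary coordinates, so two
    positions share a TAD iff the same number of boundaries lies below each."""
    crossed = lambda p: sum(1 for b in boundaries if b <= p)
    return crossed(pos_a) == crossed(pos_b)

def tf_gene_weight(enh_pos, tss_pos, boundaries, decay=50_000):
    """Contribution of one enhancer's TF motif to a gene's promoter."""
    if not same_tad(enh_pos, tss_pos, boundaries):
        return 0.0  # cross-boundary pairs are excluded from the network
    return math.exp(-abs(enh_pos - tss_pos) / decay)

boundaries = [100_000]  # a single TAD boundary at 100 kb
w_in = tf_gene_weight(20_000, 70_000, boundaries)      # same TAD, 50 kb apart
w_cross = tf_gene_weight(90_000, 110_000, boundaries)  # straddles the boundary
```

Summing such weights over every enhancer carrying a given TF motif, for every gene, fills in the transcription-factor-by-gene matrix that constitutes the network.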
We then take a look at this across multiple different transcription factors: GATA4, MEF2A, NKX2-5, and TBX5, which are the canonical transcription factors people look at, FoxS1, which I was just talking about, and SRF. For SRF, we see that its weights are really low in the embryonic stages, but as we hit birth and adult, that's where we see the weights of SRF really changing. We also looked at previous studies that showcased these networks and did a comparison. When we compare, we find large overlaps between our networks and the previous ones, and the 165 targets that we do not find are, on investigation, due to the TAD boundaries that we put in. Our network is also broader when we look at the GO annotations of what we're targeting, and this allows us to investigate further and truly capture the biology. When we look at the whole network at E10.5 versus 8 weeks for GATA4, we really see it driving embryonic tissue development versus circulation. In conclusion: we capture extensive network connections; we identify time-dependent changes in transcription factor interactions and the processes they control; and we have high overlap with already-published networks from previous studies. I'd really like to thank Chris Bennett, who has been a driving force for this project, and everyone who started the project.