Thank you everybody for your patience; I'm going to start now. I'm Giovanna Ambrosini from the Philipp Bucher group, a computational biology group at EPFL. We focus on the analysis of genome sequencing data, as well as the development of web-based tools and databases related to genome structure and transcriptional regulation. Today I will show you a few principles of ChIP-seq data analysis, via a guided, step-by-step tutorial through our two main web platforms: the ChIP-Seq server, for ChIP-seq data analysis, and the Signal Search Analysis (SSA) platform, for motif analysis. Our goal is to get you acquainted with our tools, and we would like to encourage you to use these platforms, because we think they are quite easy to use. They are particularly useful for data exploration, as you can run even quite complex pipelines rapidly and interactively, as you will see in a while. About this workshop: today I will focus on some common biological questions in ChIP-seq data analysis. For each task, I'll try to lay out the biological motivation, explain the underlying methods, and provide step-by-step instructions and guidelines for the interpretation of the results. This is what I call a tutorial-style presentation. We will work with two bioinformatics resources from our group: the ChIP-Seq analysis server, whose link you see here (I also encourage you to go to the web server yourselves), and the Signal Search Analysis server for motif analysis, whose URL you find here. If you want to go more in depth with our tools, we have more workshop material at this link. Today's example is based on an early landmark experiment, from about 10 years ago, that targeted STAT1 in HeLa cells.
As probably most of you know, STAT1, in response to cytokines and growth factors, translocates to the cell nucleus, where it acts as a transcription activator. This ChIP-seq experiment includes two types of data: IP data from interferon-gamma-stimulated HeLa cells, which is the main experiment, and control data from unstimulated HeLa cells. We will be using the main experiment; we will not use the control data in our tutorial today. The data comprise about 15 million sequence reads that have been mapped to the human genome. This was a typical size for a ChIP-seq data set, especially at that time; today's ChIP-seq data sets reach 20 or 30 million sequence reads. Before starting, I would like to show you this figure, which gives a nice summary of the state-of-the-art technologies for chromatin studies. In general terms, a genomic locus can be analyzed by complementary chromatin profiling experiments that reveal different aspects of chromatin structure. As you know, ChIP-seq reveals binding sites of specific transcription factors; DNase-seq and ATAC-seq, which are similar techniques, reveal regions of open chromatin; and MNase-seq identifies well-positioned nucleosomes. Of course, each technique implies a particular kind of data representation that is specific to the technique itself, and our tools have been designed to cope with these different varieties of data; this is one of their main guiding principles. Here I show a schematic view of the ChIP-seq process that leads to the input data for our analysis. As you know, the ChIP process enriches for specific cross-linked DNA-protein complexes using an antibody against the protein of interest. After size selection, all the resulting DNA fragments are sequenced simultaneously using a genome sequencer. The sequence file that we get back, basically a FASTQ file, needs to be aligned to a reference genome.
Subsequently, regions of protein-DNA binding are identified by peak calling. So our tools deal with read alignment data that have already been mapped. Here I show multiple fragments, bound by the same protein and originating from the same chromosomal location, as depicted here. The short sequence reads mapped to the genome are the short sequences highlighted by the green and red arrows. Given the double-stranded nature of DNA, as Eric already mentioned yesterday, the reads representing one binding site will map on the plus and the minus strands, respectively. This leads to a read alignment histogram for both strands of the chromosome, shown here by the green and red dots. Note that reads mapping to the plus and minus strands accumulate upstream and downstream of the protein-bound complex, respectively. Here we have a view of such a read alignment distribution on DNA, at a specific STAT1-bound locus in the UCSC Genome Browser (I suppose most of you know the UCSC Genome Browser). This locus is the promoter region of a known STAT1 target gene, ICAM1. We see the plus-strand reads in green and the minus-strand reads in red. So how do we represent this type of read alignment data? We use a simple format that we call SGA, which stands for Simple Genome Annotation. SGA is a BED-like format, as you can see: a tab-delimited text file with five obligatory fields. The first is a sequence identifier that uniquely identifies the species and the genome assembly; for that, we chose to use NCBI RefSeq identifiers, so as to avoid mix-ups between different species and genome assemblies.
Then we have a feature field, a short string that describes the experiment; the position of the read; the strand; and the counts, i.e. the number of reads that have been mapped to this particular position. One important property of the SGA format is that, for computational efficiency, SGA files need to be sorted by sequence identifier, position, and strand. This is why our programs can process entire genomes in less than a minute. To be a bit more technical, here are the main differences between SGA and BED format. On top we have two SGA lines, and below them the corresponding BED lines. As I said before, SGA is a single-position format: we represent only the 5' end of the short sequence reads. BED, in contrast, indicates a region, so you have start and end coordinates, as shown here. Keep in mind that BED coordinates are zero-based, whereas SGA coordinates are one-based. This means that for a read on the plus strand, when we translate its BED start to SGA, we have to add one base pair, as shown here. For a fragment on the minus strand, we represent the 5' end on the minus strand, which is the BED end coordinate (the green coordinate here); in that case we keep the coordinate as it is. As I said, SGA files are required to be sorted by sequence identifier, position, and strand, and both formats are tab-delimited. So which questions do we ask when we interrogate ChIP-seq data? Of course, we want to map binding sites genome-wide; this is the main goal. We also generally want to benchmark these peak lists, to characterize them.
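To make the coordinate conversion just described concrete, here is a minimal sketch (a hypothetical helper, not the server's actual converter) that turns one BED read into an SGA line:

```python
def bed_to_sga(seq_id, start, end, strand, feature, count=1):
    """Convert one BED read (0-based, half-open) to an SGA line.

    SGA stores only the 1-based position of the read's 5' end:
    - plus strand:  the 5' end is the BED start, so add 1;
    - minus strand: the 5' end is the BED end, which already equals the
      1-based coordinate of the last base, so keep it as is.
    """
    pos = start + 1 if strand == '+' else end
    # SGA fields: sequence id, feature, position, strand, counts
    return f"{seq_id}\t{feature}\t{pos}\t{strand}\t{count}"

# A plus-strand read on BED interval [999, 1035) has its 5' end at SGA
# position 1000; a minus-strand read on the same interval, at 1035.
print(bed_to_sga("NC_000001.10", 999, 1035, '+', "STAT1"))
print(bed_to_sga("NC_000001.10", 999, 1035, '-', "STAT1"))
```

The sequence identifier and feature name here are only illustrative; a real SGA file would also have to be sorted by sequence identifier, position, and strand before use.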
We use motif enrichment to find whether the peaks are enriched in a particular binding site, or set of binding sites, depending on the binding mode of our transcription factor; this is an important step. We also want to study the chromatin context of our peak regions, which we do by correlating histone modification profiles around our regions of interest, the IP regions. And of course, we may also want to see whether these genomic regions fall in regions conserved across genomes, because this may reveal important biological functions of our protein of interest. The ChIP-Seq server is a front end to the ChIP-Seq tools behind the web interface, which are basically a collection of C programs and a few Perl scripts designed, as I mentioned at the very beginning, with a few principles in mind. The tools are simple, easy to understand even for non-specialists, and they perform very basic tasks, as we will see; I will describe a few of them in detail today. They are fast algorithms, as I said, and they implement generic methods so as not to be restricted to ChIP-seq data. The web interface has a modular design as well, in that the output of one task can be used as the input of the next task; we will see that in a while. So, as you will see, we can really create quite complex pipelines with just a few mouse clicks. The server also gives access, and this is one major feature of our tools, to a very large collection of selected public data: today, more than 400 experiments corresponding to more than 30,000 data samples. We provide them as pre-processed data, already in read alignment (SGA) format, that can be browsed, viewed, and explored, for teaching and learning purposes, and your own data can be combined with these public data.
This is also very important for reproducible research: if you want to reproduce a figure in a paper, it is very convenient to have the data already accessible. We support uploads in several formats, the most common ones, BAM and SGA, which is our format. Another important feature is high interoperability: one of our guiding principles is being highly interoperable with other tools, both from our group and from external resources; we'll see that in a while. This is just an extract from a survey that we did when we published our tool, aimed at comparing similar resources in the field. We focused on resources that were available on the web at that time, such as Nebula and Cistrome; these are mainly Galaxy-based resources accepting read alignments in BAM and related common formats. Here I would like to highlight the performance of our peak caller when scanning the entire human genome (the hg19 assembly) with default settings: our ChIP-seq peak caller takes less than a minute, whereas the other tools take fifteen minutes or more. This is how the ChIP-Seq portal looks, and all our interfaces have more or less the same look. From the left-side menu you can access the tools; in the main frame we have a brief description of each tool; and from the left-side menu you can also access other tools and databases from our resources, as well as tutorials and documentation. So, what we will do today: this is a flowchart of a typical ChIP-seq analysis. Specifically, as I said, we will start from a read alignment file that is already present on our server, accessible via the menu-driven mechanism of the interface, and we will do read shifting and peak calling.
Once we have our peak list, we will see how to do quality assessment of the peak list by motif enrichment analysis, and how to explore the genomic context of our IP regions by doing correlation analysis with histone modification profiles. So let's start with read shifting and peak calling. As was already mentioned yesterday, read shifting is typically a pre-processing step where we shift the reads on the plus strand forward, toward the center of the bound fragments, and the reads on the minus strand backwards toward the center, in order to increase the resolution. We thus get a read distribution peaked at the center, at the protein binding locus. We have ChIP-Center, our ChIP-seq read-shifting tool, both as a standalone application (though this is rarely used) and, most commonly, as the input data processing option on our web interfaces. Read shifting is illustrated here: if you recall the previous read alignment distribution at the ICAM1 promoter, the promoter of one of the STAT1 target genes, we had the green reads on the plus strand and the red reads on the minus strand; the black reads represent the centered, or shifted, reads, whose distribution coincides with the STAT1 binding site. Of course, by shifting we lose the orientation, so reads that are shifted or centered are unstranded. Next we do peak calling: we want to find how many binding sites there are and where they are located. The basic idea of any peak caller is to identify regions in the genome where we find more sequencing reads than we would expect to see by chance. Our method, as I said before, implements a very simple approach which works as follows. It takes as input the genome-wide shifted read distribution for one genomic feature, the STAT1 IP data. The output is a list of peak center positions, with read counts as a metric for measuring peak strength.
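The read-shifting step described above can be sketched in a few lines (a toy illustration with an assumed tuple-list input, not the ChIP-Center implementation):

```python
def shift_reads(reads, shift=75):
    """Center 5'-end read positions on the presumed binding site.

    reads: iterable of (position, strand) pairs, 1-based 5' ends.
    Plus-strand reads move downstream (+shift), minus-strand reads move
    upstream (-shift); the output is unstranded, sorted positions.
    """
    return sorted(pos + shift if strand == '+' else pos - shift
                  for pos, strand in reads)

# Two reads flanking a binding site collapse onto the same point:
print(shift_reads([(1000, '+'), (1150, '-')]))  # [1075, 1075]
```

The shift of 75 assumes an average fragment size of about 150 base pairs, as in the STAT1 example.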
In our case, the number of reads is counted in sliding windows of fixed width, typically 200 or 300 base pairs. We gain speed by considering only those windows that have at least one read at the center position. We slide across the genome and, for each such window, we compute the cumulative read count, and we select the center positions of those windows whose cumulative read count is greater than a given peak threshold and which are, in addition, local maxima within a vicinity range; basically, we merge peaks that fall within a given range. Optionally, we can refine the peak center positions: instead of reporting the center of each window, we report the arithmetic mean of the read positions within the peak region, so as to give a more precise peak location. This is the ChIP-Peak input form. I would now like to go to the web page itself, so I can show you how to operate it; let me try to do that. So this is the ChIP-Peak online analysis module. Here, as I said before, on the left-hand side you select your data sets: you can of course upload your own data, or you can select available data sets that we have pre-processed from publicly available experiments. We have data for several genome assemblies and species, more than 15 species. You select the data type: we have ChIP-seq data, already pre-processed ChIP-seq peaks, other types of pre-processed data, collections from the ENCODE consortium, and so on. Then you select your experiment and your sample in particular; so we select our STAT1 data. Then comes the pre-processing step for read shifting.
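A toy version of this window-based peak calling might look like the following (a simplified greedy selection of local maxima with made-up parameter names; the real ChIP-Peak program is far more efficient):

```python
from collections import Counter

def call_peaks(positions, window=300, threshold=10):
    """Report (center, count) for windows exceeding the read-count threshold.

    positions: shifted, unstranded 1-based read positions.
    A window of the given width is centered on every position carrying at
    least one read; windows reaching the threshold are accepted strongest
    first, and any candidate within the vicinity range (here half a window)
    of an already accepted peak is merged away.
    """
    counts = Counter(positions)
    half = window // 2

    def score(center):  # cumulative read count inside the window
        return sum(n for p, n in counts.items() if abs(p - center) <= half)

    candidates = [(score(c), c) for c in counts]
    peaks = []
    for s, c in sorted(candidates, key=lambda x: (-x[0], x[1])):
        if s >= threshold and all(abs(c - pc) > half for pc, _ in peaks):
            peaks.append((c, s))
    return sorted(peaks)

# 12 reads at 1000 plus 3 nearby at 1100 give one merged peak; the 2
# isolated reads at 5000 fall below the threshold.
reads = [1000] * 12 + [1100] * 3 + [5000] * 2
print(call_peaks(reads))  # [(1000, 15)]
```

The optional refinement step mentioned above would additionally replace each reported center by the arithmetic mean of the read positions inside the peak window.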
We choose 75 as the centering distance, because we previously estimated, by cross-correlation of 5' and 3' tags (as Eric explained yesterday), that our average fragment size is around 150 base pairs. So we shift the plus- and minus-strand reads by 75 base pairs downstream and upstream, respectively. That is the pre-processing step. Here on the right you have the peak detection parameters. As I said, you define the window width; here we use fixed-size windows. The vicinity range is typically set equal to the window width, so we merge local maxima within one window. And here you set the peak threshold, which can be set in two ways. You can specify it in absolute read counts, but this is of course less intuitive: how do you know how many read counts you should require to call a region significant compared to the background mean? To make this easier, we also report an average background read density. Alternatively, you can set your threshold as a relative enrichment factor: you give a fold change with respect to the background read density, say 20 times or 10 times the background. Generally, you can start with 10-fold, which is a typical threshold that maximizes sensitivity for a ChIP-seq experiment, but it depends very much on your read coverage; we will see later how to choose reasonable peak thresholds. So we... Giovanna, can I ask you a question, sorry to interrupt? Yes, of course. In the additional input data options on the left, you have this little repeat mask checkbox. If we don't check it, reads on repeats are kept, and if we check it, they are not? Exactly: if you check it, you mask the repeated regions, so you extract IP regions only in non-repeated regions, let's say; if you don't check it, repeats are kept as well. That must speed up the program a lot, I guess.
Yes. Typically you want to use that, for instance, if you then want to do binding motif discovery: you mask repeated regions because there you might find motifs that are not really the binding motif of your protein. Thanks. So you submit, and hopefully it will take less than a minute, as I claimed before; that was the case, I guess. Here is the ChIP-Peak output page. It reports the number of peaks identified by ChIP-Peak, in our case about 6,500, and gives the peak files in several formats, namely BED, SGA, and other formats used by our tools. Then you have links to external tools: GREAT, for peak annotation, as we will see in a while, and the UCSC Genome Browser, to view your data in the browser. You can optionally lift over your peak list to other genome assemblies, if you then want to compare lists or work with other assemblies. You can also extract sequences around your peaks; this is useful, for instance, for subsequent motif discovery with MEME (I don't know whether you know MEME) or other motif discovery tools to which you upload your FASTA sequences. And then, most importantly, you have direct links for downstream analysis, as we will see in a while, to our tools for motif analysis and genomic context analysis. Let's now look at the UCSC view, just to show how it looks. UCSC is a bit slower... no, it's fast enough, because I have used it before and UCSC caches your sessions. So here are the STAT1 peaks detected by ChIP-Peak, shown at UCSC. We show the BED representation, so here you have the region, a 300-base-pair region, and we show the peak density, the peak height, which is basically the peak strength, in WIG form.
So we show both representations, BED and WIG, for each peak. The BED track, as I said, shows the entire region of the peak, and the gray shade correlates with the peak height: the stronger the peak, the darker the color. You can of course zoom in and out to see more peaks, or focus on a particular region. Here you also see histone marks, and you can switch on other ChIP-seq tracks if you want, for example ChIP-seq data from the ENCODE consortium, just to compare your data with publicly available data. As I said before, you choose your threshold more or less by trial: you start with a reasonable enrichment factor. One way to see whether your peak list is reasonable, and to explore the functional meaning of your IP regions, is to annotate them. A nice tool for doing that is GREAT, which annotates cis-regulatory genomic regions using the annotations of nearby genes, such as GO terms. First you associate your regions with nearby genes, and then you try to associate your regions with GO terms. You then assess the significance of each GO-term/genomic-region association with a binomial model, to establish that these associations do not arise just by chance. As output, GREAT first gives histograms showing the genomic distribution of your regions, by genomic compartment. Here you see that our STAT1 regions do not tend to lie mainly at promoter regions, but tend to be more distal; they are basically in enhancer regions. And here you have a nice table of GO biological process terms that are enriched in your peak list.
The terms are ranked by the binomial p-value, which gives the probability that your association arises just by chance, given that genes with this specific annotation occur across the genome and your regions could be associated with these GO terms by chance alone. This is the metric used by default; you can choose other metrics, such as the binomial fold enrichment, and others. And this is the false discovery rate, basically the adjusted binomial p-values. You can also display this table as a bar chart, again showing the binomial p-value ranking. As you can see, the top-ranked GO biological process terms are related to regulation of inflammatory response and positive regulation of leukocyte activation, which is consistent with the biological role of STAT1. Let me go back to the presentation. This is what I have just shown, and you have it on the slides: how GREAT works by predicting functions of cis-regulatory regions. You have your regions; you first associate your input regions, both proximal and distal, with putative target genes, and then with ontology annotation terms, which are depicted here by the big A. You then look at the enrichment of ontology annotations in your IP list, and to assess the significance of this enrichment you use a binomial probability, which calculates how likely this association is to occur by chance. Once we have our peak list, we would like to do some quality assessment, because you can of course vary your threshold.
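To make the binomial test concrete, here is a rough sketch with made-up numbers (this mirrors the idea behind GREAT's test, not its exact implementation): if a fraction f of the genome is covered by the regulatory domains of genes carrying a given GO term, and k of your n peak regions land in that fraction, the p-value is the binomial tail probability.

```python
from math import comb

def binomial_pvalue(n, k, f):
    """P(X >= k) for X ~ Binomial(n, f): the probability that at least k
    of n regions hit the annotated fraction f of the genome by chance."""
    return sum(comb(n, i) * f**i * (1 - f)**(n - i) for i in range(k, n + 1))

# Hypothetical example: 100 peaks, 10 of them landing in a term's domains
# covering 2% of the genome -- far more than the 2 expected by chance.
print(binomial_pvalue(100, 10, 0.02))
```

A small p-value here says the observed hit count is very unlikely under the chance model, which is what the ranking in the GREAT table reflects.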
Of course, if you set a low threshold you will have many peaks and many false positives; you would like to reduce the number of false positive peaks without missing too many true binding sites. How do you do that? We don't compute statistics on the peak list itself, as is done, for instance, in CRUNCH; instead, we give you tools to assess it. One way of benchmarking your peak list is motif enrichment: you do a motif enrichment test, and you annotate your peaks to find out whether you pick up interesting biological terms and functions. The rationale behind motif enrichment, as was said yesterday, is that if your protein binds directly to DNA, your peak list should contain binding sites for your protein, for its transcription factor family, or for other transcription factors if your protein has cofactors. In short, your peak list should be enriched in the binding motif of your protein. How do we test that? Today I'll show you how using SSA; of course, there are other tools in the bioinformatics world that do similar things, such as CentriMo. Our SSA tools were originally developed for analyzing eukaryotic promoters, many years ago, by Philipp, with the purpose of characterizing sequence motifs that occur at specific distances from physiologically defined sites in DNA sequences. I would like to highlight that with these tools we look for motifs that occur at constrained distances from functional sites, like our IP regions in this case, or transcription start sites, and so on. This is depicted here, where you see sequences aligned with respect to a functional site, which could be a transcription start site or the peak positions from a ChIP-seq experiment.
The kind of motif pattern that we want to find is represented by the red, green, and blue boxes in the figure; technically, we call this type of pattern locally overrepresented sequence motifs. This is the SSA web interface. As you can see, it looks very much like the ChIP-Seq web interface: on the left side you have access to the tools and to the documentation, and there is a brief description of each tool of the platform. Today we will use OProf, which stands for occurrence profile, for our motif enrichment analysis. How does OProf work? OProf takes as input the functional position set, the file that holds the positions of your peaks or of your feature of interest, and extracts the sequences around these peaks. It also takes as input the sequence motif, a binding motif that can be described as a consensus sequence or as a position-specific scoring matrix. On the web, we offer models from many motif collections, such as HOCOMOCO, which was mentioned yesterday, SwissRegulon, and JASPAR, a few of the most common motif collections. Once we have the binding motif and the set of DNA sequences aligned with respect to the peaks, we scan these sequences in a sliding window to find motif occurrences at each position. The output is a graph that shows the occurrence frequency of your signal as a function of its position relative to the functional site, for instance the peak center. So how do you do that in practice? I'll go back to the web interface and close some windows. Okay, here we are at the output page of ChIP-Peak. As I said, we have links to downstream analysis; here you have a direct link, what I call a direct navigation link, to OProf.
You will be presented with the OProf input page, which already has your data set uploaded; this is your peak list, so you don't need to do anything with your data. We just have to adjust the parameters for the analysis. Here you define the region you want to explore, plus/minus 500 base pairs around your peaks, and the window size: as I said, a window is used to count the occurrences of your binding motif, together with the window shift at each step. You can optionally switch on sequence shuffling, just to compare against your background. As I said, you can use a consensus sequence for your motif, and this is the consensus sequence for the STAT1 binding site; or you can describe your motif with a position-specific scoring matrix, which you can pick from the motif collections. If we use the consensus, we screen the sequences around the peak regions, the IP sequences for STAT1, for the binding site. And as you can see, there is a peak at the center position, meaning that the binding motif falls at the center of your regions. You see a peak frequency of about 40%, which means that approximately 40% of your peak regions contain the binding motif for STAT1. And this curve, from the shuffled sequences, is the background distribution. As I said before, you can pick your motif from different motif libraries: we have JASPAR, HOCOMOCO, the SwissRegulon collection that Eric mentioned yesterday, and others for other species as well, such as Drosophila and Arabidopsis motifs. You can of course repeat this with several peak thresholds, to assess different peak lists using the same strategy. And you will see that different peak lists, from different thresholds, end up with different enrichments.
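The scanning that OProf performs can be illustrated with a small sketch (a toy that turns a plain consensus into a regular expression; the actual tool supports full IUPAC consensus sequences and weight matrices with cutoffs):

```python
import re

def occurrence_profile(sequences, consensus):
    """Fraction of sequences with a motif match starting at each offset.

    sequences: equal-length strings, all aligned on the peak center.
    consensus: motif with N as a wildcard, e.g. a GAS-like STAT1
    consensus such as TTCNNNGAA.
    """
    pattern = re.compile(consensus.replace('N', '.'))
    width = len(sequences[0]) - len(consensus) + 1
    return [sum(1 for s in sequences if pattern.match(s, off)) / len(sequences)
            for off in range(width)]

seqs = ["GGTTCAAAGAAC",   # motif at offset 2
        "AATTCGGGGAAT"]   # motif at offset 2
print(occurrence_profile(seqs, "TTCNNNGAA"))  # [0.0, 0.0, 1.0, 0.0]
```

A peak in this profile at the offset corresponding to the sequence midpoint is exactly the central enrichment seen in the OProf plot for STAT1.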
As expected, the number of peaks is inversely correlated with the peak threshold: the more stringent you are, the fewer peaks you get, and the higher the enrichment for the binding site. Let's now try to explore the genomic context of our STAT1 peak regions. What does that mean? By exploring the genomic context, we want to see whether our regions fall preferentially in promoter regions, enhancer regions, or regions carrying repressive marks; in short, which kind of transcriptional activity is associated with our protein. How do you do that with our tools? You use another tool, ChIP-Cor, to correlate your regions of interest, which we call the reference or anchor regions, with other profiles, the target features. Today we choose histone modifications; we will use three types: H3K27 acetylation (acetylation of lysine 27 of histone H3), which is a typical mark of active enhancers; H3K4 trimethylation, which is an active promoter mark; and H3K27 trimethylation, which is a typical repressive mark. As target features, we chose histone modification profiles from ENCODE experiments targeting these histone modifications in several cell lines, and in particular HeLa cells, though not interferon-gamma-stimulated ones. So let me now introduce the second tool of the ChIP-Seq server, the ChIP-Cor correlation tool. This tool is very useful for several purposes, including quality control of your data. It is very often used as a prior step, before peak calling, for the cross-correlation analysis between 5' and 3' tags that estimates the average fragment size. You can also estimate the signal-to-noise ratio by looking at the read density distribution across the entire genome.
You can also use it to generate nice aggregation plots. "Aggregation plot" is a technical term: an aggregation plot shows the distribution of a particular genomic feature, here for instance a histone modification, relative to a specific anchor point, which can be your peak regions, a transcription start site, or another set of genomic regions of interest. You basically correlate the positions of two features. I will explain how to build this aggregation plot with a small example. Here you have your genome, the human genome, and a chromosome. The reference features, your peaks, are these vertical red lines, whereas the target features are the orange boxes. You define the correlation distance, the region that you want to study around each reference point, and then you proceed. I will illustrate with four regions, but this is done across the entire genome. The first region does not contribute to the aggregation plot, because there is no target feature around it. For the second region you get a box counted at a specific position, and so on; the third and fourth regions contribute to this part of the plot. You end up with a nice aggregation plot that represents the abundance of your target feature relative to your reference (anchor) feature. Again, let's use ChIP-Cor directly on the web, starting from our peak list output. We can of course increase the number of peaks by lowering the relative enrichment factor, as I said before; in this case we obtain more peaks. And as I said, we really encourage you to interact with the web interface in this way, to explore your data, comparing them with other data and other tools.
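The aggregation-plot construction just illustrated can be sketched as follows (a toy single-chromosome version with made-up inputs; ChIP-Cor works genome-wide and much faster):

```python
def aggregation_counts(refs, targets, distance=1000, bin_width=100):
    """Histogram of target positions relative to reference (anchor) points.

    refs, targets: 1-based positions on one chromosome.
    Returns {bin_start_offset: count} for offsets in [-distance, distance);
    the aggregation plot is just this histogram drawn as a curve.
    """
    bins = {}
    for r in refs:
        for t in targets:
            offset = t - r
            if -distance <= offset < distance:
                b = (offset // bin_width) * bin_width
                bins[b] = bins.get(b, 0) + 1
    return bins

# One anchor at 1000: the targets at -50 and +10 relative to it
# contribute; the distant one at 5000 does not.
print(aggregation_counts([1000], [950, 1010, 5000]))  # {-100: 1, 0: 1}
```

Summing contributions over all anchors, as the loop does, is what turns thousands of noisy individual loci into one smooth average profile.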
The idea is that you really explore your data, and this can be done with a few mouse clicks. So here we said we want to study histone profiles around our peaks, so we go to ChIP-Cor. Again you are presented with the ChIP-Cor interface: on the left side your data set is already uploaded to the page, and here you have the analysis parameters typical for ChIP-Cor. You have to define your correlation distance, the correlation region. Since histone signals are rather spread out, we can choose a larger region than the default of plus/minus 1 kb; here we use plus/minus 5 kb. The window width you can increase just for speed. I haven't commented on the count cutoff yet; this is generally set to one. Remember that the SGA format has a counts field that represents the number of reads mapped to that particular genomic location. Generally we set the cutoff to one, so we keep each position at most once, because we want to filter out, as Eric mentioned yesterday, PCR or other technical artifacts, regions where single positions attract several thousands of reads. So you always set a count cutoff; it can be one, or ten, it's not necessarily one. And here you have the normalization. When you scan for your target feature around the reference feature regions, you count reads, and you can show the results as raw counts, as read density, or as fold enrichment compared to the genome-wide background density level (the global option). Here we choose the global normalization. And here you select your target feature: you have two menus, one where you define the reference feature and one for the target feature.
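The count cutoff and the global normalization described above can be sketched as follows. This is a hedged illustration of the concepts, not ChIP-Cor's actual code: the cutoff caps the reads kept at any single position (filtering PCR pile-ups), and global normalization expresses each bin's count as fold enrichment over the genome-wide background density.

```python
# Two ChIP-Cor-style preprocessing steps, sketched in plain Python.

def apply_count_cutoff(position_counts, cutoff=1):
    """Cap per-position read counts at `cutoff` (think of the SGA counts field);
    positions inflated by PCR artifacts are kept at most `cutoff` times."""
    return {pos: min(c, cutoff) for pos, c in position_counts.items()}

def fold_enrichment(bin_counts, total_reads, genome_length, bin_size, n_anchors):
    """Convert raw aggregated bin counts to enrichment over the count expected
    from a uniform genome-wide background density."""
    background_density = total_reads / genome_length        # reads per bp
    expected_per_bin = background_density * bin_size * n_anchors
    return [c / expected_per_bin for c in bin_counts]

counts = {100: 1, 250: 5000, 300: 2}     # position 250 looks like a PCR artifact
print(apply_count_cutoff(counts, cutoff=1))   # every position kept once
```

With 1M reads on a 1Gb genome (density 0.001 reads/bp), a 100 bp bin aggregated over 100 anchors expects 10 reads, so an observed count of 20 reports as 2-fold enrichment.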
Again, the genome assembly has to be the same; you cannot mix up different assemblies. You select the ENCODE ChIP-seq collection this time as the dataset, because we said we want to analyze histone modification data from ENCODE; this is the set. You scroll, you see several sets, and you choose, as we said, the acetylation mark to begin with. Here you shift by 70, and you submit. This will take a bit longer, because it's a large file, but not much longer. Why did I choose 70 as the shift? Because a histone modification IP pulls down DNA fragments that are wrapped around nucleosomes, and these are typically 140 to 150 base pairs in length, so half the fragment size is about 70. Typically there are default centering/shifting distances you can choose. It was quite fast. So here you get your histone profile for H3K27 acetylation around the STAT1 peak regions. What can you infer from that? Your STAT1 peak regions, which are roughly 500-base-pair regions or less, are enriched in H3K27 acetylation marks: they are flanked by nucleosomes carrying this histone modification. You see a valley in the middle of your peak, which makes sense, because this is probably a nucleosome-free region; given that the STAT1 protein is bound, it evicts a nucleosome. And compared to background you see an enrichment of roughly 7-fold in this histone modification mark. You can do exactly the same with H3K4me3, the typical active promoter mark. So this is nice; this is what ChIP-Cor does, this aggregation plot. Of course this is a profile for your histone mark averaged across the entire genome, so you will see an average profile; not every site, of course, has the same nucleosome organization.
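The shift-by-70 logic is simple enough to write down. A minimal sketch, assuming tags are represented as (position, strand) tuples (that representation is my assumption for illustration): each 5' tag is moved half the expected fragment size toward the fragment center, so plus- and minus-strand tags from the same nucleosome meet near its dyad.

```python
# Recenter 5' sequence tags by half the expected fragment size (~140-150 bp
# for nucleosomal DNA, hence the default shift of 70).

def shift_tags(tags, shift=70):
    """Move each 5' tag toward the fragment center:
    + strand tags move forward, - strand tags move backward."""
    return [(pos + shift if strand == '+' else pos - shift, strand)
            for pos, strand in tags]

tags = [(1000, '+'), (1140, '-')]   # a read pair flanking one nucleosome
print(shift_tags(tags))             # both tags now meet at the dyad, ~1070
```

Without this shift, the plus- and minus-strand signals would appear as two peaks offset by the fragment size instead of one peak at the true binding position.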
But because we scan the entire genome to create this aggregation plot, it is an average profile. And this is the profile for H3K4me3. Here you see the enrichment is much lower, around three- or four-fold. The shape is basically the same, meaning that you always have this valley where the STAT1 protein is bound, most likely at the peak center, and you have some degree of enrichment for H3K4me3. And you can do the same with the repressive mark. We don't have it here now, but I did it some time ago, so I will show you the plot on my slides. Going back: this is what we had just seen before. If you do it for the repressive mark, you see basically a flat line. As I said, these results suggest that STAT1 primarily binds to regions that are already in an active chromatin state, because, remember, we are looking at HeLa cells that are not stimulated. So it looks like the binding sites are already in an active chromatin state, and STAT1 binds primarily to enhancer regions, but to a lesser extent also to promoter regions. And you have this typical valley at the peak position. You can also ask, exploiting the same type of data, whether these regions are also active in other cell lines. For that, using the same procedure, which I will not repeat but you can do yourself, you pick other cell lines. We chose three other cell lines for our tutorial: an embryonic stem cell line, the lymphoblastoid cell line GM12878, and the cancer-derived cell line K562. And as you can see from these histone profiles for the acetylation mark, there is a substantial degree of tissue specificity: not all tissues are in the same chromatin state as the HeLa cells.
So I said before that with ChIP-Cor we get an aggregation profile for your histone mark, or for your target feature more generally, that is averaged across the entire genome, because you pool the counts from all the regions into one profile. If you want a different kind of picture, if you want to see specific patterns, subsets of regions that behave differently, you use the ChIP-Extract tool. It does a similar thing to ChIP-Cor: it is a correlation tool, and it scans the reference feature regions for the abundance of a target feature, but instead of drawing an aggregation plot, it gives you the results in tabular form, one row per reference feature. I will show you that. In this table, the rows represent the individual reference anchor points, each of these regions, and the columns represent the target feature read counts or occurrences (depending, again, on the type of normalization you choose) at specific distances relative to the reference feature. Let me illustrate again with these few regions. For the reference feature at this position you get this row, representing the target feature abundance, the read counts, across the correlation region. Same thing here: the target feature is here, so you have reads at this position, and so on. You end up with this tabular output, which can easily be represented as a heat map. And as you can see already from here, the heat map gives you an idea of the proportion of regions showing a given pattern. These are your reference regions, and the heat map can be ranked; here the regions are ranked according to their similarity to the overall pattern, the aggregation profile.
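The ChIP-Extract output just described, one binned row per reference region, plus the ranking by similarity to the overall pattern, can be sketched as follows. This follows the description above, not the server's actual code; the similarity score used here (dot product with the column-sum aggregate profile) is my illustrative choice.

```python
# Per-region count matrix (ChIP-Extract style) and similarity ranking.

def region_matrix(anchors, targets, window=500, bin_size=100):
    """One binned count row per anchor (reference region)."""
    n_bins = 2 * window // bin_size
    rows = []
    for a in anchors:
        row = [0] * n_bins
        for t in targets:
            offset = t - a
            if -window <= offset < window:
                row[(offset + window) // bin_size] += 1
        rows.append(row)
    return rows

def rank_by_overall_pattern(rows):
    """Order rows by agreement with the aggregate (column-sum) profile,
    i.e. the same profile ChIP-Cor would plot."""
    aggregate = [sum(col) for col in zip(*rows)]
    score = lambda row: sum(r * g for r, g in zip(row, aggregate))
    return sorted(rows, key=score, reverse=True)

rows = region_matrix([1000, 5000], [1100, 1150, 5120, 9000])
print(rank_by_overall_pattern(rows))   # strongest-signal region ranked first
```

Summing the ranked rows column-wise recovers exactly the ChIP-Cor aggregation profile, which is why the heat map and the aggregation plot are two views of the same data.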
This is the aggregation profile you would get from the ChIP-Cor analysis, and here you see the proportion of regions that show this type of pattern, that have a similar pattern. You will see that there are regions that do not show this pattern at all, or even show the opposite, or other patterns, but overall the majority shows this one. Of course, when you create this table you can bin your results, your read counts, in different ways: a larger bin size produces, as you can see here, smoother pictures, but the idea is always the same. So let's now use ChIP-Extract on the web, again with our histone marks. We go back to the ChIP-Peak output page and we go to ChIP-Extract, similar to what we did before for ChIP-Cor. Same thing: we are presented with the ChIP-Extract input page, with the data already uploaded. Again you choose your analysis parameters: here we can increase the correlation distance and the bin size, and here you select again, as before, your target data set from the ENCODE ChIP-seq collection, histone modifications, the H3K27 acetylation mark. You again center, and here you have a few options for ordering the heat map produced from this table: by resemblance to the overall pattern, as I mentioned before, so you rank your peak regions based on their similarity to the overall pattern; by other metrics such as the center of gravity, basically to see whether you tend to have specific patterns that are asymmetric, shifted from left to right, for instance; or you can use k-means clustering to cluster your peak regions according to their target feature profiles.
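The k-means clustering option mentioned above groups regions whose target-feature profiles look alike. Here is a toy, self-contained k-means over profile rows (a sketch, not the server's algorithm, with a deliberately naive initialization): regions with similar patterns end up in the same cluster, which can then be exported for separate downstream analysis.

```python
# Toy k-means over per-region profiles.

def kmeans(rows, k=2, iters=20):
    """Assign each profile row to the nearest of k centroids (squared
    Euclidean distance), recomputing centroids each iteration."""
    centroids = [list(r) for r in rows[:k]]          # naive init: first k rows
    assign = [0] * len(rows)
    for _ in range(iters):
        for i, row in enumerate(rows):
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
            assign[i] = dists.index(min(dists))
        for j in range(k):
            members = [rows[i] for i in range(len(rows)) if assign[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Two "flanking-nucleosome" and two "central-signal" toy profiles separate cleanly.
profiles = [[5, 0, 5], [4, 0, 4], [0, 6, 0], [0, 5, 0]]
print(kmeans(profiles, k=2))   # → [0, 0, 1, 1]
```

In practice you would use a robust library implementation (for example scikit-learn's KMeans) with proper initialization; the point here is only the principle of clustering regions by profile shape.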
We use the default, and this is, as I said, very useful because it gives you a more in-depth idea of how your IP regions are characterized, the heterogeneity, if you like, of your data. If you want to dig in, and you use clustering, you can select subsets of regions that share a particular histone profile, so you can separate regions that are probably located in enhancers from those located in promoter regions, and so on, and then do downstream analysis: run motif discovery on the peaks, extract regions around promoters, or correlate with transcription start sites to study the dynamics of your peaks. So this is the heat map for the H3K27 acetylation mark: you see there is a substantial fraction of peaks that carry this mark, and this is exactly the aggregation plot we saw with ChIP-Cor. You can of course repeat the same exercise with the H3K4me3 histone mark, and if you do that (I will show you the results directly), you get a picture where you see that the proportion of peaks carrying the H3K27 acetylation mark is larger than the proportion carrying the K4 trimethylation, the promoter mark, consistent with what we have seen before. And with that, I would like to end this presentation. First I would like to acknowledge our group: Philipp, for the nice discussions, his guidelines and his patience; and Rene and Ruhayda, former collaborators. Rene, who left the group two years ago, implemented a great deal of the MGA data repository, the curated repository of public data sets, and Ruhayda manages the quality control of our web pages, which is very important as well. And last but not least, thank you very much for your attention, and please don't hesitate to ask questions.