Okay, so this lecture is going to be a little shorter than the previous one, at least that's the plan, and it's going to be about the Crunch web server, and Crunch is for crunching ChIP data. That is to say, you upload ChIP-seq data, and all the processing is done automatically, including finding all the peaks that are significantly enriched, and doing a very comprehensive analysis of motif occurrence and binding site occurrence within those peaks. All right, so if you go to the website crunch.unibas.ch, this is the page that you will see, and you can now upload your data; currently supported organisms are human, mouse, and Drosophila. So there are three ways in which you can provide your data: either you can upload your FASTQ files directly using this tab; or with this tab you can provide links to the URLs where your files are stored, so you don't have to upload them yourself, you just tell the server at which web addresses your data files are, and Crunch will fetch the data itself; or, if your data is in the Sequence Read Archive, you can simply provide the IDs of the data sets, and Crunch will fetch them as well. So, as already came up, Crunch would like to get both foreground and background files, where the foreground files are IP samples and the background files are input DNA, so by clicking on these buttons you can upload both the IP samples and the input DNA, and this works similarly with the links, where you give separate links to the IP samples and to the background samples. All right, so again, you don't need to set any parameters or anything like that; you just upload your data and then you say go, and of course it's going to take some time, so typically it's a good idea to provide your email address and a name for your project so that you get automatically notified when the analysis is done. The analysis that Crunch performs consists of three steps. The first step is the preprocessing, which includes quality filtering of the reads, removing adapters, mapping the reads, and estimating how big the fragments were in your data set that these reads derive from. The second part of Crunch's analysis consists of calling the peaks, and this again consists of detecting regions along the genome that are enriched in the IP sample, decomposing these regions into individual binding peaks, and then annotating the peaks, that is, finding out what the nearest associated genes are. Third, and this is maybe the most unique part of our pipeline, we perform a comprehensive motif analysis on all these binding peaks, which means that we both find new binding motifs ab initio and, by combining these de novo motifs with a library of known motifs, identify a set of motifs that can together explain your peaks; we then predict sites for all these motifs in all peaks, and finally score and annotate the motifs. Okay, so I'm now going to explain to you what each of these steps does, and then I will illustrate what the results look like.
So the first part of the preprocessing that Crunch does is to truncate low-quality 3' ends of reads: sometimes the sequence quality becomes bad near the end of the read, and Crunch truncates reads to make sure that all the remaining bases are of sufficient quality. If after this truncation a read is too short, or the sequence quality across the entire read is too low, or there are too many ambiguous nucleotides, or the read has very low dinucleotide entropy, that is, if it's all a repeat of A or a repeat of ATATAT, then the read is filtered out. Okay, second, Crunch has a library of adapter sequences that are used in sequencing protocols, and it looks for matches of these 3' adapter sequences to the reads; it identifies the adapter sequence that has the most matches and then cuts that adapter out of your reads, so this removes adapters from the reads automatically. After this step, we do read mapping. We use a standard algorithm, currently Bowtie, to map the quality-filtered reads, and I think the only thing that is somewhat special in what we do is that we only keep the best mappings for each read: if a read can match with, let's say, one or two errors, we only keep the matches with one error, and if there are multiple mappings of the same quality for a read, we distribute the read uniformly over all its mapping locations. So if a read maps to 10 places, it is assigned to each place with weight one-tenth. Okay, here is just a little picture from a report file that shows what fraction of reads have zero errors, one error, two errors, as a function of position along the read. So these are all fairly standard steps; it's just that you don't have to worry about these things, Crunch performs them automatically for you. After this, we want to estimate the sizes of the fragments from which these reads derive. I took these pictures from a review article by Kharchenko in Nature Biotechnology, but the idea was actually published before that by Christoph Schmid and Philipp Bucher. The idea is that if you have a protein bound somewhere on the genome and you do ChIP-seq, then after sonication of the DNA you get fragments, shown here in red and blue, and when you now sequence, you may sequence such a fragment either from the left end on the positive strand or from the right end on the negative strand. So in red are the positive-strand reads that come from the left 5' end of the fragment, and in blue are the negative-strand reads that come from the other end of the fragment. So when you have such a protein somewhere on the DNA, you're going to get one set of reads coming from the left ends of fragments, and they will make a peak here, and on the opposite strand you will get a set of reads that makes a peak that is shifted along the chromosome, and the shift between the reads on the plus strand and the reads on the minus strand is exactly the average fragment size. So that means that whenever you have a binding event, you're actually going to get two peaks, one on the plus strand and one on the minus strand, shifted by the fragment length.
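To make these preprocessing steps concrete, here is a minimal Python sketch of three of the ideas just described: 3'-end quality truncation, a dinucleotide-entropy filter for low-complexity reads, and uniform weighting of multi-mapping reads. This is an illustration under assumed thresholds and function names, not Crunch's actual code.

```python
# Minimal sketch (not Crunch's implementation) of three preprocessing ideas:
# 3' quality truncation, dinucleotide-entropy filtering, and uniform
# weighting of equally good multi-mappings. Thresholds are illustrative.
import math
from collections import Counter

def truncate_3prime(seq, quals, min_qual=20):
    """Trim the read back from the 3' end until the last base meets min_qual."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]

def dinucleotide_entropy(seq):
    """Shannon entropy (bits) of the dinucleotide composition of a read;
    very low values flag repeats like AAAA... or ATATAT..."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mapping_weights(alignments):
    """Keep only the best (fewest-error) alignments of a read and spread a
    total weight of 1 uniformly over them."""
    best = min(n_errors for n_errors, _ in alignments)
    hits = [loc for n_errors, loc in alignments if n_errors == best]
    return {loc: 1.0 / len(hits) for loc in hits}

seq, quals = truncate_3prime("ACGTACGTAT", [30, 32, 31, 28, 30, 29, 25, 18, 12, 8])
print(seq, round(dinucleotide_entropy(seq), 2))
print(mapping_weights([(1, "chr1:100"), (2, "chr2:500"), (1, "chr5:900")]))
```

In the last example, the read maps equally well (one error) to two locations, so each gets weight one-half, matching the one-tenth-per-location example above.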
So you can use this fact to estimate the fragment length, and thereby also to estimate the center of each fragment, given a read on either the positive or the negative strand. The way you do this is by looking at the cross-correlation between the occurrence of reads starting on the plus strand and reads starting on the minus strand, shifted by a certain distance, and the shift at which this cross-correlation is largest is an estimate of the fragment size. So Crunch does this, and here's a picture for one data set of what this cross-correlation profile looks like: it has a peak at a fragment size of 137. So Crunch estimates that the fragment size is 137, and it then replaces every read's mapping with the estimated center of the fragment: reads on the plus strand get shifted forward, reads on the minus strand get shifted back. All right, so that was the preprocessing step. The next step of the protocol consists of peak finding, and this is again a picture that I took from a review article. The general idea is that you slide a window over the genome, and in each window you look at the density of reads coming from the IP sample and the density of reads coming from the input sample, and you find regions that are statistically significantly enriched for reads from the IP sample, that is, where the density in the IP sample is statistically significantly higher than in the input sample. Crunch follows the same protocol: it slides a window across the genome, collects all windows above a significance threshold, and fuses consecutive windows that are all above the threshold into enriched regions. Okay, so one very important feature of Crunch is the statistical model we use for detecting enriched regions, and this is what I'm going to explain next, but before I get to that: we've noticed, and we're not the only group that has noticed this, that there are some regions in the genome, a very small fraction, that have very high read density even in the input. In principle, in the input DNA you should get a roughly constant density of reads across the genome, and indeed for more than 99% of the windows in the genome the read density falls within a narrow range, here somewhere between maybe five and 15 reads per window. However, about one in a thousand of the windows in the genome has a much higher density of reads, and exactly what causes this is a complicated question we haven't really dealt with in detail, but these regions are often associated with repeats. They are regions that align poorly across genomes, so if you try to find orthologous regions of those regions in other genomes, it's hard to do, and we think what is going on is, for example, that there are repeats that exist in different copy numbers in different genomes, and the copy number in the assembly might not actually match the true copy number of the repeat in the genome from which your data derives. So there might be some repeat that occurs a thousand times in the genome your data comes from but only 50 times in the assembly, and that means that all the reads from those thousand repeats have to map to these 50 copies in the assembly, and so you get a 20-fold enrichment of reads in those regions.
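The cross-correlation idea can be sketched in a few lines. Below is a toy Python version, not Crunch's implementation: count read 5' starts per position on each strand, find the strand shift that maximizes the Pearson correlation between the two profiles, and take that shift as the fragment-size estimate; each read would then be moved by half of it toward the fragment center. The simulated data and parameter values are made up for illustration.

```python
# Toy strand cross-correlation estimate of the fragment size.
import numpy as np

def estimate_fragment_size(plus_starts, minus_starts, max_shift=400):
    """plus_starts / minus_starts: per-base counts of read 5' ends per strand."""
    best_d, best_corr = 0, -np.inf
    for d in range(1, max_shift):
        # shift the minus-strand profile left by d and correlate
        corr = np.corrcoef(plus_starts[:-d], minus_starts[d:])[0, 1]
        if corr > best_corr:
            best_d, best_corr = d, corr
    return best_d

# Simulated binding site at position 500 with true fragment size ~140:
rng = np.random.default_rng(0)
L = 1000
plus = np.bincount(rng.normal(430, 10, 2000).astype(int), minlength=L)[:L]
minus = np.bincount(rng.normal(570, 10, 2000).astype(int), minlength=L)[:L]
d = estimate_fragment_size(plus.astype(float), minus.astype(float))
print("estimated fragment size:", d)  # ~140; center = read 5' end + d/2
```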
So these regions are difficult because the statistical model that we're going to use to detect enrichment doesn't apply to them, it breaks down, and so what we do is filter these regions out beforehand: Crunch detects in your input sample this very small fraction of regions that have abnormally high density in the input, and removes them. All right, so now we get to the crucial part of detecting the peaks. We need a statistical model of how we expect the densities of reads in the IP and in the input sample to fluctuate, and already quite a long time ago we worked on a statistical model to describe such fluctuations in the read densities of sequencing data; we first applied this to CAGE data, and what we found is that the fluctuations can be very well described by a statistical model with multiplicative noise, that is to say, log-normal fluctuations in density, convoluted with Poisson sampling noise due to the sequencing. Okay, I will not go into the mathematical details of why this is a good model, I will just explain what the outcome is. For each window in the genome, we count the number of reads little n from the IP sample and little m from the input sample, and we also count the total number of reads in the ChIP sample and in the input sample, capital N and capital M. So the density in the window in the foreground is little n over capital N, and the density in the background or input is little m over capital M, and we define our enrichment quantity x as the log ratio of these two densities. Now, according to this model of multiplicative noise convoluted with Poisson sampling, in regions that are not enriched, that is, where no binding is occurring, this quantity x should fluctuate in a Gaussian way, where the variance of the Gaussian is given by two times the variance sigma squared due to the multiplicative noise, which we don't know a priori and which might be different for every sample, plus the variance coming from the Poisson sampling, which on this log scale is 1 over n plus 1 over m, where little n and little m are just the read counts in this window. So this model takes into account that the actual noise is different in every window and is a combination of the sampling noise plus the noise due to multiplicative fluctuations. All right, so what Crunch does is describe these read densities across the genome by a mixture model, where with probability rho a region is not enriched, so a fraction rho of the genome is not enriched, and there the observed value of x, the log ratio of the density in the IP and in the background, comes from this background distribution; and a fraction one minus rho of the windows in the genome are enriched, and there x can be anything, so we assume that x comes from a uniform distribution, which is just one over the range of x: x max is the highest x we ever see, x min is the smallest x we ever see, and the uniform distribution is 1 over x max minus x min.
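As a formula, the model says that for window i, x_i = log((n_i/N)/(m_i/M)), and in unenriched regions Var(x_i) = 2*sigma^2 + 1/n_i + 1/m_i. Here is a tiny sketch of that computation; the pseudocount, which keeps zero counts from blowing up the log, is my own assumption for the example, not necessarily how Crunch handles it.

```python
# Per-window enrichment statistic and its model variance: multiplicative
# (log-normal) noise plus Poisson sampling noise. Illustrative sketch.
import numpy as np

def window_stat(n, m, N, M, sigma2):
    """x = log((n/N)/(m/M)) and expected variance 2*sigma2 + 1/n + 1/m."""
    n = np.asarray(n, float) + 0.5  # pseudocount (assumption for the sketch)
    m = np.asarray(m, float) + 0.5
    x = np.log((n / N) / (m / M))
    var = 2.0 * sigma2 + 1.0 / n + 1.0 / m
    return x, var

x, var = window_stat([12, 200], [10, 9], N=2e7, M=2.2e7, sigma2=0.02)
print(x, np.sqrt(var))  # second window: high x, small expected std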
Okay, so that's the model, but now notice there are three unknown parameters. There is the mean mu: you might think that the mean of these densities should be the same in the IP as in the background, but that's not really true, because in the IP there are actual peaks and they're quite high, and therefore in the regions where there is no enrichment the density will be systematically a little lower in the IP sample than in the background sample, so mu is this overall shift. Sigma quantifies the multiplicative fluctuations in the read density, and it is really a measure of how noisy the other aspects of the protocol were, like sonication, PCR amplification, and sequencing library preparation. And finally rho is the fraction of the genome that is background. We fit all these parameters by maximizing the likelihood of the data, and we do that for every data set that is submitted. Once we have done that, we can calculate for each window i in the genome a normalized z-score, which is again the log ratio of the densities in the IP and the background, minus the shift mu, divided by the expected standard deviation, which comes from two times the variance of the multiplicative noise plus the sampling noise from both the IP and the background sample. All right, so if the entire genome had no enrichment whatsoever, if there were no binding events, then these z values should perfectly follow a standard Gaussian distribution. So to show what the enrichments look like genome-wide, we make a plot of the distribution of these z-scores, and notice that we plot the distribution on a log scale, so that a Gaussian distribution, which goes as e to the minus z squared over two, turns into a parabola. In red is the standard Gaussian distribution and in black is the observed distribution of z-scores across the windows of the genome, and you see that the black line follows this perfect Gaussian very well until you get to a z-score of about four or so, and there all of a sudden there are many more windows with high z-scores than you would expect under this null distribution, and those are in fact the windows that we will call significantly enriched. Okay, so this plot shows two things: first, it shows that the statistical model is successful in capturing the distribution of fluctuations that you see genome-wide, and second, it gives you a systematic way of setting a cutoff for calling which windows in the genome are really significantly enriched. What we in fact do is use this mixture model to calculate for each window a posterior probability that it is truly enriched, and then we set a cutoff in z-score such that the average posterior of all windows with z-score at least as high as the cutoff is 0.9. Another way of saying that is that the false discovery rate is set to 10%: by definition, we pick the cutoff so that we expect one in ten of the predicted peaks to be a false discovery. And by the way, as far as I'm aware, this is the only statistical model used for ChIP-seq analysis that actually demonstrably matches the statistics that you see in the data; as far as I know, none of the other approaches do.
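Here is a small sketch of how the mixture model turns into a peak cutoff, under the assumptions that mu, sigma squared, and rho have already been fitted by maximum likelihood and that the enriched component is uniform over the observed z range; this is my schematic reading of the procedure described above, not Crunch's code.

```python
# Schematic z-scores, posteriors, and FDR-based cutoff for the mixture model.
import numpy as np
from scipy.stats import norm

def z_scores_and_cutoff(x, var, mu, rho, fdr=0.1):
    z = (x - mu) / np.sqrt(var)                  # standardized enrichment
    p_bg = rho * norm.pdf(z)                     # Gaussian background part
    p_fg = (1.0 - rho) / (z.max() - z.min())     # uniform enriched part
    posterior = p_fg / (p_fg + p_bg)             # P(enriched | z)
    order = np.argsort(z)[::-1]                  # windows from high to low z
    mean_post = np.cumsum(posterior[order]) / np.arange(1, z.size + 1)
    ok = np.nonzero(mean_post >= 1.0 - fdr)[0]   # mean posterior >= 0.9
    cutoff = z[order[ok[-1]]] if ok.size else np.inf
    return z, cutoff

# Synthetic genome: mostly background windows plus a few enriched ones.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 0.3, 100000), rng.normal(3.0, 0.5, 100)])
z, cut = z_scores_and_cutoff(x, var=0.09, mu=0.0, rho=0.999)
print("z cutoff: %.2f, windows called: %d" % (cut, (z >= cut).sum()))
```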
So now remember that we have found all the enriched 500 base pair windows along the genome, and sometimes the windows overlap, because we slide this 500 base pair window by 250 base pairs at a time, so you can get overlapping windows that are both enriched, and we merge those into enriched regions, and now we want to decompose these enriched regions into individual binding peaks. Because we've estimated the fragment size distribution, we know what width of peak we expect to see if there is a single binding event on the genome, and so from that we can take the read density that we've seen in an enriched region and try to decompose it into individual binding events. Here's one example of an enriched region that is a thousand base pairs long, so we had several consecutive windows that were all significantly enriched, and now we take this thousand base pair region and calculate the read density along it, which is what you see in red, and then we decompose this read density: we model it as a mixture of a uniform background together with a set of Gaussian-shaped peaks whose width is constrained by the fragment size that we estimated previously (a toy sketch of this kind of fit follows below). So what you see here in black is the fitted mixture, which consists of a uniform background plus the two Gaussian peaks shown in gray dots, and in the end we call two peaks in this region, consisting of this blue and this green region, and then we recalculate the statistical significance, the z-score, for the green window and the blue window separately. Right, so in the end the peaks that we report are these individual binding peaks, and they're typically fairly short; you see here they're around 80-85 base pairs wide. All right, so the first thing that Crunch reports is a sorted list of all the peaks that it has predicted on the genome, and of course you can download this file. For each peak it gives you the genome coordinates, the z-score of the peak, and the closest genes: here the nearest gene to the left of the peak, together with where that gene is, so the transcription start site of this gene is 124 base pairs to the left of the peak and the gene is transcribed on the minus strand, and the nearest gene on the right is this gene, whose transcription start is only 40 base pairs away, transcribed on the plus strand.
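Here is the promised toy version of the decomposition step: the read density in a region is modeled as a constant background plus Gaussian bumps sharing a width (a stand-in for the fragment-size constraint), fitted by least squares with scipy. Crunch's actual fitting procedure may differ; the data and numbers below are synthetic.

```python
# Fit a 1 kb enriched region as uniform background + two Gaussian peaks
# with a shared width. Illustrative sketch with synthetic read counts.
import numpy as np
from scipy.optimize import curve_fit

def density_model(pos, bg, a1, mu1, a2, mu2, width):
    """Uniform background plus two Gaussian-shaped binding peaks."""
    peak1 = a1 * np.exp(-0.5 * ((pos - mu1) / width) ** 2)
    peak2 = a2 * np.exp(-0.5 * ((pos - mu2) / width) ** 2)
    return bg + peak1 + peak2

pos = np.arange(1000.0)
truth = density_model(pos, 2.0, 12.0, 300.0, 7.0, 700.0, 40.0)
reads = np.random.default_rng(3).poisson(truth).astype(float)

p0 = [1.0, 5.0, 250.0, 5.0, 750.0, 50.0]  # rough initial guesses
params, _ = curve_fit(density_model, pos, reads, p0=p0)
print("fitted peak centers: %.0f, %.0f" % (params[2], params[4]))
```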
All right, now there are some slides that I thought were there that I don't see now, but in this case too you can click on this link and you will be taken to the genome browser, let me see, well, I'll get back to this. All right, the second thing that we show, to give you an idea of the quality of the fitted peaks, is six examples of what the read density profiles and the fits to them look like. So here is an example from the top five percent of regions, which have a very strong signal, and then it goes all the way down to examples from regions that were still statistically significantly enriched, but only marginally, and you see, for example, that these profiles get much noisier, because the read density gets lower as you go from very highly enriched regions to less enriched regions; you get these pictures to give you a sense of what the quality of the peaks looks like across their range. All right, so finally, and probably most importantly, once Crunch has predicted all these peaks, it will look for the binding sites occurring in the peak sequences that can explain why these peaks were enriched. The general approach is as follows. Crunch takes the top thousand peak sequences and randomly divides them into two sets of 500: 500 training peaks and 500 peaks for testing the predicted motifs. For all these peak regions we then find the orthologous regions in related genomes and make alignments; for the mammals we make alignments with seven mammalian genomes, for Drosophila with ten Drosophila genomes. Then we use a motif finding algorithm that we developed previously, this is now 15 years ago, called PhyloGibbs, and PhyloGibbs finds motifs de novo, taking into account not only the sequences but also their conservation patterns. Once we've found these motifs, we refine them with another algorithm called MotEvo, and this gives us up to 24 candidate de novo motifs: we find motifs both with alignments and without alignments, and motifs of different sizes, and this gives us a library of up to 24 candidate motifs found de novo on the peak sequences. Second, we have a large library of known motifs from databases: about 2,300 different position-specific weight matrices from resources like HOCOMOCO, UniPROBE, JASPAR, our own SwissRegulon collection, and HT-SELEX data. What we now want to do, for a given peak data set, is find a set of motifs, which can include both these de novo motifs and known motifs, that can jointly explain the binding peaks that we see. All right, so what does it mean to explain the binding peaks that we've seen? To do that, we use a probabilistic model that is a sort of idealized version of what happens in a ChIP-seq experiment. Basically, what happens when you do a ChIP-seq experiment is that the immunoprecipitation fishes out a subset of sequences from a much larger set of sequences along the genome.
Okay, so we somehow want to calculate the probability that the IP fishes out exactly the peak sequences that we've seen but none of the other sequences that occur in the genome, and our idealized model of this is as follows. We assume that we have a pool of sequences that consists of our peak sequences, in red, plus background sequences, in black, and for these background sequences we're not going to take actual genomic sequences; what we do instead is make random sequences that have the same lengths and the same dinucleotide composition as the peak sequences but are otherwise random. So there is a big pool of such sequences, and now, given a set of motifs and the concentrations of the transcription factors representing these motifs, you can calculate the expected number of transcription factors that will be bound to each of these sequences. That is, as a function of the sequence of each of these peak and background sequences, you can calculate how many of the transcription factors represented by these binding motifs are expected to be bound to it, and we now assume that the probability for the IP to fish out one of these sequences is just proportional to how many transcription factors are bound to it. So that's the basic idea: the probability of fishing a particular sequence is just proportional to the total number of sites for these motifs that occur in that sequence. The probability, if you fish out one sequence, that it will be sequence s is simply the number of sites for the motifs W in s divided by the number of sites for these motifs in all sequences, all black and red sequences together; that's the probability of fishing one specific sequence, and the probability of fishing all the red sequences and nothing else is then simply the product of this probability over all the red sequences. So we're calculating, for a given set of motifs, the probability of fishing out only the peak sequences that we actually saw and nothing else, and what we then try to do is find the set of motifs W that maximizes this probability; that will be the set of motifs that we say best explains the observed peak sequences. So for each set of motifs we can now assign an enrichment score, and this enrichment score is defined as the probability of fishing all the peak sequences and nothing else given these motifs, divided by the probability of fishing all the peak sequences and nothing else assuming you fish at random, and then this to the power one over the number of peaks; this basically gives you how much more likely it is to fish a true peak than a background sequence, averaged over all sequences (a small code sketch of this score follows below). Another way you can look at it is as the ratio between the number of sites for the motifs in your peak sequences and the number of sites in the background sequences. Okay, so for each ChIP-seq data set Crunch will now come back with such a complementary set of motifs, and this is an example data set for the transcription factor NRF1; you see that this complementary set contains four motifs, and they are given to you sorted by their enrichment. For each of these motifs you are now given a set of statistics that describe how enriched the motif is in your peak sequences.
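In schematic form, the enrichment score compares the probability of fishing out exactly the peak set under the motif model with fishing uniformly at random, per peak. Below is a toy Python rendering of that formula; the site counts are invented, and the real score in Crunch may normalize the fished-nothing-else probability differently than this simplified version.

```python
# Toy "fishing" enrichment score: sequences are fished with probability
# proportional to their predicted site counts for the motif set; the score
# is the per-peak (geometric-mean) ratio to uniform random fishing.
import numpy as np

def enrichment(peak_sites, bg_sites):
    """peak_sites / bg_sites: predicted site counts per peak / background seq."""
    sites = np.concatenate([peak_sites, bg_sites])
    n_peaks = len(peak_sites)
    # log P(fish exactly the peaks | motifs): product over peaks of s_i / total
    log_p_motif = np.sum(np.log(peak_sites / sites.sum()))
    # log P(fish exactly the peaks | random): (1 / #sequences)^n_peaks
    log_p_random = n_peaks * np.log(1.0 / sites.size)
    return np.exp((log_p_motif - log_p_random) / n_peaks)

peaks = np.array([4.0, 2.0, 3.0, 5.0])       # site counts in peak sequences
background = np.array([0.5, 0.2, 0.1, 0.3])  # site counts in background seqs
print("per-peak enrichment: %.2f" % enrichment(peaks, background))
```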
So the first number here is simply the enrichment, that is, how much more common binding sites for this motif are in your peak sequences than in sequences with similar dinucleotide composition that are otherwise random, and here you get about 38-fold more occurrences of NRF1 sites, or sites matching this particular motif, than in background sequences. The second number here is the area under a precision-recall curve: basically, you could try to classify true peak sequences from background sequences simply by the number of binding sites for this motif that occur in the sequences, and if you use the number of binding sites for this motif as a classifier, you get the precision-recall curve shown here, and the area under this precision-recall curve is almost 93 percent, which says this motif is very good at distinguishing peak sequences from background sequences. Third, we investigate to what extent it is true that higher peaks tend to have more binding sites. What we do is take all our peaks and sort them by their z-score, which is a measure of the height of the peak, divide them into bins, and then in each bin of peak height we calculate the mean and standard deviation of the number of predicted binding sites, and you see in this case there's actually quite a good correlation, which says that as peaks get higher, you tend to also get more binding sites; this is a Pearson correlation of 0.67, which is really quite high, so it's true for this motif that higher peaks tend to have more sites. The next thing that you can check is whether the binding sites tend to occur in the center of the peaks or away from the center. Notice that if binding sites occur near the center of the peaks, they occur at those positions in the enriched region that have the highest ChIP signal, so what we look at is the distribution of the ChIP signal at predicted binding sites relative to the distribution of the ChIP signal across the entire peaks, which is shown here in green and blue, and you see for this motif that the ChIP signal is on average nine times higher at binding sites than outside of binding sites within the peaks. And then finally we also just tell you that, out of all 9227 peaks, almost 4,000 had binding sites for NRF1. Okay, so for each motif you get this kind of information. Now, the way we create an optimal complementary set of motifs is that we start with the most highly enriched motif and then iterate the following procedure: for each of the remaining motifs, we add this new motif to our motif set and calculate the enrichment for the new set of motifs, that is, our current motifs plus this new motif w prime, and we then select the w prime that maximizes this enrichment. So given that we started with one motif, we now ask for a second motif such that, if I sum together the sites of both motifs, I get the strongest enrichment, and I keep iterating, adding one motif at a time, until the enrichment increases by less than five percent. For this particular example, an ATF2 data set from ENCODE, the best motif was one found de novo that had an enrichment of only 1.8, but when a second motif was added, a motif for CREB3, the enrichment went up to 2.75; with another one it went up to 3.3, and finally, with a fourth motif, it went up to 4.2. So we provide a plot that shows you how the total enrichment goes up as you add more motifs to the set, and we also provide a correlation plot that shows you which of the motifs in the set tend to co-occur and which tend to avoid co-occurring.
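The greedy construction just described can be sketched as follows. It reuses the enrichment() function from the previous sketch; sitecounts is a hypothetical mapping from motif name to its per-sequence site counts in peaks and background, and the site counts of a motif set are simply summed per sequence, as described above.

```python
# Sketch of the greedy motif-set construction: start from the best single
# motif, repeatedly add the motif that most increases the joint enrichment,
# and stop once the relative gain drops below 5%. Assumes enrichment() from
# the previous sketch is in scope; sitecounts is a hypothetical input:
# {motif_name: (peak_sites_array, bg_sites_array)}.
def select_motifs(sitecounts, min_gain=0.05):
    def joint(motifs):
        peak = sum(sitecounts[m][0] for m in motifs)  # summed site counts
        bg = sum(sitecounts[m][1] for m in motifs)
        return enrichment(peak, bg)

    chosen = [max(sitecounts, key=lambda m: joint([m]))]  # best single motif
    score = joint(chosen)
    while True:
        rest = [m for m in sitecounts if m not in chosen]
        if not rest:
            break
        best = max(rest, key=lambda m: joint(chosen + [m]))
        new_score = joint(chosen + [best])
        if new_score < score * (1.0 + min_gain):          # gain below 5%
            break
        chosen.append(best)
        score = new_score
    return chosen, score
```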
So for example, you see here a red dot between SP4 from HOCOMOCO and SP1 from HOCOMOCO, which says that those two motifs tend to co-occur in the same peaks. All right, so you can explore for yourself what the results of Crunch look like, and for that we've made available a large number of results on ChIP-seq data sets from the ENCODE project. If you go to the start page and follow the ENCODE reports link at the top left, you will get a page with a long list of data sets, and these data sets are given with the name of the transcription factor, the cell line in which the experiment was done, the lab that did the experiment, and links to the actual raw data. To illustrate the results for you, I took the results for the BRCA1 transcription factor, done in the GM12878 cell line by the group of Mike Snyder at Stanford. Okay, so the Crunch report has at the top a menu that guides you to the different parts of the results, quality control, peaks, motifs, top motifs, and downloads, and you can use those to navigate, and at the top there is a summary that gives you overview statistics of the quality of the results for this ChIP-seq data set. It will tell you what fraction of the reads in the IP sample was successfully mapped, in this case almost 86 percent, and comparing across all the ENCODE data sets we have, about 150 data sets or so, we see that this is quite good, around the 73rd percentile, so better than average. The fraction of reads from the background samples, the input samples, that were mapped is 89 percent, the 78th percentile. Next, remember that when we fitted our mixture model for the peaks, there was this quantity sigma that tells you how noisy the read density was as you went along the genome; that was fitted for this data set, and so we next show you what the noise level of this ChIP signal was, and in this case the noise level was 0.14, which is quite low, in the 23rd percentile, so it was a good data set in the sense that the read densities weren't very noisy across the genome. The second thing we provide is how well, once we have fitted the model and transformed the read densities into z-scores, the z-scores follow the statistical model, and here this is also quite good, around the 30th percentile. To give you an idea of what we mean by this: I already told you that sigma is the noise in the ChIP signal, and remember that if there were a perfect fit to our statistical model, then all the regions that are not enriched would follow this black parabola. So I've shown you here a couple of data sets, where the colored dots on the right show the deviations of the fit. The best one is this dark green one, which has a deviation of about 0.4, and it looks like this: you see it fits the black curve very well and only deviates here in the truly enriched regions. Then the light green: about two-thirds of the data sets fit at least as well as this light green one. And in yellow and red are two examples of data sets that don't look so good; you see that in some data sets there seem to be regions that look enriched in the background versus the IP, so they're systematically lower in the IP than in the background, and there are many reasons why this can happen as a technical artifact of the experiment. So this statistic tells you, by how well this distribution fits this curve, how successful the model was in fitting the read densities that were actually observed across the genome.
All right, then for the peak statistics you get how many peaks were found; in this case it was actually really low, only about 1300 peaks genome-wide, which is in the 13th percentile, and we also report what fraction of all reads mapped to the peaks, in this case 0.4 percent, which is also quite low. So there are not many peaks and they're not extremely enriched. And then finally, how successful was the motif enrichment: the best enriched motif was fivefold enriched, which is about average, the 56th percentile, and the complementary set, when you take all the motifs in our set together, had an enrichment of 10.6, which is also about average. All right, so you first get this kind of summary that gives you an idea of how good your data set is. You then get information on mapping quality: it tells you how many reads there were, how many were left after removing low-quality reads, how many after removal of adapters and low-complexity reads, and how many after mapping, so you see you go from 22 and a half million to 19 and a half million, and this is okay; the same here for a second replicate, which went from 30 million to about 26 million. There are also detailed reports on all these things that give you more detailed information, like what kind of errors you saw and whether they were at the 5' end or the 3' end, and so on. Okay, then the report on the peak calling tells you how many windows along the genome were above the z-value cutoff, here about 3000, but a lot of them were overlapping, so when we fused them into non-overlapping regions we had 1700 regions, and when we fitted individual binding peaks in these regions, there were 1348 binding peaks left. This is the distribution of z-scores that we found in our windows, and it actually fits the predicted standard Gaussian curve quite well, with the truly enriched regions up here; here is the reverse cumulative distribution of z-scores, and the red line shows you where we put the cutoff. Okay, so here are examples of peaks fitted to these enriched regions. In many cases you will find only one statistically significant peak: both here and there we fitted two Gaussian peaks, but one of them was basically not sufficiently significant to make it into the final report, and in both these cases only the green peak made it in the end. All right, the list of annotated peaks I showed you before, so let's now go through the meaning of all these columns. First is the location of the peak on the genome, and now I get to show what I wanted to show you: if you click on this link, you will actually be taken to the genome browser. Notice that this annotation already tells you that only 45 base pairs to the right is the start of a gene called NR, whereas on the left the nearest gene is 22 kilobases away. So if you follow this link, it will take you to the genome browser, and you will see that the yellow region here is the peak region, and because this overlaps a promoter, you see here a known promoter from a promoter annotation, a promoter of this gene, and these are the binding sites predicted in this promoter. Okay, so the second column is the z-score of this peak, that's really how high it is, then the nearest gene on the left, the distance to that gene, the nearest gene on the right,
and the distance to the TSS of that gene. All right, and all this information you get, of course, for all peaks, and it is available in a flat file. Here are the complementary motifs that were fitted for this BRCA1 data set: we found a set of six motifs, and they're sorted here by the size of their contribution. So this motif was put first; it was a de novo motif, found new, with an enrichment of 5.3, and when the second motif was added, the enrichment went up to 8.5, and so on. Under this table you get plots that show you how the enrichment went up as more and more motifs were added, and in the right panel the correlation of the occurrence of these motifs in the same peaks, and you see that for this set of motifs there is hardly any significant correlation of motif occurrence; it looks like these motifs occur independently of each other. All right, now for each of these motifs you can click on a link and you will be taken to more information about the motif. The first thing you get is the logo of the motif and the reverse complement of this logo, and then you get a list, and this is especially useful not for known motifs but for motifs that were found de novo: we basically went through our database of known motifs and asked which known motifs are most similar to this de novo motif. So here is this list, together with a distance metric, basically how much percent difference there is between this motif and each of these motifs, so this one is at a three percent difference, this one at six percent, together with the logos of these motifs, and if you look by eye, these motifs indeed look very similar to the motif we found. So we find that this de novo motif is close to a motif called GFX in HOMER, a motif called UA1 in ENCODE, a motif called ZBTB33 in HOMER, another motif called ZBTB33 in HT-SELEX, and a motif called Kaiso in HOCOMOCO. Now I can tell you that it is now known that the transcription factor that binds this motif is in fact called Kaiso, and this UA1 motif from ENCODE is also just a motif for Kaiso. All right, so that's one piece of information you get, and then I already showed you the other statistics about the motifs: the precision-recall curve, the correlation between how high the peak is and how many sites for the motif it has, the enrichment of the ChIP signal at binding sites relative to the rest of the peak region, and in what number of the peaks the motif occurs. And notice that for almost all of these the motif occurs in only about half of the peaks; this is actually quite common, that your peaks often cannot be explained by a single motif, you really need a set of multiple motifs to explain all your peaks. All right, so finally, at the end of the report is a whole set of files for download. You can download the complete report, that is, all the HTML files with all the links and all the downloadable files, and you can download flat files of the peaks, which you often want. And by the way, this is important and I forgot to mention it: we also annotate, for each peak, where the binding sites occur for each of the motifs in the motif set. So this complementary motif set consists of six motifs, and we then went with all these motifs through all the binding peaks and annotated where sites occur for each of these motifs, so if you download the peak set, you will also get annotations in there of where
binding sites occur for each of these transcription factors. All right, you can also download PDF reports with more details about the statistics of the peak calling and the motif analysis. You also get, for each data set that you uploaded, a WIG file; a WIG file is a file that you can upload to a genome browser that gives you the read density along the genome, and you can use it to look yourself at the peaks we predicted and at the actual profiles in the genome browser. Similarly, it also gives you BED files with all the mappings, if you just want the mappings, and also reports on the mapping for each of these replicates. All right, so I just want to close with a couple of remarks, first mentioning some things that we noticed when we were analyzing these ENCODE data sets. One of the things we noticed is that for some transcription factors you find only a single motif, and for another set of transcription factors you really need multiple motifs. So if you ask, once I have the top motif, how much extra information there is in adding more motifs, there is a whole set of transcription factors for which there is essentially no more information that you can add; it seems that the binding peaks are described by just a single motif for a single transcription factor. But this is true for only about one third of the data sets; for two-thirds of the data sets there is either a moderate amount or a large amount of extra information contained in other motifs. So it seems that transcription factors naturally separate into transcription factors that can define peaks all by themselves and transcription factors whose peaks are only defined by a whole group of motifs. All right, a second thing that we noticed, which I thought was actually quite striking, is that in these large collections of motifs from different databases you get many motifs that look very, very similar. I just want to show you here an example: the top five library motifs for the transcription factor MAX from two different experiments. So there were ChIP-seq experiments done with the MAX transcription factor in a HeLa cell line, and there were also ChIP-seq experiments done in the GM12878 cell line, and I'm now showing you the top five motifs that Crunch found from the ChIP-seq data in HeLa cells and here from the ChIP-seq data in GM12878 cells. Now, if you look at these sequence logos by eye, they all look identical; you see the CACGTG, and okay, maybe the heights of the letters are a tiny bit different, but these look very, very similar, and these are indeed motifs from different databases for the MAX transcription factor: this one is from HT-SELEX, this from HOCOMOCO, this from SwissRegulon, this is another HT-SELEX one, and this is another one from SwissRegulon. And even though these motifs look very, very similar, if you let Crunch analyze the data from the two different cell lines, they come out in almost exactly the same order. So even though they look extremely similar, this HT-SELEX MAX motif came out on top in both cases, and here numbers two and three were flipped, but out of the top five motifs, the top four were essentially the same in both cell lines. So it really looks like you can use this analysis to tell apart subtle differences between motifs, and we find this kind of consistency across most of the transcription factors that were analyzed in
multiple cell lines. Okay, so those were two observations, and then finally I want to close by telling you something about what we're currently doing and hoping to put out soon. As you've seen, Crunch is an automated pipeline that essentially takes one experiment, an IP with some transcription factor plus some input DNA, finds basically all the places in the genome where the transcription factor binds, and then reports for you which motifs occur among these binding peaks. But in many cases, and some of you already brought this up in the morning, you're interested in comparing ChIP-seq data across different samples: you might have the same transcription factor in different tissues or at different time points in development, or maybe you're not looking at a transcription factor but at chromatin accessibility, with ATAC-seq or DNase-seq, and you're interested in comparing this across a number of samples. So we've actually been working on that: Anna Kramer, a PhD student in my group, has been working on integrating Crunch and MARA to do this kind of analysis, and we call this CREMA, for cis-regulatory element motif activities. What this approach does is take all the samples and run Crunch on each sample to get the binding peaks in each sample; in this case three samples are shown, and you see there's a green peak here in the first sample, there are three peaks in the second, orange, sample, and there are two peaks in the third sample. Then what we do is take the union of the peaks from all the samples, and we call these cis-regulatory elements, so in this case there are three cis-regulatory elements in this region. We then measure the height of each of these peaks across our samples: the first CRE is highest in the orange sample, the second CRE is higher in the orange and the blue, and the third CRE is fairly high in all three. So we basically get a height for each of these peaks across all of the samples, and then we use MotEvo with our library of known motifs to predict binding sites in each CRE, and finally we use the MARA model that I told you about in the morning to explain the changes in height, or even the absence or presence of these peaks across the different samples, in terms of the binding sites occurring in each peak and motif activities. All right, so this is something that is now close to ready to share; at the moment you cannot run it online yet, but I'm telling you that this is coming soon, and then you will be able to integrate the Crunch and MARA analyses and look at motif activities driving either chromatin accessibility or transcription factor binding genome-wide across a set of samples. All right, so with that I've come to the end. I just want to acknowledge the various people that helped develop this pipeline; it took quite a number of years, and it was something that we were just running in our group for a while, and so there's a whole number of people that contributed parts of the processing. For example, Nick Kelly and Sylvia Salatino both contributed scripts to the preprocessing, and Saeed Omidi helped set up a general pipeline for combining all these things together. Phil Arnold developed the first version of the peak finder, and he was also the main developer of the MotEvo algorithm that you
already heard about. And then I had an extremely talented master's student called Severin Berger who actually put the whole pipeline together and made this all into an integrated system where you can now simply upload your FASTQ files and all this analysis is done automatically. All right, so with that I've come to the end, and I would invite you all to ask questions if you have any.