Welcome, everybody, to the second session, which will be on EPD, the Eukaryotic Promoter Database. It is a relatively old resource, more than 30 years old, so I take the liberty of starting with some more philosophical remarks. Obviously, if you want to develop a database for something, you should know what your target is. Promoters are a difficult case in this respect, because "promoter" is an ill-defined term, a little bit like "gene", and it has maybe created more confusion than clarification.
So we should probably go back to the origins. The term was introduced by Jacob and Monod, who defined a promoter as a DNA sequence located upstream of the gene, to which RNA polymerase binds and where it initiates transcription. One question you could ask is whether this was a discovery or an invention. Another is whether such DNA sequences actually exist, and they certainly do exist in the case of the lac operon in E. coli. The problem is that when people started to analyze eukaryotic promoters, they expected promoters to be something very interesting there as well, but it became clear that such things do not exist in eukaryotes in this form.
Specifically, on the slide, what is in green applies to eukaryotic promoters as well, and what is in red is false or only partially true. In E. coli, promoters are directly recognized by RNA polymerase; in eukaryotes they are not. And they are located upstream only sometimes: in the case of tRNA genes, for instance, the promoters are said to be downstream. But there I am referring to a specific sequence motif that is located downstream, not necessarily to a eukaryotic promoter in the modern sense. So the only definition that somehow works for eukaryotes, and the one we applied when EPD was created, is that a promoter is a site, or a short DNA region, where Pol II initiates gene transcription.
And note that I specified that it should initiate gene transcription, not just transcription, because RNA polymerase can also initiate transcripts which do not give rise to RNA or protein products. This is a theoretical definition: it is something that we imagine is happening in the nucleus. But if you want to compile a database, you need an operational definition based on experimental data, and here the definition is: a promoter is a cluster of mRNA 5' ends mapped to the genome.
These thoughts about the definition have created a situation where many people ask me: why do you call it a promoter database? It is a transcription start site database. I think that would make sense, and things would be easier if nobody had introduced the term promoter for eukaryotes. But given the background, we call them promoters. This is also reflected in the newly coined term "promoterome", which refers to the entirety of all transcription initiation sites of a genome. I should say that "gene" is also an ambiguous term. We recently extended EPDnew to certain classes of non-coding RNAs, but most of EPD relates to protein-coding genes.
Let me give a brief historical overview. It started almost 40 years ago, when a compilation of about 60 eukaryotic promoters was published in a review article by Breathnach and Chambon. Because I was interested in promoters at that time, I typed this compilation into a computer, a big one about the size of a refrigerator, which had six kilobytes of memory. Then I continued to update this collection with new promoters that I found reported in papers. In 1986, when I had started a PhD at the Weizmann Institute in Israel, we published a list of 168 promoters in a paper in Nucleic Acids Research. And then I had the idea that maybe it would be better to publish this in machine-readable form.
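As an aside, the operational definition given above, a promoter as a cluster of mRNA 5' ends mapped to the genome, can be illustrated with a minimal Python sketch. This is not the EPD procedure; the function name `cluster_five_prime_ends` and the `max_gap` threshold are hypothetical choices made only for illustration.

```python
from collections import Counter

def cluster_five_prime_ends(positions, max_gap=20):
    """Group mapped 5' end positions (one chromosome, one strand) into clusters.

    max_gap is a hypothetical parameter: neighbouring positions that are
    more than max_gap bases apart start a new cluster.
    """
    counts = Counter(positions)          # number of 5' ends per position
    clusters, current = [], []
    for pos in sorted(counts):
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)

    summary = []
    for cluster in clusters:
        # The most used position serves as the representative start site;
        # ties are broken in favour of the smaller coordinate.
        dominant = max(cluster, key=lambda p: (counts[p], -p))
        summary.append({
            "start": cluster[0],
            "end": cluster[-1],
            "dominant": dominant,
            "tags": sum(counts[p] for p in cluster),
        })
    return summary

# Toy data: two groups of 5' ends, far apart on the same strand.
example = cluster_five_prime_ends([100, 100, 100, 101, 103, 500, 502, 502])
```

In this toy input the first cluster spans positions 100 to 103 with position 100 as the most used site, and the second spans 500 to 502.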
And I contacted people at the EMBL Data Library, and they were very happy to put my text file onto their magnetic tapes. That is basically how it became a kind of database in the modern sense, although content-wise it was of course not different from the table that we published on paper. In 1997, we introduced the web server. Then in 2005 to 2007, we, and that means mostly Giovanna, developed the SSA web interface and the ChIP-Seq server. Somewhat later, we created the MGA, which is a more structured data backend. And then it became necessary to completely redesign EPD, so we changed the basic structure and also the maintenance procedure. Mostly, what I will show you today is EPDnew. In 2020, EPDnew covers 15 model organisms from diverse eukaryotes.
We have problems switching slides. Okay. So, the guiding principles of the old EPD database, which have not changed in essence, but very much in the details. For promoter evidence, we accepted only experimental evidence, but under some defined rules also homology-inferred promoters. This we do not do anymore in EPDnew. An important assumption is that capped 5' ends of eukaryotic RNAs are actually created by transcription initiation, not by internal cleavage of an RNA. This is debatable, and it has been debated. But if we do not make this assumption, we simply have no data to make a database, so I will not comment much on this question.
The primary data source was journal articles. The experiments that were used at that time to characterize promoters are RNA sequencing, nuclease protection, and primer extension. With the exception of RNA sequencing, these are not used anymore. What is important to underline is that the content is based on a critical analysis of the data shown, and that means mostly pictures, and not just on the conclusions reported by the authors. So it was a lot of work to read all these articles.
As for the promoters, we did something very unusual at that time. We did not represent promoters as sequences, although most people thought of promoters as sequences. We represented them by pointers to positions in sequences. The purpose was comparative promoter analysis. EPD was also used as a source of training and test sets for promoter prediction algorithms. And we thought it would be a nice resource for bench biologists who are just interested in specific promoters of their preferred genes. But I think we were not so successful in publicizing EPD among bench biologists.
Just an example of the old experiments. You see the picture: this is a primer extension experiment. People use a primer somewhere in the upstream region of the RNA, extend this primer by reverse transcription, and put the product on a gel. With the parallel sequencing lanes, you can map the transcription initiation site to the sequence. In this case, it is actually quite sharp. What you may not realize is that at that time, one promoter was worth a paper. So in order to compile the 170 or 180 promoters in our first publications, we had to read that many papers.
This is slightly repetitive. My personal motivation to create EPD was the discovery and characterization of promoter motifs. Other people, especially computational biologists, were more interested in using it as a training and test set for promoter prediction. What I want to say at this point is that the Signal Search Analysis (SSA) package, which Giovanna introduced this morning, was developed in parallel and highly synergistically with EPD. Basically, if I wanted to characterize promoter motifs, I needed a data resource of promoters and I needed a software tool to find the motifs. Motif discovery in promoters was not as obvious as one might think. In a typical motif discovery situation, you have DNA sequences with a beginning and an end.
And you know that these sequences have some function or some biochemical activity. Then you try to find a motif that is overrepresented within these sequences. But with promoters, you do not know where the promoters start and where they end, and this has not changed since then. A promoter is an object which is mostly defined by a kind of center position, the TSS or the major TSS, but which has very fuzzy borders. That is why SSA has a different design from most other motif discovery programs.
To set up the problem of finding promoter motifs in a rigorous way, I defined promoter motifs as degenerate sequence patterns, which can be represented by consensus sequences, possibly with mismatches allowed, or by a position weight matrix, or maybe by something more advanced. The second element in the definition is that these motifs must occur at a fixed or constrained distance from the transcription start site. And finally, they should be overrepresented in this region relative to the start site in a statistically significant way. So the litmus test for promoter motifs is that they are locally overrepresented at a certain distance from the TSS, and that this overrepresentation is statistically significant, or maybe just intuitively convincing in a motif occurrence profile as shown in the lower right corner of this slide, which is from a publication from 1990. This figure was actually produced with an earlier version of the software, which produced somewhat different pictures.
This was the main use of EPD. It is also maybe what made it popular and what called attention to my work. In 1990, we published matrix descriptions of four major vertebrate RNA Pol II promoter elements, which we derived at that time from 502 unrelated promoter sequences, and you see the sequence logos of these motifs below.
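The litmus test described above, local overrepresentation of a motif at a constrained distance from the TSS, can be illustrated with a small Python sketch of a motif occurrence profile. This is a simplified stand-in and not the SSA implementation: the function names, the non-overlapping windows, and the consensus-with-mismatches matching are assumptions made for illustration, and no statistical test is included.

```python
def matches(segment, motif, max_mismatch):
    """True if segment differs from the consensus motif in at most max_mismatch positions."""
    return sum(a != b for a, b in zip(segment, motif)) <= max_mismatch

def occurrence_profile(seqs, tss_index, motif, window=10, max_mismatch=0):
    """Fraction of sequences with at least one motif match per window.

    All sequences are assumed to be aligned so that the TSS sits at
    index tss_index. Windows are labelled by their first position
    relative to the TSS (for example -30 covers -30 to -21).
    """
    hits = {}
    for seq in seqs:
        windows_hit = set()
        for i in range(len(seq) - len(motif) + 1):
            if matches(seq[i:i + len(motif)], motif, max_mismatch):
                rel = i - tss_index                    # motif start relative to TSS
                windows_hit.add((rel // window) * window)
        for w in windows_hit:
            hits[w] = hits.get(w, 0) + 1
    return {w: hits[w] / len(seqs) for w in sorted(hits)}

# Toy sequences of length 60 with the TSS at index 50.
seqs = [
    "G" * 20 + "TATAAA" + "G" * 34,   # motif starts at -30
    "G" * 60,                         # no motif
    "G" * 22 + "TATAAA" + "G" * 32,   # motif starts at -28
]
profile = occurrence_profile(seqs, tss_index=50, motif="TATAAA")
```

A peak in such a profile at one distance, against a low background elsewhere, is the kind of picture the occurrence profile on the slide conveys.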
Now, things have changed since then, and we really have to understand that the way biological research on transcriptional regulation, and on promoters in particular, is done has completely changed. When I started EPD, all data were produced by small labs. The technologies that were available were targeted at individual genes, research groups were usually also interested in a specific gene, and the data were published in journal articles, piece by piece.
Then, in the mid to late 90s, a development started which one could call the high-throughput biology revolution. Two factors contributed to this development. One was the invention of high-throughput technologies, which produced data in large amounts, and some of which were applicable on a genome-wide scale. The second factor was the Human Genome Project, which publicized the global approach: you should go global. Associated with this was that the data from high-throughput biology were no longer really accessible through journals. Maybe conclusions and statistical analysis results were published in journals, but the data as such were only available online in computer-readable form. This basically made the traditional data collection procedure of EPD, namely reading papers, ineffective.
In recent times, the most significant technological advance in gene regulation was the introduction of so-called next-generation sequencing (NGS) technologies. They are obviously no longer the next generation; they are more like the previous generation, but the term has remained. These assays produce huge amounts of data, and they are all applied genome-wide. Among the NGS-based assays, the ones which contributed most to our understanding of promoters are CAGE, which is a technology for quantitative single-base TSS mapping, ChIP-seq for mapping in vivo protein-DNA interactions, and MNase-seq, which allows the precise localization of nucleosomes.
So just a slide on Cap Analysis of Gene Expression, CAGE. On the left side is a schematic outline of how the technology works. The essential part is a method to separate messenger RNAs from degradation products based on the presence of a cap structure. Then a linker is added to the cap structure, and from this linker the sequences of the 5' ends of bona fide complete transcripts are generated. Initially this was done with a technology called SAGE, but today, obviously, it is done with any of the popular short-read sequencing technologies. These reads are then mapped to the genome.
What is important about CAGE is that it is a quantitative method to characterize RNA. So in a way, it is a substitute for other types of gene expression assays like RNA-seq or microarrays. Its advantage is that it gives you promoter-specific expression values. As you probably know, many eukaryotic genes have multiple promoters, and these promoters are often regulated in a tissue-specific manner. And as a side product, this technology also provides a detailed picture of the transcription initiation pattern in the promoter regions, which may tell you something about the specific molecular mechanisms by which transcription is initiated and regulated.
Now, CAGE is not the only massively parallel 5' end sequencing technology. Similar assays have been invented. TSS-seq is based on an oligo-capping method to sequence capped RNA. More interesting is GRO-cap. This is a method which sequences capped nuclear run-on transcripts. This technology has some importance for exotic organisms where the capped ends of messenger RNAs do not derive from the promoter, but from a spliced leader that is spliced onto the 5' region of the gene. This is true for C. elegans, for instance. So without GRO-cap, there would not be an EPD for C. elegans.
Now, just to show you what CAGE does, you see here the initiation patterns of two promoters revealed by CAGE. On top, you see a narrow promoter where most transcripts initiate at exactly the same position. Below, there is a broader promoter where the transcription initiation pattern is quite messy. Please have a look at the numbers on the scale on the left side, especially in the upper part. The peak represents 83,000 individual CAGE tags, so that is pretty high coverage. Now you can extrapolate this to maybe 30,000 promoters in the genome, and you realize how much information you get with a single CAGE experiment. And do not forget, CAGE is applied to tissue samples. So the CAGE resources offer that level of resolution for hundreds of different tissue samples.
MNase-seq is also an old technology, basically, which has been used for some time and which allows you to isolate nucleosomes. MNase, as shown on the left side, is a particular nuclease which does not attack DNA that is wrapped around nucleosomes, but precisely cuts at the borders of the nucleosome. MNase-seq is used basically like ChIP-seq; it just cuts out and produces nucleosomal fragments. The picture on the right is wrong in the sense that the overhanging green fragments are far too long for MNase-seq. They would be very close to the protein-DNA complex. But I show this picture because MNase-seq can be combined with ChIP-seq for histone marks, in which case it produces a high-resolution map of the subset of nucleosomes which carry a certain mark.
Now, MNase-seq led to a surprise discovery, namely that promoters have a very specific nucleosome architecture. On the left you see the picture that was obtained for yeast, and on the right side a similar picture for human, which we published some time ago. So you should... The black curve shows the distribution obtained after...
Well, the red and the green curves basically show the distributions obtained with the 5' and the 3' tags. So the red curve marks the beginning of a nucleosome and the green curve the end of the nucleosome. At that time we did not yet shift the tags.
So NGS has really changed our views of promoters. Some new insights gained: CAGE revealed that not only promoters but also enhancers are transcribed by polymerase II. ChIP-seq allowed us to identify promoter-specific histone marks like H3K4me3 and to investigate the function of histone marks in a more systematic manner. And finally, MNase-seq has revealed that active promoters are nucleosome-free, which was suspected but not really demonstrated, and that they are flanked by positioned nucleosomes on the downstream side.
Now, I should say something about positioned nucleosomes. Maybe let's go back to the previous figure. It was not clear, and in general it is still not entirely clear, whether nucleosomes in different cells, or on sister chromosomes within the same cell, actually occupy the same positions, or whether they randomly fill up the chromosomes, just leaving spaces of variable length between them, of 10 to 50 base pairs. What is clear today is that in a large part of the genome, maybe 80%, maybe 60%, nucleosomes are completely randomly positioned. I am speaking of human, not of yeast. Of course, in any given cell or chromosome, a nucleosome occupies a very specific site at any given time. But if nucleosomes occupied different positions in different cells, then we would see a completely flat pattern in this plot. So these peaks actually mean that, for a large number of promoters at least, the nucleosomes occur at precisely the same positions relative to the TSS. And this is specifically the case in the downstream regions.
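The argument about flat versus peaked patterns can be made concrete with a toy simulation in Python. Everything here is an illustrative assumption rather than a model of real data: nucleosomes are reduced to single center positions, the repeat length is fixed at a hypothetical 180 base pairs, and "positioned" means that the first downstream nucleosome sits near a hypothetical +50 with a few base pairs of jitter.

```python
import random

def aggregate_profile(n_promoters, span, positioned, rng,
                      spacing=180, first=50, jitter=5):
    """Sum nucleosome center positions over many promoters.

    Position 0 is the TSS. spacing, first and jitter are hypothetical
    parameters chosen only to make the contrast visible.
    """
    profile = [0] * span
    for _ in range(n_promoters):
        if positioned:
            # Same phase in every promoter, up to a small jitter.
            offset = first + rng.randint(-jitter, jitter)
        else:
            # Random phase: each promoter (or cell) starts anywhere.
            offset = rng.randrange(spacing)
        for center in range(offset, span, spacing):
            profile[center] += 1
    return profile

def peak_to_mean(profile):
    """Height of the highest point relative to the average level."""
    return max(profile) / (sum(profile) / len(profile))

rng = random.Random(0)
random_phase = aggregate_profile(2000, 720, positioned=False, rng=rng)
fixed_phase = aggregate_profile(2000, 720, positioned=True, rng=rng)
```

With random phases the aggregate stays close to its mean everywhere, whereas with a shared phase the counts pile up in narrow peaks, which is the reasoning behind reading the peaks in the plot as evidence for positioned nucleosomes.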
So in response to the high-throughput biology revolution, we came to the conclusion that we needed to completely redesign EPD. This does not mean that we would abandon the basic principles, such as relying exclusively on experimental data, so we would not use sequence-based predictions, or basing the data content on quality-controlled data, and the like. But it meant that we would completely change the procedure to generate the data, and also choose new formats and new ways to link the promoters to experimental data. That is why, in the years 2011 to 2012, we completely redesigned EPD from scratch. This means that for a short period we had basically two databases in parallel, which were both used, but in the meantime EPDnew has completely superseded the old EPD, and so we call it EPD again.
Here are the design principles. Since we did not have a large labor force at our disposal, EPDnew was designed to be a light resource that would be cost-effective in terms of maintenance. Unlike the old EPD database, we focused on model organisms. This has to do with the fact that EPD is now based on data obtained with technologies that characterize the whole genome at once, which was not the case before. So there are relatively few organisms for which we have genome-wide data. Then we switched to automatic extraction of promoter information from high-throughput data. That is obviously dangerous; the danger is garbage in, garbage out. So we complemented the automatic procedures with stringent quality control of the input data, and this quality control is still largely done in an intuitive and manual manner. We also do visual and automatic checks of the output database, and if the output does not look okay, we change the extraction procedure.
We have an entry viewer which is now mostly based on the UCSC Genome Browser, but in this viewer we add homemade custom tracks. Promoter selection and analysis tools are offered by independent resources, namely the ChIP-Seq and SSA servers. Information on promoter structure is no longer hard-coded in the database itself; it is delivered by on-the-fly integration tools accessing the MGA repository. And, mostly for the web server, we made it more interoperable with external resources through direct navigation buttons.
So the goals are in a way very similar. We now see three major objectives. One is to produce an accurate TSS collection for a selected number of model organisms. I probably should have said high quality rather than accurate: high quality in terms of comprehensiveness, higher enrichment in true promoters, whatever that means, and higher accuracy in the TSS mapping itself. The target users are computational biologists and genomics researchers.
The second main part of EPD is now a viewer for promoters as such and for related chromatin data. The viewer is meant to provide a useful selection of CAGE, ChIP-seq, MNase-seq and additional data, and an optimal data display in the UCSC Genome Browser. The target users for the viewer are rather wet-lab biologists interested in specific genes.
And third, we deploy promoter selection and analysis tools via accessory resources. There is still a search page where promoters can be selected on the basis of hard-coded information in the EPD database, but otherwise we delegate selection and analysis to the ChIP-Seq and SSA servers, which can obviously be applied to genomic features other than promoters. In particular, we can select promoters on the basis of motif content, chromatin state, and basically any type of additional genomic information.
And we still offer promoter element data for download from the Mass Genome Annotation (MGA) data repository. The target users for the promoter selection and analysis tools are, of course, everybody: experimental biologists and computational biologists.
Now, this is the most important slide of my talk. It is a busy slide, and it explains in principle everything you have to know about EPDnew in order to understand what it is. It all starts with external data, for instance from GEO, from Ensembl, from EBI. In particular, we need as input CAGE data or similar data, and gene annotation catalogs like the GENCODE gene catalog. These data go through quality evaluation and control, and some never make it into the MGA data repository. On the slide I wrote "EPD/MGA data repository", but the "EPD" is not really necessary, because most of the data in the MGA repository are somehow relevant to promoters.
From there, we do a second selection step to select the input data for the EPD assembly pipeline. This is the computational pipeline that takes some data as input and in the end creates a catalog of promoters with consensus TSS positions. The output of the EPD assembly pipeline is then evaluated. We have a motif-based evaluation, and we also do manual checks. What usually happens is that after the first trial, we are not happy. So the quality control steps provide feedback. Sometimes the feedback prompts us to remove a data set from the EPD source data, because we realize that it is of bad quality or has some other disadvantages, and sometimes we may add some additional data. But most of the fine-tuning goes into the computational assembly pipeline. We repeat these cycles typically two to three times, and then we tend to be happy, or we no longer see any improvement in terms of quality control.
I am now going to say a little bit more about the elements shown on this slide. The first element is the MGA repository.
Giovanna already told you many things about it, so this may duplicate some of that. First, it is an FTP-accessible directory. It is hierarchically organized at three levels: genome assembly, series, and samples. All data are in SGA format, and for each series there is an HTML documentation file which describes the source data, the samples, and also the reformatting procedures that we use to generate our production version of the data. The MGA repository is accessible from EPD, from SSA, and also from the ChIP-Seq server and additional tools we have developed.
The interface is an important point. The interface with the web servers is achieved through machine-readable data description files. Each subdirectory in the MGA repository has two data description files, one that describes the series and one that describes all the samples. The menus you see both on the ChIP-Seq server and on the SSA server are automatically generated by reading these data description files, which are in the MGA repository. If you are curious what they look like, you can just go to the FTP site and look at these data description files.
Note that the MGA provides the source data for EPD, and that is mostly CAGE data and other high-throughput transcription start site mapping data, but it also provides the complementary functional genomics data for promoter analysis and for promoter selection.
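For readers who have not seen SGA files, here is a minimal Python sketch of reading such data. It assumes the commonly described five-column, tab-separated SGA layout (sequence ID, feature, position, strand, count); the talk itself does not spell out the fields, so this layout, the `SgaRecord` type, and the example lines are assumptions for illustration only.

```python
from typing import NamedTuple

class SgaRecord(NamedTuple):
    seq_id: str      # chromosome or sequence identifier
    feature: str     # feature name, for example the assay type
    position: int    # 1-based genomic position
    strand: str      # "+", "-" or "0" for unoriented features
    count: int       # number of tags at this position

def parse_sga(lines):
    """Parse tab-separated SGA-like lines (assumed five mandatory fields)."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Any optional extra fields after the fifth column are ignored.
        seq_id, feature, position, strand, count = line.split("\t")[:5]
        records.append(SgaRecord(seq_id, feature, int(position), strand, int(count)))
    return records

def total_counts(records, feature):
    """Total tag count for one feature name."""
    return sum(r.count for r in records if r.feature == feature)

# Hypothetical example lines, not taken from the MGA repository.
example_lines = [
    "chr1\tCAGE\t1000\t+\t3\n",
    "chr1\tCAGE\t1005\t+\t7\n",
    "chr1\tH3K4me3\t1200\t-\t1\n",
]
records = parse_sga(example_lines)
```

The actual files and their data description files can be inspected on the FTP site mentioned above, which is the authoritative source for the format.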
Now just a very brief overview of the MGA. It contains primary data: ChIP-seq, chromatin accessibility, DNA methylation (not much for the moment), and transcription profiling data. We now call it transcription profiling because it contains only data that are related to the transcriptional process and which are collinear with the genome, so we exclude RNA-seq data that are produced from spliced cytoplasmic RNA. Then it also contains derived data: ChIP-seq peaks, for instance published peak lists like the peak lists from ENCODE, and genome annotations such as promoters, splice junctions and the like. And we also have a very interesting and diverse section called sequence-intrinsic. These are features that are computed from genome sequences only; they do not need any experiments in addition to the genome sequence. Those include cross-species conservation scores and genome variation data, like SNPs for instance.
So let me say a few words on the EPDnew assembly pipeline. The goal is to identify promoters, and promoters are defined not as single transcription start sites but as transcription initiation regions. Underlying this is a certain assumption about the size of promoters. We think that a promoter can be as large as about 100 base pairs, or maybe 150 base pairs. That is maybe what corresponds to the largest transcription start site clusters we observe, clusters that are clearly associated with a gene. For these clusters we try to select the most representative single initiation site. In general we take the one that is most used, but usage is averaged in a specific way over all tissues that are available. So that is how it works.
This is an old slide which refers to version 3 of EPDnew. Now we are at version 6, but the principle is the same. The input data are a gene catalog, at that time we used UCSC Known Genes, and primary TSS data, which are CAGE libraries in general. The primary TSS data undergo a read
mapping step, and then we do a kind of peak calling and identify clusters of transcription initiation sites. The UCSC gene catalog is filtered based on some criteria. For instance, we basically still filter for protein-coding genes; the recent collections of non-coding RNA promoters are based on a separate pipeline which filters for that kind of gene. We also eliminate pseudogenes, genes which are incomplete, genes where the initiation codon is missing, and other strange things. Basically, we map the CAGE tag peaks to genes by proximity. Many of the peaks fall through, either because they are not near any gene or because they are just a little bit too far away from a gene. We require a distance range of about 100 base pairs relative to a gene start annotated in the gene catalog. This leads to a validated list of transcription start sites for each individual CAGE library.
Then we basically do a similar procedure for the TSSs obtained from each sample. We now have many samples, probably more than 1000. We merge these, and we take not the promoter which has the highest number of CAGE tags in total, but the one which was selected as the major promoter in the highest number of samples. That leads to the EPDnew promoter collection, which contains the reference TSS for genes.
So now we go to the evaluation. This slide shows you the primary data for the current version of EPD. We now use GENCODE as the gene catalog. We also use a new data type called RAMPAGE. This is a technology which is based on paired-end sequencing, and it is meant to directly associate the capped 5' end with the transcript to which the 3' end of the same fragment matches. As you can see, the number of tags is pretty astronomical; we now have nearly 40 billion individual CAGE tags. Now to the evaluation. I will show just the motif-based evaluation. We introduced this procedure about 15 years
ago. The principle is that we know that specific motifs are overrepresented at characteristic distances from the TSS. These motifs are, for instance, the TATA box or the initiator. We then say that the height of a motif peak is indicative of promoter enrichment in the collection. Of course, if I say enrichment, we target something like 95 or 98%. And the width of a motif peak is indicative of the precision of the TSS mapping.
So the result of this evaluation procedure is a statistical quality measure for a promoter collection as a whole. It is not a method that would allow you to decide whether an individual promoter is a true or a false one. In that sense it is not an evaluation procedure for promoters; it is an evaluation procedure for the computational pipeline that extracts promoters from high-throughput data. Unfortunately, we do not know the absolute enrichment figure, because we only get relative values, but of course the visual checks we do indicate that enrichment is very high.
There are some caveats with this procedure, especially if we compare promoter collections between different organisms. In the first EPD release, published in 1986, 70% of all promoters contained a TATA box, and today we are down at 10%. Why is this the case? It is because the initial sets were biased towards strong promoters. At that time you could only characterize a promoter if you were able to extract tons of messenger RNA. This is the case for genes like beta-globin, where most of the messenger RNA in the cell encodes beta-globin, and also for hormones, insulin, ovalbumin and the like. This means that one has to pay attention: an enrichment in a motif may also mean that we favor, for instance, highly expressed promoters, which we do not want.
So this is an example. Below is the signal occurrence profile, again generated with OProf, for the current EPD release. For comparison, there is the same figure obtained for gene starts from a gene catalog. You see a very sharp peak, indicating that
the mapping is precise, but as said before, the number of promoters which contain a TATA box is not extremely high. Actually, this is for an older release; it has increased a little bit now. Note that this is a screenshot from the EPD website. The EPD website posts this type of quality report for each data collection.
Now, the second important element of EPD is the EPD promoter viewer. The entry point for the user is a single HTML page with essential information and links. The content of the page is: text-based annotations, genome coordinates, gene symbols and the like; a picture of the genomic context, which is a picture downloaded from UCSC; a link to UCSC for dynamic visualization; and hyperlinks to other resources, for instance SwissRegulon.
The graphical interface is based on the UCSC Genome Browser. It has EPD-supplied custom tracks, for instance for the display of CAGE data, and it also provides a customized view. The implementation is as a public track hub. This is a structure that has been introduced by UCSC: a structured and annotated collection of homemade data tracks in standard genome browser track formats like bigWig and bigBed. A track hub can be temporarily connected to the UCSC Genome Browser by various mechanisms, directly from the browser menu or automatically via HTML links. The EPD viewer is also based on a configuration and annotation file in text format, and all data are downloadable via a URL. The viewer configuration file, which is called a session in UCSC terminology, triggers the connection of the browser to external public or private track hubs. It defines which tracks are displayed or hidden, and it also defines how the tracks should be displayed and in what order. Session files can be uploaded manually or loaded automatically via a URL.
So I think now it is time to go to a demonstration on the web server. Let me try to switch to another screen. Okay, so this is the EPD home page. If you want to access the viewer, you
click on an organism, and then on this page you see some summary information. For instance, you have 29,000 promoters for 16,000 genes, and you have a picture on this side. This is a snapshot of the UCSC view. You can click this picture and then you are in the UCSC browser. This shows a selection of tracks. Some of the tracks are hosted by UCSC; all the others come from the EPD track hub. If you scroll down, you see the EPD viewer hub and the menu of the different components. So let me go through the tracks. At the top, what you see in this example is positioned H3K4me3-labeled nucleosomes on both sides. That's probably because we have two promoters which direct transcription in different directions. Then we have histone modification tracks, specifically H3K27 acetylation, which is supposed to reflect the activity of promoters but also of enhancers. If you are interested in what's going on, you can of course change the display mode: instead of transparent you choose none, and now you see the different tracks separately. You see that this promoter appears to be highly active in HUVEC cells. If you go further down, we see the CAGE data, which provide single-base resolution. This is the distribution if you combine all the CAGE libraries, and here you have a tissue-specific view in dense display for HeLa cells, which looks pretty much the same. We have three collections: ENCODE CAGE, ENCODE RAMPAGE and FANTOM5. For FANTOM5 we have one of the tracks. Now, if you are interested in tissue-specific usage in the CAGE data, then you can display more tracks. For instance, you can activate all of these tracks, and you should keep the display dense, and then you will see lots of tracks. So that's not what I wanted. The display was reset by mistake. OK, so let's try: I don't want full, and I want to activate. OK, let's try again. So this takes a little bit of time
because there are hundreds of libraries, but actually not too much, because the browser only accesses the data of a particular region. But I still don't see it. Scroll down, maybe. OK, sorry, it was put at the end. Here is all the data; you can choose your sample, and you also get an impression of where the promoter is strongly expressed. I would say in this case it's largely a constitutive promoter. You see additional transcription start sites in some tissues, and you can explore them individually. OK, so what else can you do? You can also zoom in on the initiation patterns. Now you see how the transcription start sites are actually distributed, and you see that they are not so sharp. There are several sites. They tend to be the same in different collections, although there are some differences between ENCODE and FANTOM, but the major site is always the same. And this is the site which is indicated by this EPD icon, which shows 10 base pairs of the RNA by a thick bar and then 50 base pairs of the upstream region by a thinner line. OK, then you also have DNase-sensitive clusters, and you have additional data. I will now go back to the promoter database home page. As you can see, we have additional viewers. For instance, this was the viewer for genome assembly hg38, but we also have a viewer for hg19. Now, because I want to look at the same promoter, I click again on this promoter and I use the menu which is below the picture. You will also see that I have the choice between the UCSC server and the European mirror. Usually the data transfer is faster with the European mirror, so if you are based in Europe I suggest that you use the European mirror. Now let's look at the viewer for hg19, because there are some tracks which I like better in the hg19 viewer. I said it's faster than UCSC; well, it's not so fast, I don't know. But OK, so the picture is more or less the same. What we have in this viewer is a ChIP-seq track
which I am now going to expand. No, full or pack is fine. So now we see ChIP-seq peaks from ENCODE, and the color has a meaning: dark black means a strong peak and light gray means a weak peak. You also see these green parts. These green areas correspond to a sequence motif from the Factorbook database. Interestingly, what we find here is CTCF sites with motifs, and we find ELK4, which is a member of the ETS family. This is not a very strong peak, by the way, but here you see another member of the ETS family, which is strong. If you click on this icon, then we get the summary from ENCODE: this peak has been found in four different cell types, and you see that in this one it's relatively weak, but it's a very strong peak in the other three cell types, including the lymphoblastoid cell line GM12878, which has been extensively characterized by ENCODE. You then also see the sequence motif and have some documentation. Not surprisingly, we see somewhere a peak for RNA polymerase. OK, here. So we can also look at this one, and here we see that it's found in a lot of cell types, and you can maybe assume that the coverage by RNA polymerase may also give you some indication about the activity of the promoters. Now, further below is the sequence conservation profile. This is derived from cross-species comparisons. I guess these strong areas correspond to exons, but there are also some spikes in the promoter region. So it's always interesting to check whether some of these short conserved regions exactly coincide with transcription factor binding sites. So let's zoom in on a short region. What you can see here is that apparently there is a transcription factor binding site. It is for ELK4, for ELK1 and for GABPA. Now, these are all transcription factors of the ETS family, and they have virtually identical binding specificity, so the motifs are basically the same. But obviously these are different proteins, and so different
members of the ETS family may recognize the same sequence motif in the same promoter. And what is really nice here is the precise correspondence of the ETS motif and the conservation peak; you can assume that this binding site is probably also conserved. But now, going back to these motifs: if you are interested in which cell types ELK binds to this motif, you click on this, and here are the cell types we saw before. So it's not many, actually. And maybe GABPA is binding this motif in other cell types. Let's see. Not necessarily, because these two we saw before; I think HeLa was not present. So that would indicate that the same motif is recognized by different members of the same transcription factor family in the same cells. Now, I didn't want to give a lecture about the ETS family members, the motifs of these proteins and their role in promoters. I just wanted to show you how you can dig into the biology using the EPD viewer. Obviously, you can do all these things without going through EPD. So we see it as our mission to provide you with a useful subset of tracks that we automatically activate through a session file, and which may be helpful to you in the sense that you get this view as a starting point where you can find interesting information. The UCSC default view is not tuned to a particular user community, so it contains a broader mixture of tracks, and it is then more difficult for ordinary users to find those which are particularly interesting to their domain. So we see our role as a kind of guide to interesting genome information for people who are interested in promoters. Note that we also have a cell-type-specific viewer, for the lymphoblastoid cell line GM12878, and we plan to add more cell-type-specific viewers in the future. Here you have histone marks only for this cell type, and obviously you have all the CAGE data. Then you have DNase I data, and in the DNase I data there is something interesting: it's this single-base
resolution track, which is available from UCSC directly. In principle, this shows you the fine structure of the promoter region. Wherever you have high signal, the DNA is probably not occupied by a protein complex, and where you have low signal, it may be occupied. So black is highly accessible and low signal is protected. And you could speculate, I don't know whether it's true, that here you have a kind of valley, and this may correspond to the area where the pre-initiation complex is assembled on the promoter, and it may coincide with the polymerase peaks. OK, let me switch back to the slide collection. OK, these are just the slides for the promoter viewer; you have seen this. Examples of interesting promoters: this is a case where we have tissue-specific promoters. In particular, there is a strong promoter which appears in two cancer cell lines. It is shown, sorry, on this side, and this promoter is absent in most other cell types. This is related to repetitive elements: they in principle contain promoter regions. They are typically inactivated in some way, but in cancer cells they sometimes become active again. So this is recorded in EPD as an alternative promoter of the main promoter. This is also a quite interesting case. It is from zebrafish. It's CAGE data for a time course, and what you see is that in the zygotic samples... You have to know that in early development, messenger RNA transcription starts at a later stage; during the initial cell division events, protein synthesis relies on maternal messengers. The maternal messengers apparently start here. There is a TATA box, or a kind of TATA box, near this site. And then at later stages transcription initiation switches to this place. So this is just a summary of the promoter analysis tools. I think I can more or less skip this slide. You have seen ChIP-Cor, you have seen ChIP-Extract, and we have also seen OProf. And then these tools or related tools can be
used as promoter subset selection tools. You will see how you can do it in an exercise this afternoon, but basically you can use all the data in the MGA repository for promoter selection, to select subsets which are enriched in a certain feature or subsets that are depleted in a certain feature. You can do this with ChIP-Cor followed by a feature selection menu, but also with FindM. It's a program from the SSA package; it works in a similar way as OProf, but rather than plotting a profile of motif occurrences, it selects promoters that contain a certain motif. And these tools can be used in a cascade. So again, this is partly redundant. For instance, here we look at histone modifications in EPD promoters. This is the input form; we have seen this already with Giovanna. We select EPD (EPD is available from both the SSA and ChIP-Seq server menus), then we select a histone modification from this data set, and this produces this picture. The feature selection tool appears on the output page. This is not the best example to illustrate it, but with this tool you can select the promoters which have more than 100 mapped tags between minus 500 and plus 500; these would be the more strongly labeled promoters. You can also just enter a number under "select top enriched/depleted reference features": you can say you want to select the 1000 top enriched features. And you can switch to depleted feature selection, which allows you to select promoters that are weakly covered by this histone mark, or it could also be a transcription factor site. Now, if you use this tool, then you get a new page where you have the promoters that have been selected, and you can do whatever you want with the promoters that are highly enriched in H3K4me3. You can use ChIP-Extract to look at the heterogeneity which is hidden behind a general aggregation plot. If you extract the tags for individual promoters and you cluster them by k-means, then you see
very distinct classes of promoters: promoters that have strongly positioned H3K4me3-labeled nucleosomes on the downstream side, promoters that have such nucleosomes on both sides, and a number of promoters which still do not have this mark. Now, you can also look at nucleosome data, and this is just a picture you obtain if you analyze the nucleosome positions in yeast relative to transcription start sites. Actually no, it is not yeast, it is fission yeast, S. pombe. And you can use SSA to look at the enrichment of specific transcription factor binding sites in human promoters. That's how you fill out the form for the exercise in the afternoon, and that is the picture you get as a result. Note that you can request the same analysis for shuffled sequences. Shuffling is regional: the computer takes each sequence, cuts it into 10-base-pair windows, and changes the order of the bases in each window. This is supposed to preserve the local base composition, and this is sometimes important because promoters are highly GC-rich, and GC-rich transcription factor binding motifs may be over-represented only because they are GC-rich. But here this is not the case: the shuffled sequences show a flat distribution, and the binding sites are highly enriched. Some concluding remarks. EPD, together with MGA, ChIP-Seq and SSA, can be viewed as a data integration hub and data analysis platform. For instance, if you're interested in a promoter or in a group of promoters, you may ask questions like: in which tissues are they active, what sequence motifs occur, how are nucleosomes arranged, and are there SNPs near the TSS? These answers can basically be obtained on the fly, by accessing data that are in the MGA repository, by running tools, or by on-the-fly motif analysis in genome sequences. And I think this design has a major advantage over a static conventional database: the individual resources can be maintained asynchronously. So if we have a new data
set, a ChIP-seq data set, we don't have to add this to EPD. We simply add it to the MGA repository, and then it will automatically be linked to EPD and analyzable via EPD. We can use software that is useful for other purposes; that also relates to the independent maintenance argument. So we have not necessarily developed ChIP-Seq for EPD, but it is very useful for EPD, and that's why we think there's no point in developing a more specialized tool for EPD. The on-the-fly combination or assembly of information about promoters also makes the information more traceable to the primary data. It offers a lot of flexibility to the end user, and it's labor-efficient at the developers' end. We also try to make EPD open science, and so the general principle is that all components should be open source and publicly accessible. The MGA repository is an open, FTP-accessible directory. The source URLs are given, and the scripts that have been used for data reformatting are available. The web applications post, you may not have noticed it, a link to the script that has been run on the server. So if something funny occurs, you may be able to trace it back. You can also copy the script to your computer and run the same analysis with the command-line tools. We also try to provide the numerical data for all the graphics, so that you can make nicer graphics. And in EPD we try to provide a precise description of the assembly pipelines, although not yet as reproducible files, and of course we link to all the source data. And with this I come to the end of my talk, and I want to say thank you to all the people who have worked so hard on these tools. Reida was our quality control manager; Giovanna did almost everything on the access tools; and those for whom I have no photo, unfortunately, have been the main EPD developers. Romain Groux was a PhD student who had his own project but also helped on many occasions to make our resources better. Now, these are not the people who created EPD in the first place, even
though these are the people who developed EPDnew. But this would not have been possible without the former team members Claude Bonnard, Viviane Praz, Thomas Junier and Christoph Schmid, who also did a tremendous job. They created the first web interface and started to develop the automatic procedures. So I'm not sure whether we have five minutes' time for questions. Patricia, what do you think? I think we have time for questions. Let's get started.
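A note on the shuffled-sequence control mentioned in the talk: the regional shuffling can be sketched in a few lines of Python. This is only an illustration under the assumption that the procedure works exactly as described (non-overlapping 10-base-pair windows, bases permuted within each window). The function names and the example sequence are hypothetical and are not taken from the SSA tools.

```python
import random
from collections import Counter


def shuffle_in_windows(seq, window=10, rng=None):
    """Permute the bases inside each non-overlapping window of `seq`.

    Each window keeps exactly the same bases, so the local base
    composition (for example GC content) is preserved, while any
    motif inside the window is scrambled.
    """
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)  # in-place permutation of this window only
        pieces.append("".join(chunk))
    return "".join(pieces)


def window_compositions(seq, window=10):
    """Base counts per window, used to check that composition is unchanged."""
    return [Counter(seq[i:i + window]) for i in range(0, len(seq), window)]


# Made-up, GC-rich example sequence (hypothetical, for illustration only).
promoter = "GGGCGGGGCCTATAAAAGGCGCGCCAGTCACTTCCGGTGACGT"
control = shuffle_in_windows(promoter, window=10, rng=random.Random(0))

# The control has the same length and the same bases in every window.
assert len(control) == len(promoter)
assert window_compositions(control) == window_compositions(promoter)
```

Because the bases never leave their window, a GC-rich region stays GC-rich in the control. A motif that is enriched in the real promoters but flat in such a control is therefore unlikely to be explained by base composition alone.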