Hi everybody, welcome to the Graduate School Computational Biology Seminar Series. Today we have the pleasure of hosting Philipp Bucher from the Computational Cancer Genomics group at EPFL. Philipp first completed a degree in molecular biology in Zurich, and then completed his PhD in computational biology in 1989 at the Weizmann Institute in Israel, where he worked on the comparative analysis of eukaryotic promoters, as well as on the development of computer algorithms to characterize nucleotide sequence patterns positionally correlated with biological sites. From 1991 to 1995 he worked as a postdoctoral fellow at Stanford University in the US on the statistical analysis of protein sequences, and also at ISREC back in Switzerland, developing sequence analysis algorithms. From 1995 to 1999 he was an associate scientist at ISREC, working on gene discovery and the characterization of new protein domains. From 2000 to 2007, as a senior scientist, he moved more into the computational cancer genomics area. And since 2008, Philipp has been a group leader at EPFL and at SIB. His group is interested in gene regulation, in both healthy and diseased cells. The group is currently pursuing two main research directions: one on epigenetic profiling, and the other on so-called ultra-conserved elements, to crack the still largely enigmatic regulatory code of the human genome. Philipp's group is also involved, as I said before, in developing new algorithms, computer software, web services and databases that help the scientific community extract knowledge and understanding from large volumes of genomic data. So today, Philipp will give us an overview of the bioinformatics resources developed by the Eukaryotic Promoter Database team. Philipp, thank you again for accepting this invitation, and the floor is yours. Thank you very much for this kind introduction. So these are the resources I am going to talk about.
The oldest one is the Eukaryotic Promoter Database, which I actually initiated in the early 80s. It is essentially a catalog of experimentally determined transcription start sites in eukaryotic species. There was a major redesign of the database about five years ago, driven by data trends, and so today it is essentially two databases: the old EPD database, which is a frozen collection, and EPDnew, which is actively maintained and rapidly expanding. Then I am going to talk about the Signal Search Analysis (SSA) server. This is a web interface to a collection of old programs which were developed in parallel with EPD and which are, in many ways, tailored for use with EPD. Essentially, it is a collection of tools for finding and discovering DNA motifs. Then I am going to introduce the ChIP-Seq command line tools and web server. These are new programs that we developed in response to the advent of the ChIP-seq technology, which in my opinion revolutionized research on gene regulation. Then I am going to say a few words about the MGA, the Mass Genome Annotation data repository, which is the data backend of all our servers. And finally, I will present one of the newer tools we developed, not one of these SSA dinosaurs from the 80s, which we make public as a server named PWMTools, the position weight matrix tools server. So this is a packed program; I apologize in advance that I will not be able to explain all the details behind these resources. Now first, EPD. This is just the home page, and this is a brief history. It started, maybe in 1982, as a computer-readable collection of transcription start sites, I think the first one, and it also introduced a new concept, that of a functional position set. What I would like to stress is that EPD has never been a collection of DNA sequences. It has always been a collection of sequence coordinates, references to sequence positions, and in that sense EPD is a very early version of a genome browser custom track file.
Sequence analysis was performed by extracting, on the fly, the sequences around the reference positions. Then in 1986, we published the first paper on the eukaryotic promoter collection, and at the same time the EMBL Nucleotide Sequence Data Library agreed to distribute an electronic version of EPD on the EMBL tapes. In 1997, we introduced a richer format and set up the EPD web server. Around 2002, we started using computational methods to define transcription start sites, initially from EST sequences, from full-length cDNA clones generated with the oligo-capping technology. We called this in silico primer extension because in many ways it does the same as an in vitro primer extension experiment, just in the computer. Then, around 2005 to 2007, we interfaced the EPD web server with accessory web servers, in particular the SSA web server and the ChIP-Seq web server, and this made it possible to analyze promoter sets and subsets of promoters in terms of sequence motifs, histone modifications and other chromatin profiling features, in a very interactive and user-friendly manner. Between 2008 and 2010, we created the Mass Genome Annotation repository. We had accumulated a large collection of useful high-throughput genomics data on our servers, and at some point we decided to organize this collection in a computer-readable way, but also in a way that allows external users to download the data in a standardized format. At that time, we also started the conception of EPDnew, a completely new database with a new schema and new compilation procedures. In 2011, we published the first beta release of EPDnew, and I would say by 2014 it had become a fully productive and expanding data resource. Now, just some essential points. The promoter definition is essentially an experimentally mapped transcription start (or initiation) site cluster.
One of the important assumptions is that a capped 5' end of a cytoplasmic mRNA corresponds to an initiation event and is basically never produced by an endonucleolytic cleavage event. This has been called into question, but as a general rule, with some minor exceptions, it still holds. That is why we can actually use mRNA 5' end data to define transcription start sites. Now, the primary data were RNA sequencing data, nuclease protection data and primer extension data published in journal articles for the old EPD database, and for the new one it is essentially high-throughput sequencing based mRNA 5' end mapping data, like CAGE, or cDNAs obtained with oligo-capping, also called TSS-seq. The content of the old database was based on a critical interpretation of published results, mostly pictorial results, and never on published conclusions. Likewise, EPDnew is based on an independent processing of public high-throughput data. Now, the purpose and the target user community: it is primarily a resource for researchers who would like to investigate promoter sequences, structure and function. It was for some time used as a training and test set for developing and testing promoter prediction algorithms, when this was a fashionable discipline in computational biology. And it is also a resource for experimental researchers who just would like to know more about a particular promoter. So again, EPDnew has basically two components. One is the transcription start site collection, and we try to offer the best one, where best is defined in terms of high enrichment in true promoters, very accurate TSS mapping, and to a lesser extent also by providing links to genes and to genome annotations. The second part of EPDnew is the EPD viewer.
It enables users to look at a particular promoter in a genome browser window, and the goal is to provide a useful, customized data selection: for instance, CAGE data, useful to look at the dispersion of transcription start sites; ChIP-seq data, useful for understanding promoters; and DNA methylation data. So the aim is to provide a customized view and also to display the high-throughput data in the optimal manner, at the optimal level of resolution. Here the target users are rather wet-lab biologists who are interested in specific genes and would like to know more about the chromatin state of the promoters of these genes. Now, some of the design principles. With EPDnew, we aim at complete coverage of model organisms, and we also have the intention to process large volumes of genomics data. That is why we opted for a light resource design, which is relatively cost-effective in terms of maintenance. It is focused on a few model organisms. We also extended the promoter definition a little bit in that we took chromatin structure into account; for instance, a nucleosome-free region defined by MNase-seq data would be an indication that there is a promoter. It uses an automatic compilation pipeline for defining promoters from mass genome annotation data, mostly from CAGE data, and it uses manual quality control of the input data and also of the final product. So we are not really looking at each individual promoter, but we look carefully at the input data to prevent the garbage-in, garbage-out syndrome, and we then also look at the promoters, to see whether they intuitively make sense in terms of chromatin features. And we decided to provide a viewer which is entirely UCSC-based, again motivated by cost-effectiveness in terms of development and maintenance. So once again I repeat: the procedures are different, but the principles are the same.
We base EPD on an independent, critical evaluation of experimental data in journal articles, and likewise for EPDnew we use quality control and independent processing of published data. So we take the raw data; we are not importing already processed clusters or anything similar. Just to give you an idea about the data which go in: these are CAGE tag distributions for two promoters, displayed at single-base resolution. You see a narrow promoter where almost all initiation events happen at a single position; note that we have, I think, 80,000 CAGE tags hitting exactly the same position, so this is really based on large volumes of data. Below we have a broad cluster. Basically, what the automatic compilation pipeline does, in a simplified manner, is to identify these clusters, which may have quite different shapes, then select a reference position and assign it to a plausible target gene. Now, this is EPDnew from the developer's perspective. The data come from public repositories such as GEO or Ensembl; Ensembl we use basically for gene catalogs. Then we make a preliminary evaluation before we incorporate the data into the data repository. From the repository we make a data selection to define the EPD source data, we run what we call the EPD assembly pipeline, and the output is a list of transcription start sites or start site clusters. We then check it in two ways: in an automatic manner, by looking at over-represented motifs that we expect in promoters, and in a manual manner, by looking at randomly selected promoters in a UCSC browser window.
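As a rough illustration of the cluster-and-reference-position idea just described, here is a hypothetical, much simplified Python sketch; the function name and the gap parameter are invented for the example, and this is not the actual EPD pipeline code:

```python
def tss_clusters(counts, max_gap=20):
    """Group per-position CAGE tag counts into clusters separated by
    gaps larger than max_gap, and pick one reference position per
    cluster (the position with the most tags). Simplified sketch.
    `counts` maps genomic position -> tag count."""
    clusters, current = [], []
    for pos in sorted(counts):
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    # reference position = mode of the tag distribution in each cluster
    return [max(c, key=lambda p: counts[p]) for c in clusters]

# toy example: a narrow cluster around 100 and a broad one around 500
counts = {99: 5, 100: 80000, 101: 12, 498: 30, 500: 45, 510: 40}
print(tss_clusters(counts))  # -> [100, 500]
```

A narrow promoter collapses to its dominant position, while a broad cluster is still represented by a single, best-supported reference position.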
Now this is EPD from the user's perspective. There is an FTP site: you can download the coordinates of the TSS positions, you can also download the surrounding sequences, and then you can do the analysis with your preferred software. Or you can use the web server: you can look at individual promoters via the promoter viewer, or you can analyze promoters with the accessory tools. You can either analyze the entire collection, for instance looking at histone marks, or you can select a subset, for instance promoters that have a bivalent chromatin structure in embryonic stem cells, and then you can look for sequence motifs. These are the current totals. We cover eight species: two vertebrates, three invertebrates, one plant and two fungi. For the vertebrate species the coverage is almost complete in terms of genes, and it is based on a large number of CAGE tags, seven billion CAGE tags, an impressive volume. For the other organisms we do not have such good data coverage, so we also have lower gene coverage; for instance, for Arabidopsis we have only 32%.
Now the promoter viewer, which I mentioned already several times. It provides essential information on a single HTML page. There is text-based annotation (genome coordinates, gene symbols, and such things), hyperlinks to other resources, for instance SwissRegulon, mostly promoter browsers, and a picture showing the genomic context. Then there is the graphical interface, which is UCSC-based. It shows EPD-supplied custom tracks, for instance optimized for displaying CAGE data at single-base resolution, and it also automatically loads a customized selection of UCSC tracks that are particularly useful when looking at promoters. So this is the text part of the EPD viewer, and these are the links to the various promoter browsers. Now I'll show you the UCSC view. You have histone marks for the selected cell line, then you have the single-base resolution CAGE tag profile (here again it goes very high, to 60,000 at one position), then you have selected CAGE tag coverage for a number of cell types. You see that in this case, it is the MGMT promoter, the promoter is active in two cell lines of the selection, and in the other ones it appears to be completely silent. You also see that there are no transcripts in the opposite orientation at this promoter. You will notice here a CpG island, which is something useful to see if you look at promoters, and you have conservation tracks and repetitive elements and SNPs. Now this is the standard UCSC view, and you see that here you have much less promoter-relevant information. Now, Signal Search Analysis. This is the home page. It is likewise an old resource; development started in 1984. The general purpose is to discover locally over-represented sequence motifs. Now that is kind of a strange concept. What is it? It is motifs that preferentially occur at a specific distance from a physiologically defined site in DNA or RNA sequences, and such sites may be transcription start sites,
translation start sites, polyadenylation sites, or splice junctions, for instance. There is an important difference to most other motif discovery approaches: a locally over-represented sequence motif is enriched at a specific distance from a functional site, and the distance and the location are not known in advance, whereas most other programs just look at general over-representation within given sequences. There is also an interesting consequence for the results: whereas a standard motif discovery program returns just a motif, the SSA programs always return a motif plus a certain region relative to a functional site. Now, this is a graphical illustration of the concept of a locally over-represented sequence motif. These are sequences aligned on a functional site, for instance a transcription start site, and these are the occurrences of certain motifs. You see that the red one is locally over-represented in a narrow region upstream of the functional site, and the green and the blue ones are also over-represented upstream, but over a larger region, maybe 100 to 150 base pairs, and the enrichment is not that strong. Now, generally, in all SSA programs, motifs can be defined as consensus sequences, for instance TATA for the TATA box motif; a limited number of mismatches can be allowed, and the consensus sequence may include ambiguity codes like R for A or G. Motifs can also be uploaded as position weight matrices, also called position-specific scoring matrices. Essentially, such matrices consist of a table of numbers where each number relates to a particular base at a particular position in a fixed-length motif. There are two numerical representations: the so-called base count or base probability matrices, which indicate the frequency of a base at a particular position, and the true weight matrix, which contains additive scores that are used to compute the score of a particular sequence, which is supposed to be proportional, or rather inversely proportional,
to the binding energy. This is a weight matrix on the left side and the corresponding sequence logo on the right side; it is a weight matrix which I derived for the eukaryotic TATA box in 1990. So we can use the program OProf to analyze the distribution of TATA boxes relative to transcription start sites. What goes in is a list of genomic positions; the program automatically extracts DNA sequences from the genome in a certain range, the user uploads a motif, for instance a position weight matrix for the TATA box, and the program then identifies motif occurrences in each of the sequences and displays the distribution, the occurrence frequency of the motif, in this type of plot. This is the input form. EPD is directly accessible via a menu; it is installed at the back end of the server. You have to specify a sequence range, a window width and a window shift; you can select the TATA box motif, again from a menu; you have to define a p-value, here 0.01, which is the occurrence probability per genome position, so a relatively high p-value; you can specify a background base composition; and the other inputs are not that important. And this is the result that you get. You see a nice sharp peak, but you also see that the peak is not particularly high: it reaches 10%. Now, this is a slide which I already showed you; what I showed you just before is the motif-based evaluation, and you see a nice peak for a good promoter collection. On the next slide I do exactly the same analysis for TSS collections extracted from Ensembl and from the UCSC knownGene collection. For Ensembl we get several peaks and they do not exceed 4%, and for UCSC it looks slightly better. Overall, the conclusion is that, compared to EPD, these collections contain many false positives, or, if they do contain many true promoters, the TSSs have not been mapped precisely.
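To make the notion of such a positional occurrence profile concrete, here is a much simplified, hypothetical Python sketch; it uses exact consensus matching instead of the PWM scoring and p-value machinery of the real OProf program, and all names are invented for the example:

```python
def positional_profile(seqs, motif, window=10, shift=5):
    """Fraction of sequences with at least one exact motif hit in each
    sliding window. `seqs` are sequences aligned on a functional site,
    all of the same length. Simplified sketch of how a local
    over-representation profile can be computed."""
    length = len(seqs[0])
    profile = []
    for start in range(0, length - window + 1, shift):
        hits = sum(1 for s in seqs if motif in s[start:start + window])
        profile.append((start, hits / len(seqs)))
    return profile

# toy data: the motif TATA occurs around position 5 in three of the
# four aligned sequences, so the profile peaks in the early windows
seqs = ["GCGCGTATAAGCGCGCGCGC",
        "ATGCGTATAAGCATATGCAT",
        "GGGGGTATAAGGGGGGGGGG",
        "CCCCCCCCCCCCCCCCCCCC"]
prof = positional_profile(seqs, "TATA", window=10, shift=5)
```

Plotting the second element of each pair against the first gives the kind of curve shown on the OProf result pages.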
This tells us whether the hypothesis is true, and that is exactly the way we optimize our procedures: if we get multiple peaks, then we revise the data selection and the assembly pipeline behind them, and we iterate until we are happy with the result. Now, the ChIP-Seq tools. They were developed in response to ChIP-seq data, as mentioned before. The design principles: simple tools, which should be easy to understand even for non-specialists; fast algorithms, because we do not expect computing power to increase drastically, but we do expect biological data volumes to increase drastically; generic methods if possible, so not overly specialized for ChIP-seq data, so that they can be used for other types of data; and modularity, one program per elementary task, with C programs for the computationally expensive procedures. Now, the web interface. A nice feature is that it provides access to a large collection of public data; you can also upload your own data and combine the two in one analysis, and you can analyze the context of your peaks in terms of histone modifications from ENCODE. It is interoperable with other analysis programs, for instance the SSA server programs, and it provides direct links to send the results to other servers. Now, I will use the following example for illustration. It is data from a landmark paper; the goal was to map STAT1 binding sites. What you have to know about STAT1 is that it usually resides in the cytoplasm and is not bound to DNA there. Upon cytokine stimulation, it moves to the nucleus and binds to a consensus motif approximated by TTCNNNGAA. In this study, they used interferon-gamma-stimulated HeLa cells, and unstimulated HeLa cells as a control.
Now, just to remind you how a ChIP-seq experiment works: you start with chromatin; the chromatin is cross-linked; then you fragment the chromatin, by sonication for instance; you pull down the fragments that are bound by a specific protein of interest using an antibody specific for this protein; and then you sequence the fragments from the ends. Now, if everything works as illustrated here, then the sequence reads mapping to the plus strand of the genome should form a cluster upstream of the binding site, and the ones mapping to the minus strand should form a cluster downstream. Regarding our example, it resulted in about 13 to 15 million mapped sequence reads. Here you can see a true STAT1 binding site and the data. You indeed see two peaks: the green ones are the reads that map to the plus strand, the red ones are the reads that map to the minus strand, and they are shifted by about 150 base pairs. Somewhat different from here: here the clusters do not overlap, here they do slightly. Now, taking into account this particularity of ChIP-seq data, we usually offer tag centering for the analysis. Tag centering means that you shift the tags mapping to the plus strand downstream by about half the fragment length, and the tags mapping to the minus strand upstream. Then you should be able to fuse the two peaks into a single one. Centering is offered as a standalone program, but it is also offered as a pre-processing feature for all other applications. This is the centering menu: you define that you want to use any tag and shift it by 75 base pairs, for instance. This is the result of centering: you see you now have a single peak which falls in between the two original peaks. And if you look at this track, there is an annotated, conserved STAT1 binding site which exactly coincides with this peak. Now, for peak calling we offer a program called ChIP-Peak. It implements a simple algorithm.
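The centering operation just described is essentially a one-line transformation; here is a minimal Python sketch, with an invented data layout of (position, strand) pairs:

```python
def center_tags(tags, shift=75):
    """Shift plus-strand 5' tag positions downstream and minus-strand
    positions upstream by about half the expected fragment length, so
    that the two strand-specific clusters fuse into a single peak.
    `tags` is a list of (position, strand) pairs, strand '+' or '-'.
    Sketch only, not the actual centering program."""
    return sorted(pos + shift if strand == '+' else pos - shift
                  for pos, strand in tags)

# plus-strand reads sit ~150 bp upstream of the minus-strand reads
tags = [(1000, '+'), (1010, '+'), (1150, '-'), (1160, '-')]
print(center_tags(tags, shift=75))  # -> [1075, 1075, 1085, 1085]
```

After the shift, reads from both strands pile up over the presumed binding site.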
It takes as input centered tags, typically, and the output is a list of peak positions. Note the difference to other peak finders, which typically represent peaks by a begin and an end position; we report only the center position. Now, to describe it very briefly: it considers only positions in the genome where there is at least one mapped tag, which already reduces the search space considerably. Then it evaluates the cumulative tag frequencies in a window around these positions, and it retains as peaks those windows which have a tag coverage higher than a threshold and which are maxima within the vicinity range. There is an option to refine the peak position based on the tag distribution within the window. Now, this is the interface. We select the STAT1 data from the menu. We define centering. We use a window width of 300, and the vicinity range is also 300. You can define the peak threshold in relative terms, as relative enrichment over the average tag frequency. With these input parameters you get this output. The output page reports the number of peaks; you can download the peak list in various formats and directly go to GREAT for analyzing GO term enrichment of the genes in the vicinity of the peaks. Then you can extract sequences, or you can pass the peak list to other applications, for instance. Now, one of the questions when you have generated a peak list is whether the peaks are true binding sites, and one way to answer this question is to look for motifs. Again, we can essentially do the same as we did for promoters: if these are true peaks, then we expect the STAT1 binding motif in the middle of the peak. Now, these are just the peak numbers that we get with different absolute tag thresholds, and obviously with a low threshold we probably have many false positives, while with a high threshold we probably miss many true positives. I am not going to show the detailed analysis.
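The simple algorithm sketched above (candidate positions with at least one tag, window counts, a coverage threshold, and a local-maximum test within the vicinity range) might look roughly like this in Python; this is an illustrative reimplementation of the idea, not the actual ChIP-Peak code:

```python
from collections import Counter

def chip_peak(positions, window=300, vicinity=300, threshold=10):
    """Simplified peak finder: consider only positions carrying at
    least one tag, count tags within +/- window/2 of each candidate,
    and keep candidates above the threshold that are local maxima
    within the vicinity range. `positions` is a list of centered tag
    positions; returns peak center positions."""
    counts = Counter(positions)
    candidates = sorted(counts)
    half = window // 2
    # cumulative tag count in the window around each candidate position
    cover = {p: sum(c for q, c in counts.items() if abs(q - p) <= half)
             for p in candidates}
    peaks = []
    for p in candidates:
        if cover[p] < threshold:
            continue
        # keep only local maxima within the vicinity range
        if all(cover[p] >= cover[q] for q in candidates
               if 0 < abs(q - p) <= vicinity):
            peaks.append(p)
    return peaks
```

For example, `chip_peak([100]*12 + [260]*3 + [5000]*11)` reports the two well-covered positions 100 and 5000 and discards the weakly covered neighbor at 260.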
I am just going to show what we get with our peak list. We define a range of minus 500 to plus 500; here we use a window of 100, shifted by 5, and we use the consensus sequence TTCNNNGAA. And that is the picture we get: a very beautiful peak. It goes up to 30%. This is an underestimate of the true proportion, because on the one hand STAT1 tolerates an occasional mismatch, and we were searching the motif without mismatches, and secondly a window of 100 does not capture the entire range; it is a little bit too small. So the true enrichment is more something like maybe 60 to 70%. Next, we could explore the genomic context of the STAT1 peaks. Are they associated with certain histone marks, for instance H3K27 acetylation, H3K4 trimethylation, which is a promoter mark, or H3K27 trimethylation, which is a repressive mark? To answer this question, we upload the peak list to an application called ChIP-Cor, and then we look at the distribution of histone modification ChIP-seq tags around the peaks. Basically, ChIP-Cor takes as input two types of genomic features: one we call the reference feature, the other one the target feature. For each reference feature, the program counts the number of target features in the surrounding genomic region, from position begin to end. This region is further subdivided into windows of size w, and from these numbers the program generates a so-called aggregation plot, which shows the target feature frequency as a function of the distance from the reference feature. The output is a graphic plus a file containing the numerical data, which you can download. This is just the input form; it looks similar to the other input forms. Here we go for a larger region, minus 5,000 to plus 5,000; that is one of the differences. And we use these ENCODE data for HeLa cells, obviously, because the STAT1 experiment was carried out in HeLa cells, but these ENCODE data are from unstimulated cells.
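The aggregation computation that ChIP-Cor performs can be sketched as follows; again a hypothetical, deliberately naive Python illustration of the idea, not the real implementation:

```python
def aggregation_plot(references, targets, begin=-5000, end=5000, w=500):
    """For each reference position, count target positions falling in
    the region [begin, end) around it, binned into windows of width w,
    and accumulate the counts over all references. Returns a list of
    (window start relative to the reference, count) pairs. Sketch of
    the aggregation-plot idea only."""
    nbins = (end - begin) // w
    bins = [0] * nbins
    for ref in references:
        for t in targets:
            d = t - ref
            if begin <= d < end:
                bins[(d - begin) // w] += 1
    return [(begin + i * w, c) for i, c in enumerate(bins)]
```

Normalizing each count by the number of references and the window width turns this into the target feature frequency curve shown in the plots.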
So these are unstimulated cells, but that is not wrong, as we will see. This is the result we get. I downloaded the text output and superposed the curves in this graphic. You see a strong enrichment in H3K27ac, and also a pretty strong enrichment in H3K4me3 in the flanks. Here is some interpretation. We see this valley, which is kind of conspicuous; we think this indicates that the STAT1 binding sites are nucleosome-free, so we have a depletion of tags pertaining to histone modifications. Also, since we have a strong H3K27ac mark, we think that STAT1 probably prefers enhancer regions over promoter regions. And the lack of enrichment of the repressive mark is consistent with the assumption that STAT1 is an activating transcription factor, which never represses a gene. An interesting aspect is that this is from non-stimulated cells, suggesting that the sites to which STAT1 binds are already nucleosome-free and surrounded by activating marks before STAT1 moves to the nucleus. So these histone marks somehow pre-select the regions to which STAT1 can bind. Now, very briefly, another application of ChIP-Cor, here using EPDnew as input. We use as target feature CAGE tags from an embryonic stem cell line, and then we get this profile. On the output page we can use a sub-menu which enables us to select certain reference features, promoters in this case. We select those which have more than 10 CAGE tags between minus 50 and plus 50, and as a result we get about 12,000 promoters. So these are promoters that are at least minimally active in embryonic stem cells. And then we can obviously go further and analyze what happens to these promoters in derived cell types. Now, very briefly, the MGA repository. It is organized into data series; the series concept is similar to the series concept in GEO.
A series contains a number of data files, which may be of different types, for instance ChIP-seq and MNase-seq. Then there are two configuration files, one that describes the series and one that describes each individual sample, and there is also an HTML file, a human-readable file, which provides documentation of the data. From these configuration files the server automatically builds its data menus; that is how it works. But the whole archive is also accessible just by FTP, and you can download all the data in a standardized format. This is the current content: more than 11,000 samples in total, most of them for human and most of them ChIP-seq, as you can see. The large numbers here mostly come from the ENCODE and ENCODE-related programs. But you have all types of data: there are also DNase-seq, MNase-seq and DNA methylation data, and various kinds of genome annotations (you can select splice junctions as reference features, for example), and we have sequence-derived features like conservation scores or SNPs. It is a very rich collection. Very briefly, the PWMTools: this is a new package, a collection of PWM-related tools. What is different from the SSA tools is that on the one hand they use different formats, and secondly some of the programs can use ChIP-seq (functional genomics) data and sequence data together. Now, the most useful tool is PWMScan. The input is a whole genome, which you select from a menu, plus a motif, which is supplied as a PWM or as a consensus sequence. The output is a complete list of all motif occurrences in the corresponding genome. The output functions are similar: you can download this list, send it to other applications, or extract the sequences around the motif occurrences. Now, the method works as follows. The motif is expanded into a list of qualifying sequences: for instance, a PWM of length 9 is replaced by all 9-mers that reach a score above the threshold. This may be a collection of a few thousand sequences.
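The expansion step just described can be illustrated with a small Python sketch; the matrix and threshold are toy values, and unlike the real tool, which prunes the search, this naive version scores all 4^k possible k-mers:

```python
from itertools import product

def expand_pwm(pwm, threshold):
    """Enumerate all k-mers whose additive PWM score reaches the
    threshold; these are the 'qualifying sequences' that can then be
    mapped to the genome with a fast read-mapping tool. `pwm` is a
    list of {base: score} dicts, one per motif position. Illustrative
    sketch only."""
    kmers = []
    for combo in product("ACGT", repeat=len(pwm)):
        score = sum(col[b] for col, b in zip(pwm, combo))
        if score >= threshold:
            kmers.append("".join(combo))
    return kmers

# toy 3-column matrix favoring the consensus TGA; a threshold of 2
# admits the consensus plus all single-mismatch variants
pwm = [{'A': -2, 'C': -2, 'G': -2, 'T': 2},
       {'A': -2, 'C': -2, 'G': 2, 'T': -2},
       {'A': 2, 'C': -2, 'G': -2, 'T': -2}]
hits = expand_pwm(pwm, 2)
```

Lowering the threshold enlarges this list, which is why very degenerate motifs at permissive p-values defeat the read-mapping strategy.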
We then use fast read-mapping software to map these sequences to the genome: either fetchGWI, which was developed by Christian Iseli some time ago at the SIB, or Bowtie. For long, degenerate motifs this approach is not effective, so we have a backup program which implements a standard scanning algorithm. These are some performance figures. You see, for STAT1 at a p-value of 10 to the minus 5, fetchGWI takes one second to retrieve all STAT1 motif occurrences in the genome; it is very fast. For somewhat tougher cases, like STAT1 at the lowest threshold, or CTCF, which is a long motif, at the 10 to the minus 5 threshold, Bowtie becomes more effective. And for very difficult cases, like CTCF, a very long motif, at a low p-value threshold, with probably a billion qualifying sequences, the fast read-mapping approach fails; even Bowtie takes a long time, and the backup program with its conventional algorithm still finishes in reasonable time. Okay, just a few things I want to mention about the future. I think we will try to extend EPDnew to new model organisms, maybe another plant species and some other vertebrate species, because it is very interesting to compare sequence motifs in different species. We will also work on the MGA repository; in particular, we are somewhat under-populated with invertebrate data, so we would like to add more Drosophila data and also more DNA methylation data. Then we make a continuous effort to improve our documentation and online tutorials. And finally, I wanted to call your attention to an upcoming event: we are giving a one-week course introducing the use of our resources in April. Now, these are the people to whom I owe my thanks. René Dreos is in charge of EPDnew and is also largely in charge of the data archive, and Giovanna Ambrosini has developed the ChIP-Seq, SSA and PWM tools servers.
The other group members have also contributed pieces to the software, and they have been very helpful with quality control work. And with that, I thank you for your attention.