Hello everyone. So yes, I will cover a bit of downstream analysis with the data sets we've been generating in the past modules, and I'll cover integrative and online tools: online resources where you can find data sets, and that you can use to help you analyze your own data. The learning objectives in this module are: first, explore components of downstream analysis that can be done with epigenomic assay data; then, discover sources of publicly available data sets that can be used in anyone's project; identify challenges of using public data sets in one's analysis; and finally, learn about online portals and tools that can ease epigenomic data analysis. We'll start by talking a bit about downstream functional analysis, then move on to working with public data sets, challenges of large-scale analysis, and online visualization and analysis tools, and at the end we'll cover an introduction to the Galaxy platform, which lets you run a lot of the tools we've covered in the workshop so far through a web interface.

So, downstream functional analysis. This repeats a lot of what has been said so far, but: protein-coding genes account for about 2% of the overall genome, which means 98% of the genome does not encode protein sequences. However, more than three quarters of the genome gets transcribed, and nearly half of the genome is accessible to gene regulatory proteins such as transcription factors. So putting in context the information we have on variants, methylation profiles, histone modifications, transcription and so on can ease our understanding of the underlying biology.

With what we've done so far, we've covered ChIP-seq analysis and methylation analysis. Once we're done executing the tools covered in the lab, we have, for instance, a set of peak calls for a ChIP-seq assay, or methylation levels at CpG sites for bisulfite sequencing assays. Next, we can use this data to run some functional analysis, for instance by comparing different regions from the same data set, multiple samples of the same group, or different groups against each other. I think Guillaume covered this already this morning, but for methylation, for instance, once we get the methylation profile for one given experiment, we can push things a bit further by comparing differentially methylated sites across samples, across groups of samples, across regions of the genome and so on.

I put here one example of such an integrative analysis, from a paper by the Roadmap Consortium published a couple of years ago, where components of different types of epigenomic assays were put together to compare different regions of the genome: classifying regions by their profiles of DNA methylation, ChIP-seq signal, DNA accessibility and so on.
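To make the comparison idea above a bit more concrete, here is a minimal sketch using bedtools, a tool we haven't run in this workshop; the file names are hypothetical:

```bash
# Hypothetical example: compare ChIP-seq peak calls between two samples.
# Peaks present in both samples (overlapping by at least 1 bp):
bedtools intersect -a sample1_peaks.bed -b sample2_peaks.bed -u > shared_peaks.bed

# Peaks found only in sample 1:
bedtools intersect -a sample1_peaks.bed -b sample2_peaks.bed -v > sample1_only.bed
```

The same pattern extends to group-versus-group comparisons by first merging the peak files of each group.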
Another thing we can do, for instance with ChIP-seq data, which we've talked about a lot in the past modules, is identify motifs in the genome. Motifs are short, recurring patterns in the DNA that are presumed to have some kind of biological function. A motif is usually a series of nucleotides that repeats in the genome; the occurrences don't have to be exactly identical from one to the other, but can allow for slight differences. A motif often indicates a binding site, for instance for a transcription factor that binds there to start transcribing a region of the genome. In this example, if I allow for one base mismatch, there are two occurrences of the motif in the sequence I'm providing here, highlighted in bold black and blue. So using regions previously labeled as peaks, as we've done before with MACS2, we can try to identify motifs. Identifying transcription factor binding sites can be useful for things such as understanding regulatory networks, transcription mechanisms and so on.

We'll cover that in the last lab of this workshop, where we'll run a tool called HOMER. HOMER is a motif-finding tool which tries to identify regulatory elements enriched in one set of sequences compared to another. It allows for discovery of known motifs, using its own database of existing known motifs, and it also tries to identify novel motifs: motifs which are not in the database but are found as a recurring pattern in the sequences provided to the software, which it then attempts to match to patterns that are already known.

The command we will use in the module is called findMotifsGenome.pl, and it attempts to identify motifs in the genomic regions provided in a table. The input file is basically a BED file; we've covered in the workshop what BED files are. It's a list of regions in the genome with, here, at least six columns. BED files can actually have many configurations, but the one expected by HOMER is: the first column is the chromosome, then the start and end positions, column four is a unique peak ID, column five is just ignored, and column six is the strand on which the region was identified. We also provide a reference genome assembly and a fragment size to use for motif identification.

What HOMER does is verify the peaks provided in the BED file you give as input, extract the sequences from the genome corresponding to those regions, calculate the GC content of the peak sequences, pre-parse genomic sequences of the selected size to serve as background sequences, randomly select background regions for motif discovery, normalize, check for enrichment of known motifs, and try to identify novel (meaning new) motifs. In the exercise we'll do, we'll get two result files. We actually get more than that, but the two that are interesting for us are the two HTML reports displaying the known motifs identified in the provided BED file, and the novel motifs.

Another tool we'll cover in the lab later is GREAT. GREAT is an online tool, so you don't need to install anything; you can just run it from your web browser. What it does is look for significant GO enrichment. GO stands for Gene Ontology, and the Gene Ontology is a set of structured, controlled vocabularies used for annotating genes, gene products and sequences. We'll cover this tool shortly. This figure is from the GREAT paper; it illustrates that GREAT looks beyond just the proximal regulatory regions when trying to identify genes of interest. What we'll provide to GREAT is a BED file and the reference genome that we want to run it on. Unfortunately, GREAT is not that up to date at the moment: it's not possible to run it on the hg38 reference assembly. Hopefully that will be addressed in the future, but for now we'll run a data set on the hg19 reference genome.
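To make the HOMER step concrete, here is a minimal sketch of a findMotifsGenome.pl run; the file names and output directory are hypothetical, and flags such as -size should be adapted to your own experiment:

```bash
# Hypothetical example of a HOMER motif discovery run.
# peaks.bed columns: chromosome, start, end, unique peak ID, (ignored), strand
#   e.g.  chr1    713200    713500    peak_1    .    +
findMotifsGenome.pl peaks.bed hg19 homer_output/ -size 200

# The two reports of interest afterwards:
#   homer_output/knownResults.html   (enrichment of known motifs)
#   homer_output/homerResults.html   (de novo motifs)
```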
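And since GREAT just takes a BED file uploaded through your browser, preparing its input from a MACS2 run can be as simple as the following sketch; file names are hypothetical, and the region cap is an assumption to stay within GREAT's upload limits:

```bash
# Hypothetical example: turn MACS2 peak calls into a BED file for GREAT.
# narrowPeak is BED6+4; GREAT only needs chrom, start, end and optionally a name.
cut -f 1-4 sample1_peaks.narrowPeak > sample1_for_great.bed

# If the peak list is very large, keep only the strongest peaks
# (column 5 of a narrowPeak file is the score):
sort -k5,5nr sample1_peaks.narrowPeak | head -n 20000 | cut -f 1-4 > top_peaks.bed
```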
And so, what we'll get as output is a list of GO terms for things such as molecular functions, biological processes, phenotypes, diseases and so on. This is an example, and we'll run this exact example later in the lab, of the biological processes list I can get by executing the software. Yes? [Audience question about the background set.] So yes, you can provide the whole genome and it will just consider the whole genome as the background, but there's also an option to specify a particular background set that you want to use.

Well, I've mentioned this already, but the NIH Roadmap is a good example of an integrative analysis project. The next part is working with public data sets. There are a lot of very large data sets available on the internet at the moment that you can use to run your own analyses, either against your own data set, or just by collecting public data sets to answer the questions you're interested in. Many of these large consortia offer data for different diseases, different tissues, different phenotypes and so on. These resources are usually free and can be used in anyone's project. The downside is that you have no control over the quality of the data sets, how they're organized, and that kind of thing; we'll come back to that a lot in this presentation.

One of the earlier sources of epigenomic data sets was the Roadmap Epigenomics Project, which has its own portal where you can navigate and discover reference epigenomes for different types of tissues and so on. They even have an online resource, Genboree, where you can run some bioinformatics analysis tools on the available data sets. There's also the ENCODE Consortium, which has its own portal offering ChIP-seq, methylation, RNA-seq and all kinds of assays, run on a lot of cell types for both human and non-human data sets. The GTEx project focuses on gene expression, trying to identify relationships between genetic variants and gene expression; if you're interested, the URL is provided here.

And nowadays the most complete epigenomic resource for data sets to work with is the International Human Epigenome Consortium, IHEC for short. IHEC is an international effort, basically a consortium of consortia. In Canada there's the CEEHRC, which has two nodes producing data. In the U.S. there's ENCODE, and there used to be Roadmap. In Europe we've got DEEP and BLUEPRINT. We've also got consortia from Japan, Korea, Singapore, Hong Kong and so on. IHEC's primary goal is to provide reference epigenomes for a variety of normal and diseased tissues. A reference epigenome is basically a set of assays: RNA-seq, whole-genome bisulfite sequencing, and ChIP-seq for six histone marks. I think Martin presented what are considered to be the six main histone marks, which are fundamental to ChIP-seq analysis. Beyond that, IHEC is a group of committees, different working groups trying to define standards to improve things such as how data sets are analyzed, how to better produce and distribute data, how to run successful integrative analyses, working groups on ethics and so on. And I'm just listing here the assays I just mentioned, which are considered the core set of epigenomic assays.

[Audience question.] So, you could say that a reference epigenome is for one sample, pooled or not, usually not pooled: one sample for one given tissue, and then you have all of these assays run on that sample.
So it could be, for example, that you extract B cells from the blood of one individual, and on that sample you run whole-genome bisulfite sequencing, RNA-seq, and all of those assays. It's a way to give you different epigenetic angles on the same sample. And usually you get a set of metadata identifiers describing the condition of the subject: is this from a patient with a given disease, what age group, and these kinds of things. So you can know that for an individual matching those criteria, in this specific cell type, the epigenetic profile might look like this. That's basically the main goal of IHEC: to offer this big picture of an epigenome.

As I was saying, IHEC is a consortium of consortia, meaning the data is organized in a decentralized way. There isn't one central database collecting everything and offering it to the community; it's basically up to each individual group, so a data production group such as, say, the Canadian group offers its data sets on its own server. All of these groups store their raw data and metadata in controlled-access repositories such as the EGA, and then they run analyses on the raw data to produce what we call processed data. This is data that's considered non-personally identifiable, meaning it can be released on a website with no risk that a study participant could be re-identified, and these processed data sets are usually offered without restriction for you to analyze. The processed data sets are then aggregated in a portal published by IHEC, the IHEC Data Portal. Users can go to the portal and visualize what IHEC has to offer, but ultimately, if they identify data sets of interest, what they will want is to obtain the raw data to start analyzing things on their own. What they have to do then is apply to the data access committee at each of the institutions that produced the data they're looking for; the data access committee will study the request, maybe ask a few questions, and usually grant access to the data so you can download it.

So I'll talk briefly about the IHEC Data Portal. The goal of the portal is to integrate the public data sets, the processed tracks I was mentioning before, published by IHEC. Again, raw data is not available on the IHEC Data Portal because it's considered personally identifiable; to protect the privacy of study participants, these data are stored in controlled-access repositories and you have to apply to a DAC in order to obtain them. As of October 2017, IHEC has published over 10,000 human data sets, where one data set is one epigenomic assay on one given sample. Over 230 data sets are non-human, and we have at the moment over 294 reference epigenomes, meaning one sample, one cell type, with all of the assays I was mentioning before. There's data from eight consortia at the moment. The portal offers tools for data set discovery, visualization and pre-analysis; we'll come back to that a bit later.

Recently, the main IHEC website started offering a directory of the epigenomic analysis tools published by different IHEC members: tools for data discovery, visualization, integrative analysis and so on. If you follow the URL provided here, you can get the list of tools; it's getting to be quite extensive.
You can use the filtering options here to select just what's of interest to you.

I want to go back a bit to the notion of public versus controlled-access data. Data sets published by IHEC are, by definition, accessible for anyone's research, but they fall into two categories. The first one is controlled-access data: as I was saying, that's the data considered personally identifiable. These are raw data from the sequencers, so I'm talking about FASTQ files, BAM files and so on, which are considered more sensitive, along with clinical and sensitive information such as phenotypes. For example, if I have a data set about a rare disease and there's enough information to potentially re-identify a patient, this kind of data is supposed to go into controlled-access repositories. These data are usually archived at sites such as the European Genome-phenome Archive or dbGaP. The public data is the de-identified annotation tracks, such as bigBeds, bigWigs and so on: the kind of annotation that you can visualize in IGV, the UCSC Genome Browser, the WashU browser and so on. Some of the donor, sample and library metadata is also considered fully public meta-information when it's non-personally identifiable, and these are freely downloadable.

Next, quality control on epigenomic data sets. As I was mentioning, data sets come from multiple sources, and the data you can obtain is of variable quality. Normally, if you want to do your analysis properly, the first thing you should do is assess the quality of this data. We've seen a first example in this workshop with FastQC, where I can assess the quality of my read sets in FASTQ files. There are other flags that can be checked to decide whether a data set looks good or not, such as the signal-to-noise ratio. I give an example here where two data sets for a ChIP-seq experiment come from the same institution: one of them shows clear peaks (this is at a very zoomed-out level, which is why the peaks look so thin here), but in the other you can see there's much more background noise. So there are always things to assess when you start looking at public data.

In the case of IHEC, one of the tools that can be used to assess quality is the portal's correlation tool, which lets you assess how similar the data sets that interest you are, using a Pearson correlation computed over the whole genome. I'm giving an example here where, for the same cell type and the same assay, one RNA-seq track weirdly seems anti-correlated with all of the other RNA-seq data, which is highly suspicious. You can then do your own investigation to figure out why, or you can simply choose not to use that data set. The Assay Standards Working Group has defined a set of quality metrics to be published by data producers alongside the raw data; the list of metrics and how to compute them is available at this link, and these are gradually starting to be published alongside the data in IHEC.
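If you want to run a similar sanity check yourself on tracks you've downloaded, one way to do it, just a sketch with hypothetical file names, and not necessarily how the portal computes it internally, is with deepTools:

```bash
# Hypothetical example: genome-wide Pearson correlation between signal tracks,
# to flag suspicious data sets before using them in an analysis.
multiBigwigSummary bins -b rnaseq1.bw rnaseq2.bw rnaseq3.bw -o signal_bins.npz

plotCorrelation -in signal_bins.npz --corMethod pearson \
    --whatToPlot heatmap -o correlation_heatmap.png
```

A track that comes out anti-correlated with its supposed replicates, like in the example above, is worth investigating before you build on it.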
So as an input, Chrome Impute uses maybe a group of samples which are for a given chip-seq histone mark and other marks from the same sample and then you can compute these tracks which are expected signal maybe for a given chip-seq experiment and you can compare it to your own data set to see if you would be expecting. It's a bit of a laborious process to execute those, so we won't cover it in this lab, but it's just good to know. Next, I want to cover the challenges of large-scale analysis. When you want to run data analysis using public data, especially for large data sets, there's multiple challenges that you're going to face. One of them is obtaining access, because as I was saying, when you want to access raw data, the first thing you need to do is to write a bit about your project and who you are and these kind of things to a data access committee who will evaluate whether you deserve to obtain access to the data. So depending from one consortium to the other, from one project to the other, the requirements will be highly variable. In some cases, it's raised straightforward and in other cases, it can take quite a while. Downloading the data from a control access repository can prove to be challenging. Within IHEC, we've faced that a lot so far. Another thing is comparing data sets across projects can often be difficult, not because the data is not good or anything, just because the way the metadata is stored and offered can change a lot from one project to the other. The way the files are organized in the data set can vary a lot, too. So it can take quite a lot of housekeeping to make things work together. And of course, analyzing the data can turn out to be requesting a heavy use of resources, which is why we're often using HPCs for such type of analysis. So I'm just showing this slide coming from Jan Jolistock at the last IHEC annual meeting last year where there's this integrative analysis project with IHEC where we were aiming to download data coming from all the IHEC member consortia and analyze them together with a unified set of pipelines and these things. And just to show you, each column here is one IHEC member consortium, the type of things that are requested when you apply to a DAC to actually have access to the data. So you can see some things like acknowledgments are required by all sides. If you use my data sets, you should acknowledge in your paper that it's coming from us. In some cases, there's not so many restrictions while in other cases a lot of things, agreements and information are requested by each participant. So as I was saying, downloading the data can be a very long endeavor. For large data sets, downloading the data can take several months, in some case even longer than that. For big data sets, large amount of space is required. For instance, we don't have exact numbers yet, but to download the whole IHEC data, over 100 terabytes of space is expected to be needed. And that's just to download the raw data locally. Once you start analyzing it, you need much more space than that with all of the intermediary files and so on being generated. Analysis are often a processor and memory intensive, so it's not the kind of thing you will run on your own laptop. And several resources exist to address this issue, such as like Compute Canada, that we've been using so far, and commercial solutions, such as Amazon Web Services and so on. 
To work around these problems, IHEC currently has two initiatives. As I mentioned, there is the Assay Standards Working Group, which right now is also preparing a set of unified bioinformatics analysis pipelines. One of the issues at the moment is that if you go to a resource such as the IHEC Data Portal and you want to visualize data coming from the BLUEPRINT consortium and from the CEEHRC consortium in Canada, that data has been generated with different bioinformatics pipelines, different flags, different parameters, and the end results can be, let's say, less comparable. By changing the way you analyze your data, the output can be very different, and that's why within IHEC we are currently preparing pipelines that will be applied in the same way to all data sets produced in the consortium; that should at least remove some of the analysis artifacts in the data you can visualize. These pipelines will be available as Singularity containers, executable in any high-performance computing environment. There's also the EpiMap project, from the Integrative Analysis Working Group. The main goal of the EpiMap project is to prepare a gold-standard data set using the uniform pipelines produced by the Assay Standards Working Group, to develop ways to ease access to the data, and basically to learn from the experience of obtaining all of the IHEC data and analyzing it, removing obstacles so that the broader public can download and use the data themselves.

Now I'll cover online visualization and analysis tools. There are many resources available online. So far we've done a lot of things at the command line, and we've downloaded tracks to visualize them with IGV; with Misha, we also covered the UCSC Genome Browser a bit. In this section we'll cover some of the tools that are available for data discovery and download, visualization, and data analysis.

I already talked about the IHEC Data Portal; I'm just giving you an overview here, and we'll come back to it in the lab, because the data sets we'll use for GREAT and HOMER, for instance, come from the portal, so we'll learn how to use it and how to fetch information. The data is organized in a two-dimensional grid, where the columns are the assays offered for a given epigenome and the rows are the individual tissues. And there's the correlation tool I was talking about, showing how correlated two data sets are over the whole genome. We can see in this example that histone marks that are supposed to be anti-correlated actually are, and that marks that are supposed to be correlated with each other are as well. The portal offers a tool to download the data, which we will use in the lab. Tracks are served directly from the portal's server, with yearly snapshots of all the tracks, so even if a consortium's server dies or a consortium disappears, the tracks will be kept for as long as the portal exists. You can also create permanent sessions to cite, for instance, in papers that you publish: let's say you made some great finding, you can create an IHEC Data Portal session to cite all of the data sets you used in your analysis. This is the kind of report you can get. And there's a web API to obtain all of the metadata connected to a given track or data set.
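Just to illustrate what using such a metadata API looks like, here is a sketch; the domain, endpoint and query parameters below are purely hypothetical, so check the portal's API documentation for the real ones:

```bash
# Hypothetical example of querying a data portal's web API for track metadata.
curl -s "https://portal.example.org/api/tracks?assay=H3K27ac&tissue=blood" \
    | python -m json.tool    # pretty-print the JSON response
```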
And what's coming in the next couple of months is the ability to create community hubs. Basically, this is a way to let groups which are not main members of IHEC integrate their own data sets. Let's say you've just published a paper with about 20 data sets on H3K4me1 and you want those to be available to the greater community: there will be a way to integrate that into the portal, with information on the paper you published, your lab, and these kinds of things.

Going back to the ENCODE portal: it displays all the data produced by the different phases of ENCODE, one, two, three, four. There's also a grid showing what's currently available for different cell types and different assays, and it's possible to visualize results in the UCSC Genome Browser, the Ensembl browser and so on. For data discovery there's also the GTEx portal, from the GTEx Consortium I was talking about before; it's specifically aimed at visualizing results from that consortium, with different tools to explore their data sets. DeepBlue is very focused on epigenomics: it's a portal that lets you discover data available from a few of the IHEC member consortia, but it adds tools such as building tracks on the fly for regions that interest you. Let's say you're interested in one specific region of one chromosome: instead of downloading the whole-genome data sets to your servers and then manually extracting what interests you, you can use DeepBlue to say, I'm interested in this region, or in a specific DNA sequence motif, and the portal will build this data set for you to download, instead of you having to obtain everything. You can do that from its web portal, or there's even an R package that lets you do it from the R language.

We've already covered the UCSC Genome Browser in the workshop; it's a way to visualize annotation tracks from different types of epigenomic assays over the whole genome. So I just wanted to talk quickly about UCSC Genome Browser track hubs. At some point, your experiments will have generated a set of annotations that you want to be viewable in the UCSC Genome Browser. By creating a track data hub, you basically build a set of files on a server of your own, and you can then provide the URL to the UCSC Genome Browser and it will display your tracks alongside whatever annotation tracks are already available in the browser. What you need for that is to put the bigWig, bigBed and similar track files on a server, and to prepare text documents that organize this data; you can then use this URL in the UCSC browser to display your tracks. That's what I'm saying here: you can use this to easily distribute your data sets to collaborators and so on. The downside is that you need to generate these text documents, which can be a bit convoluted at times; I link to the documentation on how to do this here. But as I was saying, it's not completely straightforward: you have to generate text files saying, okay, I have this track, its type is bigWig, plus some metadata about the track, so that the UCSC browser knows how to render it properly.
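To give you a sense of what those text documents look like, here is a minimal sketch of the three files a hub needs; the names, labels and server URL are hypothetical:

```
# hub.txt -- the hub's entry point
hub MyLabHub
shortLabel My lab tracks
longLabel ChIP-seq tracks from my lab
genomesFile genomes.txt
email contact@example.org

# genomes.txt -- one stanza per assembly
genome hg19
trackDb hg19/trackDb.txt

# hg19/trackDb.txt -- one stanza per track
track sample1_H3K4me1
bigDataUrl http://myserver.example.org/tracks/sample1_H3K4me1.bw
shortLabel sample1 H3K4me1
longLabel H3K4me1 ChIP-seq signal, sample 1
type bigWig
visibility full
```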
Just to mention it, we've recently developed a tool that lets you do this in a more automated way from a web interface: you just answer questions about your data sets on a web portal, and it generates a UCSC Genome Browser track hub for you that you can load in the browser.

Beyond the UCSC Genome Browser, there's also the EBI, which produces the Ensembl browser with its own set of features; both browsers are better or worse for different types of uses. There's also the WashU Epigenome Browser. Both of these browsers are also compatible with UCSC Genome Browser track hubs, using the same documents to display the annotations. The WashU Epigenome Browser is a bit more focused on epigenomic assays, so it has more features for organizing data sets specifically for these types of assays.

Still online, for data analysis, there's the Galaxy platform. Galaxy is a web-only framework offering a user-friendly interface to the most popular bioinformatics analysis tools. We've covered a lot of tools so far in the different labs, and it's good to have learned how to run them from the command line; but some people like to work from the command line and others prefer an interface that asks for the inputs as you need them, and Galaxy is there for that. It's data-intensive biology for everyone. In the same way as you would run steps from the command line, like we did, you have a data set, you run a tool, you get an output, you feed it into another tool, and that series of steps is your bioinformatics analysis pipeline. You can do the same kind of thing with Galaxy: the tools we will cover give you a kind of pipeline, but from a user interface.

The good thing is that it allows for reproducible results. I don't remember if this has been mentioned so far in the workshop, but as you experiment with tools and parameters, it's always a good idea to keep track of what you've done. If you run a specific software with a given set of flags, continue your analysis over the following days, and in the end get your result, you might reach the point, if you haven't written anything down, where you want to reproduce your results but simply can't, because you don't remember what you actually did from the beginning. Even if it's not a script, keep the commands; keeping the commands you used for a project is a big part of good bioinformatics practice, and it's definitely what you should do when you run things from the command line. In the Galaxy interface, this is an additional feature you get for free: as you run tools, a history bar keeps track of everything you've done, with the full list of parameters. It lets you know exactly what you did to obtain your final data set, so you can reapply the recipe; you can extract it and reapply it to any other sample you want in the future. We'll cover that.
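From the command line, the minimal version of that discipline is just keeping every step in a script. Here's a sketch with hypothetical file names, where re-running the script reproduces the result with the exact same flags:

```bash
#!/bin/bash
# run_analysis.sh -- a record of the analysis, kept alongside the project.
set -e

fastqc sample1.fastq.gz -o qc/                          # read quality report
bwa mem -t 4 hg19.fa sample1.fastq.gz > sample1.sam     # alignment
samtools sort -o sample1.sorted.bam sample1.sam         # sort for downstream tools
macs2 callpeak -t sample1.sorted.bam -g hs -n sample1   # peak calling
```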
I'll talk quickly about GenAP. GenAP is a Canadian computing platform for life science researchers. It uses the CANARIE high-speed network and Compute Canada HPC resources. Basically, it's a way for people to use their Compute Canada allocation through tools that are easier to approach: web-interface tools and that kind of thing. It's free for Canadian academia; all you need is a Compute Canada account.

GenAP also offers a set of command-line bioinformatics analysis pipelines, if you're more the type to use things from the command line. These include pipelines for RNA-seq, RNA-seq de novo assembly, ChIP-seq, and methyl-seq, which is the same as whole-genome bisulfite sequencing. All of these, like the software we've been using in the labs so far, are available as pre-installed components on Compute Canada resources. I provide a URL here if you're interested in knowing more.

Back to Galaxy: GenAP has its own flavor of Galaxy, to run the Galaxy tools using your Compute Canada allocation. In the same way as you'd run tools from the command line and gradually use up your allocation, you can do the same from Galaxy. And it's a bit faster than usegalaxy.org, because usegalaxy.org is the main Galaxy website where everybody connects, so resources there are more limited. If you have a beefy allocation at Compute Canada, this is maybe the right place to run things. If you're interested and you have a Compute Canada account, you can go to genap.ca, use your Compute Canada login and password, and it will work directly in GenAP; then you're ready to go, you can launch Galaxy and start using it. There's another resource called Data Hub, which is basically a place for your UCSC Genome Browser track hubs: if you don't have a web server to serve your tracks, you can deposit them there and they will take care of that for you.

And so, I want to spend the rest of the presentation exploring the Galaxy web platform with you. Galaxy, as I was saying, integrates hundreds of tools which are typically used from the command-line interface, from basic file operations, like removing one specific column from a tab-delimited file, to complex analysis jobs. All the tools we've covered so far in this workshop, whether FastQC, MACS2 and so on, are available from the Galaxy interface as well.

[Audience question about file size limits on the main usegalaxy.org server.] I think the basic allocation there, I don't know if it has changed, is something like 50 or 100 gigabytes. I'm not sure whether Galaxy itself has an issue with very large files; I know that for many web resources, when a file is larger than two gigabytes, there can be technical issues completing the transfer. That said, you shouldn't face that issue with the GenAP Galaxy instance specifically; we've tested it with much larger files. But there can be tons of reasons why a transfer didn't complete, and it's sometimes a bit hard to investigate. [Follow-up about whether the GenAP Galaxy instance bypasses these upload problems.] Yes: you can either do the upload from the web interface, or you can just drop your files into your Compute Canada directory and that instance of Galaxy will find them, so you wouldn't have the transfer issue you've been seeing. But again, there can be many reasons, and they have to be eliminated one by one. As I was saying, most of the main tools you might want to use are available from the Galaxy interface.
And in the case of GenAP Galaxy, all compute jobs use your yearly allocation, and you get a report from the web interface about job status and these kinds of things. Each tool has its own interface to input the parameters that you might want to provide for your execution. For instance, FastQC quality control can be executed from Galaxy, and you can consult the report in the same way, without having to move files from the cluster to your own computer; that can be a nice shortcut.

Galaxy also allows for pipeline design. There's a graphical user interface that lets you stick pieces together and say: I have this FASTQ, and what I want is a bigWig showing the coverage of my RNA-seq experiment. You plug the pieces from one software into the other, and after that you just have to run your workflow on the input file and you'll obtain your track at the end without much more user involvement. Galaxy gradually evolves, too; there are always new versions, and the newer versions allow you to do much more, such as running the same pipeline over a large number of files instead of having to redo the same thing again and again.

Here's a quick walkthrough of the interface. Each tool has its own input interface; I'm giving the example of BWA for alignment here, and the reason I use this one is that I like what it says at the bottom: you have to know what you're doing, even if you're not running the tool from the command line. It's the same in Galaxy. As Marcin was saying in his presentation, it's garbage in, garbage out: if you don't know the tools and you just use the defaults without understanding the parameters, the answer you get at the end is probably not what you're looking for. Using Galaxy doesn't remove the need to understand what the tools' parameters are doing.

From the left sidebar there's an interface listing all the tools that exist, with a filtering option: if I specifically want to use BWA, it will list all the tools related to BWA. That's a kind of shortcut. There are different ways to get data, either by uploading from your web browser, or by extracting annotations from sites such as the UCSC Genome Browser; for example, if you want to do something with the list of common variants from dbSNP for a specific version, you can do that.

So I wanted to do a quick walkthrough of the Galaxy interface before we move to the lab. This is the instance of Galaxy we'll be using for the lab today, which is running on GenAP resources: an instance of GenAP Galaxy, running on the same server you've been using at the command line in your labs so far. As I was saying, on the left side there's the list of all tools available within Galaxy, categorized by tool type, and you can filter it: say I'm looking for BWA, then it filters the tools the way I want. The central window is where you will, for instance, input your parameters and visualize your results. And the sidebar here is the history bar, which is what I was talking about: it gives you the list of all of the steps you've executed in your analysis, with the parameters and so on.
So I'll give you a small example. Let's say I want to download from the UCSC Table Browser the list of common variants for, say, chromosome Y. Maybe I went a bit fast there: from the interface, I go to Get Data, and there's this option to download all types of annotations from UCSC using the Table Browser. If I click here, it shows me the Table Browser interface. I say that what I'm looking for is the list of variants, and I'll download the list of common SNPs, just to show you a small example of tool execution. I'll ask for all common SNPs on something not too big, so chromosome Y; it will take the whole chromosome, and once I'm ready, I get the output and send the query to Galaxy.

Now I have a task that appears in my history bar. There are different colors for that: when the task is gray, it's waiting to be executed; yellow means it's running at the moment, so right now Galaxy is fetching the information; once an item becomes green, it was successfully executed; and if it's red, something crashed, something went wrong, and I need to check the report. By clicking here you can get more information, for instance via the little "i" information icon, and I can visualize my results directly if it's a text file or an HTML file. So here are the common SNPs I asked for on chromosome Y, in a BED-style tabular format. As you execute more tools, they are added to the history bar, and then you'll be able to extract the steps you did, build a workflow out of them, and reapply it to other samples. Those are things I will cover at the end of the lab: soon we'll start the lab, and at the end I'll redo it with you and show you additional tips on how to make your Galaxy experience better.

To conclude the presentation: in this module we've covered some types of downstream analysis with epigenomic data, how to obtain publicly accessible data sets for your own analyses, some of the challenges of using public data sets, how to visualize data sets using online tools, and some tools for doing things not from the command line but from a web interface. The lab will cover HOMER, GREAT, Galaxy, and explore a bit how to use the IHEC Data Portal to discover data sets. I include here the link to the main Galaxy server; an account is free, and as I was saying, I think you get about 50 or 100 gigabytes to upload things, so you have space to get started, not to run full-fledged analyses over a lot of samples, but at least to get a feel for things. And if you didn't get the message already from the presentation: if you're in Canadian academia, get a Compute Canada account, and then you can start using tools such as Galaxy through GenAP.

[Audience question about allocations.] Yes, exactly, it's using your own allocation: if you get a certain amount of compute cycles, using the cluster directly as well as using Galaxy both draw on it, since they both go through the scheduler. We submit jobs to the scheduler in the same way, so unless you have a special situation, with different allocations in the same lab and those kinds of things, in the usual case they will all go toward the same allocation and compete for it. [Question about default parameters.] Yes, there's a set of default parameters which are good for most cases, but you might have situations, for example very large data files hitting the wall time, where that's not the case. In that case, I encourage you to contact GenAP support; that's the kind of thing they can work out with you, to make it work with your own data.