So, welcome to this last module of the workshop, on downstream analysis and online tools. First, as we mentioned when we introduced ourselves on the first day of the workshop, a lot of what I do in my work is to organize data and make it reusable by others. You may have heard of the FAIR principles for data sets: how to make genomic and epigenomic data Findable, Accessible, Interoperable, and, what was the R again, Reusable, there you go. There are a lot of fancy terms just to say that there's a lot of work, once the data has been produced, to organize it in ways that others can reuse it. So all the data that's been generated, for instance, by the International Human Epigenome Consortium, once it's out there, once it's released in the public archives, resources and portals and databases and APIs need to organize this data to make it discoverable and reusable by others. That's why in a significant portion of this module I will present tools and data sources that, rather than being on the command-line side, are available online.

I may as well get out the learning objectives for this module. First, we'll explore a few components of downstream analysis that can be applied to epigenomic assay data. As I was saying, we'll also discover sources of publicly available data sets that can be used in your projects. It's one thing to generate a data set, and it's a lot of work; then you need to analyze it, and to get meaningful results we often need a very large data set size. There are a lot of public sources of genomic and epigenomic data sets out there, so we'll cover a few of them. We'll also identify challenges of using public data sets in your own analyses, and we'll learn about some online portals and tools that can ease epigenomic data analysis. The sections of this module correspond more or less to these objectives.

So first, downstream analysis tools. Over the past two and a half days we've seen ways to analyze epigenomic data at the level of ChIP-seq and at the level of whole-genome bisulfite sequencing, and of course there's a whole lot of other assays that could be done, at the level of the transcriptome with RNA-seq, the 3D structure of the DNA, et cetera. Once these experiments have been pre-processed, we of course want to put the information we get out of them in context, to answer the specific questions we're looking into. As we've seen in the previous modules, once primary data analysis is completed for a given epigenomic assay, we have, for instance, at the level of ChIP-seq, a set of peak calls, like the BED files we've seen, and at the level of whole-genome bisulfite sequencing, methylation levels at various CpG sites, and so on. This processed data can be used to run functional analysis tools. There are many such tools; here we'll cover two of them. One is motif detection with HOMER, and the other is Gene Ontology term enrichment, or GO-term enrichment, with GREAT.

First of all, a little introduction to motif detection. What are motifs? They're short, recurring patterns in the DNA that are presumed to have a biological function. If we have a look at the lower part of the slide, there's a sequence of DNA; if you look at it carefully, you can identify regions that are repeated multiple times.
Like in the example: if we allow for a one-base mismatch, there are two distinct motifs that can be observed here, each labeled in a specific color in the text. These can indicate sequence-specific binding sites for proteins such as nucleases and transcription factors. So, using the regions previously labeled as peaks, like the bigBed or BED files that we've got, we can try to identify motifs. And we'll do that in the lab as well: we'll take the result of one ENCODE ChIP-seq experiment for transcription factor binding sites, and we'll try to identify what the motifs might relate to.

For that, we'll use HOMER. HOMER is a command-line tool that we will be running on your AWS instance. The goal of HOMER is to identify regulatory elements enriched in one set of sequences as compared to another. It's an algorithm for DNA sequences, and it has its own database of known motifs: it tries to match the sequences you provide against the known motifs in the database, and it also tries to identify novel motifs. HOMER comes with a bunch of tools; the one we will use is findMotifsGenome.pl, which attempts to identify motifs as I previously described. The two inputs will be a BED file containing the regions of interest, so that's the peak file, and the reference genome assembly. We've covered this in the previous lectures and labs, but a BED file is basically a text file with multiple columns. Traditionally, a BED file can have a variable number of columns with variable content; there's a specific structure that HOMER expects in your BED file, and the columns are identified here.

So these are the HOMER execution steps. First, there will be some kind of smoke test to validate that the peak file, the BED file that has been provided, has proper content. Then HOMER will extract from the genome the sequences corresponding to the regions in the input file: for the specific chromosome and start/stop positions in the BED file, it will get the genomic sequences from the relevant reference genome and try to identify motifs. Once the sequences are extracted, it calculates GC and CpG content and preparses the genomic sequences at the selected size; you can specify the window size that will be used for the analysis, and if you look at the HOMER documentation, it recommends a different window size for different types of analysis. HOMER will then randomly select background regions for motif discovery, run autonormalization, check for known motifs, and try to find novel motifs as well. Two reports will be generated out of this: one on the novel motifs it has found, and the other on known motifs that already exist in its database.
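Roughly, the invocation looks like this. This is a minimal sketch, assuming hg38 as the reference, a hypothetical peak file peaks.bed, and a hypothetical output directory homer_out/; see the HOMER documentation for the recommended -size for your assay type:

```bash
# HOMER's BED input needs chrom, start, end, and a unique peak ID in
# column 4; per the HOMER docs, column 5 is ignored and column 6 is strand.
head -2 peaks.bed
# chr1    1000500    1000700    peak_1    0    +
# chr1    2003000    2003200    peak_2    0    +

# Run motif discovery on the regions against the hg38 assembly.
# -size 200 is a common choice for transcription factor ChIP-seq.
findMotifsGenome.pl peaks.bed hg38 homer_out/ -size 200

# Results land in homer_out/: knownResults.html (known motifs)
# and homerResults.html (de novo motifs).
```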
Okay, now jumping to GO-term enrichment. We can look at the biological significance of peaks, or of any list of regions in a BED file, using Gene Ontology annotations. The Gene Ontology, GO, is a set of structured, controlled vocabulary for community use in annotating genes, gene products, and sequences. There are other tools for this, but the one we'll use is called GREAT. It's an online tool: a website to which you connect, where you upload a BED file and choose your reference genome, and it runs the analysis in the background. So this is not a command-line tool, and of course, being a public resource, the size of the BED file that you upload has to be limited, to save on computation power and so on.

Just a few additional words about gene ontologies. The ontologies themselves are not just a controlled vocabulary. A controlled vocabulary would be, for instance, a predetermined list of terms for a specific domain, without necessarily any relationships among the terms. An ontology gives you, for that domain, the list of terms plus the relationships between terms. We have an example on the chart here, which comes from the Gene Ontology website: if I look up the GO term GO:0043299, which is leukocyte degranulation, I can see the relationships between this term and a bunch of other terms at the level of molecular functions, cellular components, and biological processes. Gene ontologies allow you to characterize the specific biological function of what you're describing, its location in an organism, and so on. As I was saying, there's a main website for the Gene Ontology that you can explore, with a lot of useful tools to establish relationships among terms.

So GREAT, you're right to know, is what I was talking about previously. Once again, the input for this tool will be a BED file with the regions of interest, and ideally little else. If the results of GREAT seem a little too vague or unsatisfying, it can be because, for instance, you took a file that has peaks over the whole genome. If you have peaks everywhere, it's difficult to establish that some specific GO terms belong to your request. So it's good to do some kind of cleanup first. The output will be the matching GO terms for molecular functions, biological processes, phenotypes, et cetera. We have an example below with a ChIP-seq histone H3K27ac experiment: we took the peaks, and the biological sample was a bone marrow sample. Running this gives us a bunch of biological processes in which the sample seems to be involved, and it matches well with the type of sample that was submitted, H3K27ac being a mark of active enhancers, as we've said before.
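Since an over-broad input tends to dilute the enrichment, a quick command-line pre-filter can help before uploading. A minimal sketch, assuming a hypothetical peaks.bed with a score in column 5, as in standard BED; the 10,000-region cutoff is an arbitrary illustration, not a GREAT requirement:

```bash
# Keep only the strongest peaks so the GO enrichment isn't diluted
# by weak calls scattered across the whole genome.
sort -k5,5nr peaks.bed | head -n 10000 > peaks.top10k.bed

# Trim to plain chrom/start/end/name columns before uploading.
cut -f1-4 peaks.top10k.bed > peaks.for_great.bed

# Sanity check the region count before going to the website.
wc -l peaks.for_great.bed
```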
Okay, now I'll give a little note, because in the lab we will be using the set of UCSC utilities a couple of times. In the other labs you've generated BED files, and we've seen the content of a BED file in the slides, but from public resources, what you will often obtain is bigBed or bigWig files. These are basically binary, indexed versions of the BED or wiggle files that you could generate. This is done for space efficiency, and also to make these annotations usable in tools across the internet. The formats were originally developed for the UCSC Genome Browser, and being binary, indexed files, they allow you to stream portions of a genomic file over the internet for visualization in a browser located somewhere other than where your file is. So there's a set of tools that are useful for converting between file formats, such as bedGraph to bigWig, BED to bigBed, bigBed to BED, et cetera. These tools have been pre-installed on your AWS instance, so you will be able to use them during the lab.

Some examples I include here: bigBedToBed allows you to convert a binary, indexed bigBed file to an ASCII, text-formatted BED file. bedGraphToBigWig allows you to take a text bedGraph file, one of which was generated in lab 4 with Hector previously, and convert it to the bigWig format, which can be visualized in online browsers such as the UCSC Genome Browser. bigWigMerge allows you to merge multiple bigWigs: if you have signal tracks that you want to put together for whatever reason, you can use it to merge the files. bigWigInfo gives you some basic information about your file, and sometimes that's useful to assess why a bigWig file is not behaving properly. What I have in mind there is, for instance, a case where we were analyzing data from multiple bigWig files and one of them was not behaving nicely; by checking the information on the file, we realized that a bunch of chromosomes were missing from it.
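As a quick sketch of how these conversions look on the command line, with hypothetical file names, and assuming a chromosome-sizes file for your assembly (the UCSC utility set includes fetchChromSizes to obtain one):

```bash
# Turn a binary bigBed back into a plain-text BED file.
bigBedToBed annotations.bb annotations.bed

# Convert a bedGraph to bigWig; the input must be sorted by
# chromosome and start, and chrom.sizes lists chromosome lengths.
sort -k1,1 -k2,2n signal.bedGraph > signal.sorted.bedGraph
bedGraphToBigWig signal.sorted.bedGraph hg38.chrom.sizes signal.bw

# Merge several bigWigs; note the merged output is a bedGraph,
# which you can convert back to bigWig with the command above.
bigWigMerge rep1.bw rep2.bw merged.bedGraph

# Inspect a bigWig: chromosome list, coverage, min/max, etc.
bigWigInfo signal.bw
```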
Okay, for this section, I'll just finish by giving a few examples of integrative analysis efforts. There are many large-scale consortia that take epigenomic data from multiple types of assays, such as whole-genome bisulfite sequencing, RNA-seq, histone ChIP-seq, and so on, and try to answer biological questions out of this. One of them is the NIH Roadmap Epigenomics program. And as Martin talked about previously, within the International Human Epigenome Consortium there's currently the EpiAtlas effort, which has taken all of the data sets produced by all members of IHEC and is re-analyzing them together, to even out, I guess, the rough edges of raw data sets being processed by different groups and not being analyzable together. So IHEC is currently preparing a gold-standard data set, if you want, of about a thousand submitted epigenomes, which have been standardized and processed, analyzed, in standardized ways. This includes the metadata being evened out a little bit: sometimes one group talks about "B-cell" and another group about "B-cells" with an s. There's a lot of metadata wrangling involved, and a group within EpiAtlas has spent a lot of time evening things out across groups. It will also offer a method to ease access to the raw data. Hopefully it should improve the overall experience of accessing and analyzing IHEC data, and this should be coming out in spring 2024.

So, working with public data sets. As I was mentioning before, large consortia such as IHEC offer data sets from multiple tissues, multiple diseases, multiple conditions, phenotypes, et cetera. They offer them for the scientific community to use in its own research, and these resources are generally free and can be used in anyone's project. One caveat, I guess, is that public data sets have variable levels of quality, and the metadata and annotations on the data sets can have variable levels of quality as well. That's something to keep in mind. Here are a few examples of large-scale initiatives providing epigenetic data sets online for anyone to use in their project. One of them is the Roadmap Epigenomics project. Another is the ENCODE consortium, which is now pretty much over, but the ENCODE 4 phase of the project has made several thousands of epigenetic experiments available and freely accessible. These are downloadable from their website for anyone to use. The GTEx project is interesting because it provides transcriptomic data, but it also provides the variants in the genomic data of a given sample, and it offers tools to explore the interaction between variants on the genome and the expression level of specific genes. The Human Cell Atlas is another important source: another of these consortia of many different groups that agree together to use a specific set of standards to annotate their data and make it available through a resource such as the HCA.

And of course there's the International Human Epigenome Consortium, which is one of the most complete epigenomic resources. It is an international effort with several funding agencies; each country or set of countries you see here is or has been part of IHEC at some point. In Canada we have the CEEHRC; in the U.S. there were ENCODE and Roadmap, which were part of the initiative; in Europe we had the BLUEPRINT and DEEP consortia; in Asia we had CREST and KNIH; and I'm forgetting a few, but the reference epigenomes that IHEC is working on are the fruit of multiple members. So IHEC is a consortium of consortia, internationally, as I said, and it provides standardized reference epigenomes for a variety of normal and disease tissues. At different points in time, different committees or working groups have worked on given standards: in the beginning, I remember discussions on what the proper ways of running a ChIP-seq are, what kind of coverage should be offered; in terms of the data ecosystem, how you annotate the data, with a metadata schema for describing your donors, your biological samples, and the experiments that were run on these samples. There's a group on ethics as well. Nowadays the most active group is the integrative analysis group, which is working on the EpiAtlas.

It looks like David froze. Is it frozen for anyone else? Yes, it looks like he's frozen for me too. Okay, I'll send him a message separately to let him know that maybe his connection is down. Give me a few seconds. Yes, he seems to have dropped out of the call, so while we wait for him to come back, does anyone have any questions? Oh, he looks like he's back now.

Sorry for that, I don't know why Zoom cut out. Any questions so far before we continue? I'll just share my screen again.

Okay, I had a question about the GREAT tool. It seems like most of the descriptions on your slides were related to GO analysis, but does GREAT have the ability to give granularity at a per-gene level, in terms of regulators? Or would you advise against that if I was interested in specific interactions with a specific gene of interest? Would you say, don't use GREAT, maybe use something else?

Oh, good question. One thing I can answer is that if what you're interested in is a specific region or specific genes, you could submit BED files that are specific to the things you're interested in. But maybe someone else knows about a resource they use for these types of analysis. One thing to mention also, to answer the question: say you have a gene in mind,
and you're interested in its interactions with specific biological processes and those things. There's a bunch of tools online to do this kind of exploration as well, and I believe the Gene Ontology resource has a way to explore the biological processes that involve a specific gene of interest, so this is something you could look into.

Right, I'll have a look at those resources. Thanks, David. You're welcome.

All right. Okay. So, I'm not sure exactly where the slides cut; did I reach this already? All right. As I was saying, IHEC has this concept of full reference epigenomes. You take a biological sample, whether it's a piece of tissue, a cell line, or a specific primary cell type, et cetera, and members characterize these samples over a set of epigenetic assays, including whole-genome bisulfite sequencing, RNA-seq, and six ChIP-seq histone marks, also providing the input to be able to run peak calling and so on.

A couple of words on the IHEC data integration and sharing strategy. As I said, IHEC is a consortium of consortia, which means that it doesn't have a centralized data coordination center that receives raw data from the various groups and just merges and analyzes them together. Basically, the way it works is that each of these gray boxes is a different data production group that submits its own data to data repositories: in some cases the EGA, in some cases DDBJ, dbGaP, and so on. In the case of data sets like ENCODE, as I was mentioning, this is fully open data, without restrictions such as having to request access. Whereas, this being human data, and considered very sensitive data because of that, a lot of the consortia have decided to put their data in controlled-access repositories; EGA and DDBJ are examples.

So the data is deposited in those repositories. It is then processed using standardized data processing pipelines. Although there have been some discussions on which kind of pipeline should be used in which situation, each group has been using the pipelines that seemed relevant to them, which means the processed data has been prepared a little differently in each group. More or less it says the same thing, but we've seen over the years that it can mean there are some artifacts, some biases, in the processed data, the bigBed and bigWig files, that are due to the differences in the way the data has been processed. This data is nonetheless the fully publicly accessible portion of the data produced in IHEC. Therefore, that's the data that is taken and organized in a data portal, the IHEC Data Portal, which we will cover today as well. Users go to that portal to discover the data made accessible through IHEC. Once data sets of interest have been identified, let's say you're interested in all brain epigenomic samples, the epigenetic assays run on all brain samples available in IHEC, a data access request can be placed with the proper data access committee. These processes usually take just a matter of a couple of weeks to get acceptance, and once a research project has been approved, data can be downloaded from resources such as the EGA.
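Repositories like EGA have their own download clients, but wherever you end up pulling large files over plain HTTP or FTP, a pattern like the following helps. This is a generic sketch with hypothetical URLs and file names, not an EGA-specific procedure:

```bash
# Resume-friendly download: -c continues a partial transfer instead
# of restarting a multi-hour download from zero.
wget -c https://example.org/datasets/sample1.signal.bw

# Most portals publish checksums next to the files; verify before
# building any analysis on top of the download.
wget -c https://example.org/datasets/checksums.md5
md5sum -c checksums.md5
```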
IHEC is also the nest, maybe, if I can say, of many tools that have been developed by its various members, for data discovery, at the level of visualization, at the level of data analysis, and those things. I invite you to have a look at these tools.

Now, going back a little to the data made available by IHEC. There's what I talked about previously, that is, the public data: these are the annotation tracks, the processed data, that are fully accessible on IHEC servers and that are explorable using a browser such as the UCSC Genome Browser or Ensembl; you can even download the tracks to your own machine if you want to visualize them with IGV. There is also metadata that's deemed publicly distributable on donors, samples, and libraries: anything considered non-personally identifiable is distributed in the open on the IHEC website, and yet again this is freely downloadable. Everything regarding the raw data, the raw output from the sequencers, is deemed personally identifiable information. Therefore, with a few exceptions, this data is placed in controlled-access repositories such as EGA and dbGaP, and if you want to access it, you have to follow the process I mentioned before.

Moving on to challenges with the analysis of public data sets. Using online data sets of interest, once they've been identified, can, as I mentioned, bring its share of challenges. The first one is that, as I said, you have to apply to obtain access. I'm putting this chart here because, just within IHEC, you have all of these data-producing consortia, and each of them has specific requirements in terms of what they expect a research project to do, or to be able to provide, in order to be accepted to obtain the data. So getting data for IHEC as a whole used to be a little bit of a challenge; one of the great things coming ahead with the EpiAtlas data set is that this is an obstacle we have tried to remove for people trying to use the IHEC data sets.

Then, transferring data from the repository: data download can obviously be a very long endeavor, sometimes up to several months. It's not that transferring very large amounts of data is slow on your side, the downloader's side; in the case of IHEC, if you have over 100 terabytes located on EGA servers, the bottleneck, the slow side, can often be the data servers themselves.

Comparing data sets across projects: as I said, metadata is often hard to collate across projects because it hasn't been collected, organized, and harmonized in the same way. The way an experiment is described in, say, the Human Cell Atlas project versus a data set I identified on GEO can be very different. In one case the experiment, the molecule, the biosample source, and those things are very well explained; in the other, less so. If you try to analyze all of this together, sometimes you'll come out with surprises. Experimental methods for a given assay are not always the same, and the results have varying levels of depth and quality, and so on.

Of course, the last challenge is that analyzing large data sets on local resources can be very intensive in terms of processor, memory, and storage space. Unless we're talking about really small data sets, it's not something you can usually do on your own laptop.
Fortunately, there are a lot of resources existing nowadays to help with this issue, one of them of course being commercial solutions such as AWS, which you are using in this workshop. In Canada, there's also the Digital Research Alliance of Canada, which can provide access to very powerful compute clusters and allows you to run large-scale analyses on powerful computers that you could not run otherwise. This is a free resource made available to Canadian researchers, so if you're doing research in Canada and you don't know about the Digital Research Alliance of Canada, I suggest you look into it.

I talked about challenges; I can talk quickly about the IHEC initiatives that help with some of them. First of all, the EpiAtlas project. I mentioned earlier that the processed data on the IHEC Data Portal has been processed by different pipelines in different ways. What the EpiAtlas project has done is download all of the data from all the member consortia, centralize it in one place, and analyze it together, with the data analysis done using uniform data analysis pipelines. If you're curious, I include the URLs of the three pipelines that have been used to process the ChIP-seq, the RNA-seq, and the whole-genome bisulfite sequencing data. And once all of this is done, there's this gold standard coming ahead, with harmonized metadata and access. Once this is finalized, the processed data will all have gone through the same pipelines, so it should be even more comparable than it is now, and this release will also be put on the IHEC Data Portal; it will most likely be the portal's next release.

There's also the EpiShare project, which ran over the last couple of years, aiming to improve standards and methods to share and discover available epigenomes. Among other things, it developed an infrastructure of nodes to share data sets and store the data securely, and it worked on data analysis containers that can be executed at the storage location.

I can talk quickly about the EpiVar browser, which was one such project developed within EpiShare. Basically, as I said, genomic and epigenomic data means super large data sets, and these large data sets can be cumbersome to centralize in one place to ask your questions. So we built an online tool that allows you to explore the effect of given variants on a set of epigenetic features. For instance, in the screenshot, I give a specific variant of interest, and then, for this variant on a given data set, I look at a bunch of epigenetic assays over the whole genome: does this variant seem to have a specific effect on this epigenetic feature? The data set represented on the EpiVar browser is connected to a bundle of publications that is just about to be released, but the resource is already available online. This data set is about flu infection: you have data on non-infected versus flu-infected individuals for two population groups, African-American versus European-American, and based on the genotype at the specific allele, you can see the effect of the genotype for this specific variant on different epigenetic features.
Next, I can talk quickly about D-Path, because handling epigenetic and genomic data from human patients, as I said, comes with a lot of considerations to keep in mind, right? It can go all the way to, in Europe, the GDPR, which can affect what you can and cannot do with a patient's data if your patient is of European origin. What D-Path does, basically, based on a couple of questions it asks you, is tell you what your responsibilities are and which precautions you should be taking while handling epigenetic data on your servers, especially if you're building tools and those things. It's available online: as I said, you answer a couple of questions, and it tells you what the accountability is, what security features you should take into account when storing and handling that data, and those kinds of things.

Hector just mentioned this at the end of his lab, but there are a lot of bioinformatics analysis pipelines available in the open, GenPipes being one of them. The reason I'm talking about this is that I get this question once in a while, and we talked about it at the social last night as well. People wonder: in the labs, we've seen all these tools and followed all these steps one after the other; but when you have these large data sets and you analyze them, do you apply these steps one by one, check the output, and feed it into the next one? The answer is no, because it would be way too labor-intensive, and these are tasks that are very automatable. This is what bioinformatics analysis pipelines do. One example is GenPipes, but there are a lot of other options, and they offer epigenetic data analysis pipelines for RNA-seq, ChIP-seq, bisulfite sequencing, and so on. What I find interesting about GenPipes is that if you are on Compute Canada, now the Digital Research Alliance of Canada, and you're using one of their superclusters, all of the tools and the pipelines themselves are already installed there. So it's just a matter of configuring the pipelines for what you want them to do, and you're good to go. This is something you can have a look at if you're using Béluga, Cedar, Graham, Narval, and so on.
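To give a rough feel for driving such a pipeline, here is a sketch of a GenPipes-style ChIP-seq launch. The file names are hypothetical, and the exact flags, ini files, and step ranges vary by GenPipes version, so treat this as an illustration and check the GenPipes documentation:

```bash
# GenPipes pipelines are driven by config files (.ini), a readset file
# describing the FASTQs, and a design file describing the comparisons.
# The pipeline prints the job commands to stdout rather than running
# them directly, so you capture them in a script first.
chipseq.py -c chipseq.base.ini my_cluster.ini \
           -r readsets.tsv \
           -d design.tsv \
           -s 1-16 > chipseq_commands.sh

# On an Alliance cluster this script typically submits jobs to the
# scheduler; review it before executing.
bash chipseq_commands.sh
```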
Last one, maybe, on the quality control of epigenetic data sets. We've talked generously about FastQC already. It's just to say that there's a bunch of tools that allow you to assess the quality of data you find online, because, as I said, it can be of variable quality. I'm thinking, for instance, of two tracks that used to be findable in online resources: if you look at the signal you get on these two ChIP-seq experiments, they seem very different in terms of output. So the message I want to communicate here is that for data you find online, especially something produced by a smaller consortium, it's good to try to run some level of quality check on it. Martin already talked about the FRiP score as a way to assess the quality of a ChIP-seq experiment. And on the IHEC Data Portal, which we'll cover later, there's a tool that gives you a kind of correlation matrix for data sets of interest that you select; we'll cover that in the lab. IHEC has also worked on and published recommendations on the set of quality control metrics you can use to assess the quality of each of your types of epigenetic assays.

I see the time is flying, I have about 10 minutes, so I will go a bit quicker. ChromImpute is a tool developed in Manolis Kellis's group, mostly by Jason Ernst, which allows you to generate imputed annotation tracks for samples using training data. Say you have a sample from a specific blood tissue, maybe B cells, and you have other data sets that are B cells as well, with multiple experiments run on H3K27ac. You can use these to impute, for your sample, a track of what your epigenetic experiment should look like. And there are tools within ChromImpute that allow you to see how different your actual track is from the expected track; I guess it's one way that could be used to assess the quality of your track as well.

So, we've talked a lot about the resources, how to find them, how to obtain the data and analyze it. There are also a lot of tools that exist online to help you with data analysis, quality assessment, and those kinds of things, so in the time we have remaining, we will cover some of these tools. I'll start quickly with the IHEC Data Portal, which is the place to obtain the epigenetic data produced by IHEC. As of October 2020, the latest build that has been produced, and as I said, there's another one coming ahead, you can find over 10,000 human data sets, over 1,000 mouse data sets as well, 450 full reference epigenomes, and data sets from eight consortia. The goal of the portal, as I mentioned earlier, is to organize and make available the publicly accessible portion of the IHEC data sets; the raw data is in controlled-access repositories such as EGA and dbGaP. So it offers tools for data set discovery, visualization, and pre-analysis. We'll explore the data on the portal in the lab.

As I said, the portal includes a data set correlation tool that allows you to establish, within IHEC, how similar a data set of a specific assay is to another data set, maybe of the same assay, for instance. The correlation allows you to potentially identify outliers, and similarities across groups. There's a tool to download the processed data, and we will use it in the lab to obtain some bigBeds and bigWigs for your analyses. The tracks are all served locally on the IHEC Data Portal. The reason for that is that as time passes, consortia come and go, and the existence of servers and their organization changes. So, to offer some level of standardization, at some point the IHEC Data Portal centralized all of the epigenetic tracks from each of the consortia and made them available on the portal. The portal has a concept of permanent sessions, which means, basically, that you can select data sets of interest and share them with collaborators: "this is what I used for my analysis", and you can generate a unique ID and reuse it. There's also a feature for session reports, which allows you to see all of the metadata available for a specific sample.
I want to talk also about the ENCODE portal, which has organized really well its several thousands of experiments across a whole bunch of epigenetic assays: the kind of data we can find in IHEC, like histone ChIP-seq and RNA-seq and so on, plus a whole bunch of other assays such as ATAC-seq and transcription factor binding site ChIP-seq. This portal offers different faceting tools to identify data sets of interest based on the organism, the assay, and the biological sample, and it also includes links to visualization in public browsers.

Just a few words on ChIP-Atlas, which is a resource that scraped a lot of online resources for completely publicly accessible experiments. ChIP-Atlas includes 375,000 such experiments, at the level of ChIP-seq, DNase-seq, ATAC-seq, bisulfite sequencing, and so on. You can use the resource to run some level of exploration: there's a nice little peak browser there that can generate BED files you can use in your IGV browser, for instance. You can also get links to where the data is deposited if you want to use it; in this case, a bunch of the data is deposited at GEO, so you can get links to download it there. The GTEx portal, and we talked about GTEx previously, is the place where you can download the data produced by the GTEx project.

The UCSC Genome Browser has been covered in other labs already, but it is probably the most used online genome browser for visualizing tracks of genomics and epigenomics data. If you want to exchange with collaborators a set of data sets that are relevant to your project, you can use a track hub, which is basically a list of tracks and their locations on the internet. When you visualize data on the UCSC Genome Browser, the tracks you see are not necessarily on the UCSC Genome Browser's servers; they can be located elsewhere, for instance on the IHEC Data Portal. The way to visualize that data on the UCSC Genome Browser is to create a track hub: a list of tracks that exist somewhere on a server, plus some metadata to organize them in the browser. Maybe I'll skip over this, but this is a small track hub example; there's a standard, well explained on the UCSC website, for how to organize your tracks for a UCSC Genome Browser session.
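For reference, a minimal hub is three small text files. This sketch writes them with hypothetical labels and URLs, following the layout documented on the UCSC site:

```bash
# hub.txt: the hub's entry point, which you point the browser at.
cat > hub.txt <<'EOF'
hub myEpigenomeHub
shortLabel My epigenome hub
longLabel Example signal tracks for sharing with collaborators
genomesFile genomes.txt
email someone@example.org
EOF

# genomes.txt: one stanza per assembly the hub covers.
cat > genomes.txt <<'EOF'
genome hg38
trackDb hg38/trackDb.txt
EOF

# hg38/trackDb.txt: one stanza per track, pointing at remote files.
mkdir -p hg38
cat > hg38/trackDb.txt <<'EOF'
track sample1_H3K27ac
bigDataUrl https://example.org/tracks/sample1.H3K27ac.bw
shortLabel sample1 H3K27ac
longLabel Sample 1 H3K27ac signal (bigWig)
type bigWig
EOF
```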
There are obviously other browsers that exist for visualization, such as the WashU Epigenome Browser. I can also talk quickly about a tool called EpiGeEC, because it's what generated the correlation scores in the IHEC Data Portal. EpiGeEC is a pipeline that looks at IHEC data; it also allows you to upload your own data and assess the similarity of your track against tracks of the same assay or the same cell type. So you can compare your data with IHEC data using it, and generate correlation matrices of your data against IHEC data.

Time is running out, so I'll just talk quickly about data sharing, because it's one thing to generate data: in your research project, you'll have raw data, but you'll also have processed data, and at some point you will want to share it with collaborators who will have a look at it and those things. One interesting thing here is an online tool called GenAP, which allows you to generate what are called data hubs. A data hub is just an open space on the web where you can drop tracks such as bigBeds and bigWigs. You can deposit them there and share a link with a collaborator, password-protected or not, that's for you to decide; so you can put your analysis results on it, share reports and those things, and share them with a collaborator. It used to be for Canadian academia only; it's now open to international academia. What you need to do is just register, whether it's with a Google account or one of the other ways to register. Once you're in, you can create a space, you get, I think it's 250 gigabytes or something, and you can start uploading your data there and generating links to share with others.

I'll just say a few words about Galaxy quickly in the time that's remaining. Galaxy is an online portal. I'm talking about this also because I discussed with some of the students at the social yesterday, and on another occasion, and one thing we're sometimes asked is: all of these tools that you're running on the command line, is there a tool online, a user interface, that would allow me to run them? And the answer is yes. There's this platform called Galaxy, which has the vast majority of the tools that we've covered in the past two and a half days in the labs and those things: running FastQC, running alignments, calling peaks, et cetera. This framework, Galaxy, allows you to do that. Basically, you can set up analysis workflows, or you can run tools piece by piece for specific analyses of your data, using a web interface.

I'll just give you an example, which is usegalaxy.org. You need to register, but it's completely free, and it gives you 250 gigabytes of space for your analyses. So what you can do is start with your raw data, the FASTQs that you obtained from your experiments: you can upload them there and run tools, as I said, FastQC, alignment, and so on, and analyze them on that portal. And of course, heavier jobs will be put in a queue, so you might have to wait for a little while, but it is an interesting solution.

One thing I really like with Galaxy, and one thing that we can never emphasize enough, is that when you run a genomic or epigenomic analysis, it's very important to keep track of the steps you've taken to analyze your data. One reason for that is reproducibility. Let's say you run your very nice analysis, you've been working for months on something, and then you reach the result in the end, and you haven't written down anything of what you've done. It's a bit like a lab book, right? If you want to reproduce your result, you might have a really hard time doing so. That's why it's always recommended, whether you run stuff on the command line or anything else, to keep a journal, to keep a log of what you've been doing in terms of analysis. And that's one interesting thing I like about Galaxy: on the right here, there's a history of everything you've done with your data, which tools you ran on it, what you used as input, which parameters you gave to the software, and those things. So you have this log that's there for you if you need to extract it at some point.
I'm giving a FastQC example here, but one thing that's interesting is that once you've reached a series of steps that you like, and you want to reapply it to other samples, you can extract recipes and create workflows out of them, to reapply them to other data sets. So maybe I'll just re-emphasize here what has been said earlier in the workshop: with Galaxy, same as with command-line tools, you have to know what you're doing. You can't just expect the default parameters of Galaxy, or of command-line tools of any sort, to be exactly what you need for the analysis you're doing. You need to understand, for each tool you're using, what the parameters are; therefore it's very important to read the documentation.

Just to say, the lab we will do after the lunch break will not especially cover Galaxy, but I included an extra lab at the end of the markdown document. Once you're finished with the lab, there's a link you'll see at the end that points to an introduction to usegalaxy.org. You can use it to create an account, upload a toy data set, and run some analyses, and that will give you a feel for what Galaxy is doing. Because Galaxy in itself could be the topic of a workshop that lasts two or three days; there are just so many options for what you can do.

So, in conclusion, in this module we've covered examples of downstream analysis tools, how to obtain publicly accessible data sets, and the challenges of using these public data sets. And we've covered some online resources to obtain data and run analyses with a web interface. The lab will cover some of these tools that we've talked about over the last hour.