Welcome back everyone! I'll be presenting the last module, on downstream analysis and integrative tools. The objectives of this afternoon's lecture are, first, to explore some of the downstream analyses that can be done with epigenomic assays, although Misha already went into parts of that yesterday in the assignment of the day. We'll also cover sources of publicly available data sets that anyone can use in their own projects, either for full analyses on public data or to compare against, for example, a data set produced within your own team. And we'll cover online portal tools that can help with data analysis, sometimes offering the same kinds of things you've been doing from the command line; we'll come back to that. So first we'll talk about downstream analysis tools, then we'll move on to public data sets and ways to control the quality of the online data sets you download, then online visualization and analysis tools, and finally we'll give you a short introduction to the Galaxy web-based platform. I realize in advance that there's a lot of material here, so I may go over the more advanced concepts a bit quickly, because at the end of the lecture I'd like to have time for a live demo of the Galaxy web interface, which should help you with the lab part that comes after. So, downstream functional analysis, first point. This is really the focus of the whole workshop, but the motivation for epigenomic integrative analysis goes like this: genes account for only about two percent of the genome, which means about 98 percent of the genome does not encode protein sequences. However, about 76 percent of the genome gets transcribed in some way, and nearly half of the genome is in some way accessible to the transcription machinery.
So by putting into context the information we get from DNA methylation, ChIP-seq for histone modifications and transcription factors, RNA transcription with RNA-seq, chromatin accessibility and so on, we can at least hope to ease our understanding of the underlying biology. What we've seen so far in the workshop are different software tools for primary data analysis of ChIP-seq and methylation data. What we get at the end of that, in the case of ChIP-seq, is a set of peaks, often represented as BED files, which tell you where on the genome the identified peaks are; in the case of methylation, we get methylation levels at different sites, either CpG islands or specific CpG sites, across the genome. Next, we can use this data to run functional analysis tools, by comparing, for example, different regions within one data set, multiple samples of the same group, or samples across different groups, such as cases versus controls. Guillaume talked to you about that this morning: for instance, in the case of methylation we can look at differentially methylated sites across groups of individuals, cases versus controls, and then look either at specific sites or at groups of sites, such as CpG islands, that seem to behave in similar ways. This graph is from the Roadmap flagship paper, if I may call it that, which looked at methylation patterns across different regions of the genome, whether transcription start sites of genes or other features such as introns, and showed, with a high number of samples, that transcription start sites tend to be less methylated and more accessible to the transcription machinery. The next thing I want to talk about is motifs. What are DNA motifs? They're recurring patterns in the DNA that, we think, may have a biological function.
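Before moving on to motifs, a quick note on the methylation levels mentioned a moment ago: at each CpG site, the level is simply the fraction of reads that report the site as methylated. A minimal sketch, with invented coordinates and counts:

```python
# Methylation level at a CpG site = methylated reads / total reads.
# Site coordinates and read counts below are made up for illustration.
site_counts = {
    "chr1:10468": (18, 20),  # (methylated reads, total reads)
    "chr1:10471": (3, 25),
}

levels = {site: meth / total for site, (meth, total) in site_counts.items()}
print(levels)  # {'chr1:10468': 0.9, 'chr1:10471': 0.12}
```

Differential methylation analysis then compares these fractions between groups, either site by site or over regions such as CpG islands.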
So when you see these little patterns recurring across your genome, you can suspect that some transcription machinery may bind there and start transcribing the DNA into RNA. In the example I have here, if I take this specific DNA motif and search for it in my results file, allowing for one base mismatch, I find the motif at several places in the genome: here I have a one-to-one match, while in this other case there is a one-letter mismatch. Because the sites that transcription factors actually bind don't necessarily match every base of the motif exactly, software that allows a bit of mismatch makes me better able to identify the sites that interest me.

So, exploring motifs in ChIP-seq data: using the peak files generated by the pipelines we talked about yesterday, we can try to identify such motifs. We've talked a lot about ChIP-seq in the context of identifying histone tail modifications, which may or may not enable transcription, but ChIP-seq can also be used in the context of transcription factor binding sites. If I'm interested in one specific transcription factor, I can use an antibody that binds it specifically, see where in the genome it binds, and then, using the peaks I get from the ChIP-seq experiment, identify where my motifs might be doing something. There's a software for that called Homer, which we will run in the lab part at the end of the afternoon. It uses a motif discovery algorithm to identify known motifs, based on its own database of known motifs, as well as novel ones, and it generates nice HTML reports with graphics showing the confidence it has in each identified motif. This is the command we'll use this afternoon to run it on the ChIP-seq data set we'll download from the ENCODE website. What Homer takes as parameters is, first, a BED file. I think Misha presented the format yesterday, but briefly: it's a text file with several columns, the first being the chromosome, then the start position of your feature, then the end position. When I say "feature", I mean that each line in the BED file states that there is something starting at this position and ending at that position somewhere on the genome; so a BED file can list the peaks, the regions identified in your experiment. By providing Homer this BED file and a reference genome assembly, it can look for the motifs that interest me. As for the execution steps, quickly: it checks the validity of the BED file, then extracts the sequences. My BED file doesn't contain any DNA bases, just locations, so Homer needs a reference genome to get the sequences described by the BED file. It then calculates the GC/CpG content of these sequences, parses the genomic sequences of the set of sites, and randomly selects background sequences for the motif discovery. That "randomly" is worth noting: from one execution to the next you might get slightly different results, so keep that in mind when you do the lab later, in case your output doesn't look exactly like the screenshot.
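The mismatch-tolerant motif search described earlier can be sketched in a few lines. This is a toy illustration only, not Homer's actual algorithm (real motif finders score positions against a probability matrix rather than counting mismatches):

```python
# Toy sketch: scan a DNA sequence for a motif, allowing a limited number
# of mismatched bases per hit (here, at most one).
def motif_hits(sequence, motif, max_mismatches=1):
    """Return 0-based start positions where `motif` occurs with
    at most `max_mismatches` mismatched bases."""
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits

# One exact hit at 3 and 8, and a one-mismatch hit at 12.
print(motif_hits("ACGTACGTTACGAACG", "TACG"))  # [3, 8, 12]
```

The idea is the same as in the slide: the exact occurrences are found, and so is the site that differs by a single letter.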
So that's it: Homer checks for known and novel motifs and gives you a report that looks basically like this. For this motif, we know the first letter is a T for sure, and the second base is either a G or an A, most of the time a G; the size of the letters tells you how often you will find each specific base at that position of the motif. The next thing is looking for gene ontology enrichment. Yesterday Misha presented DAVID, and I'll talk to you about GREAT this afternoon. Gene ontologies are sets of structured, controlled vocabularies for all kinds of things such as biological processes, phenotypes, diseases and so on. I'll just mention quickly why GREAT's authors claim it's better than other tools, although I don't know to what extent this still holds nowadays: GREAT doesn't just look at the region immediately next to the feature you're describing, it looks quite a bit further downstream and a bit upstream, to widen its horizon of potential gene ontology terms describing your feature. I'll skip over this slide, which describes two methods it can use to assign regions to genes. The input to GREAT is a BED file with your regions of interest, and the output is the matching gene ontology terms, such as molecular functions, biological processes, phenotypes, diseases and so on. You get one table per category, ranked by matching score, the top one being the most potentially interesting; of course, look at the scores before concluding that a given gene ontology term really describes your data set.

This is just an example, for H3K27ac from a bone marrow sample, where we see (it's small) a lot of terms matching immune response, the immune system and those kinds of things, which potentially makes sense, so it seems to work well in this situation. Then I'm giving a couple more examples of integrative analysis from the Roadmap paper: in this case, genome-wide identified SNPs are compared to ChIP-seq data across a wide range of diseases and cell types, to see whether there is a correlation between specific SNPs and ChIP-seq outcomes. And another example here for methylation data, showing how the methylation level changes depending on how far you are from the transcription start site. My next section is on working with public data sets. There are a lot of online sources where you can download data sets that are publicly accessible to anyone, and start your own studies on a wide array of cell types, diseases, and other conditions or phenotypes. These resources are free and you can do what you want with them, but as others have mentioned before me, you should usually assess the quality of the files, because not all publicly accessible data sets are at the expected level of quality. One of the first large-scale projects, if I may say, was the Roadmap project, which has now been over for a couple of years, but it provides lots of data sets, many samples across different human cell and tissue types, with whole-genome bisulfite sequencing and other types of assays, on the consortium website; in your binders and in the online slides I've put the URL under the picture, where you can access the data sets of interest. Then there's ENCODE, which is
now starting its fourth phase of data production, and they have a website where you can download all types of data sets as well. There's the GTEx project, whose specific aim is to find relationships between SNPs identified in the genome and gene expression; the URL is here. And there's IHEC, which is right now the main epigenomics effort worldwide. It's basically a consortium of consortia, with many groups producing data on different kinds of tissues, and there's also a strong disease component in the data sets within IHEC. In Canada we have the CEEHRC effort producing this type of data; in the US we used to have Roadmap and there's ENCODE; in Europe we have the BLUEPRINT project, which just finished last year and produced data on mostly blood samples, and DEEP in Germany; and we have the Japanese and Korean consortia.

So what is IHEC? Its goal is to provide standardized reference epigenomes for different kinds of normal and disease tissues. One of the main ideas behind this is to have committees working on establishing standards on all aspects of data production: one for the assays used to create these data sets and for analyzing them bioinformatically afterward, a work group for data distribution and metadata organization, a work group on ethics, a new one starting on integrative analysis of these data sets, and so on. When I talk about a reference epigenome produced by IHEC: to have a full reference epigenome you need a whole-genome bisulfite sequencing data set, you need RNA-seq for the transcriptome of your sample, and you need ChIP-seq assays run on six different histone tail modifications. So one data set usually means one sample (in some cases we have pools of samples), but in all cases, all of these assays are run on that same sample, which gives you a more integrative view that you can use to analyze one specific sample and answer your questions. The data integration and sharing strategy within IHEC basically looks like this. All of these different consortia, members of IHEC, produce the data. First they produce the raw data, the FASTQ files that were explored in depth yesterday: the raw sequences for whole-genome bisulfite sequencing, ChIP-seq, RNA-seq and so on. This raw data is not directly open to the public because of, among other things, ethical considerations; not many participants want their DNA just out in the open. It is still possible for anyone to access this data: you need to prepare an official request describing your project and what you're going to do with the data, you need an academic institution backing you on the research, and you have to agree to a whole list of terms about what you can and cannot do with the data. Once you've applied through this process, you get access to the raw data and can download it to start your analysis. Of course that's the most interesting data, since you can analyze raw data in any way you want, but as a kind of shortcut, each of these groups has its own standardized processing pipelines. Right now most IHEC members have their own pipelines, which differ in various ways, and we're gradually working on standardizing them so the results become more and more comparable. The point is that the processed data output from these pipelines is openly available, and the
site where you can download it is the IHEC Data Portal. You can navigate it — we'll cover that later in the workshop — but it's the place where you can search for data sets of interest and download the peaks, or the methylation levels at different sites on the genome, and so on. So that brings me to the IHEC Data Portal. The goal of the portal is to make the public data sets produced by the different IHEC members available and discoverable. Again, raw data is in controlled-access repositories, but the public data, which you can visualize and start analyzing right away, is available on the IHEC Data Portal. As of November 2016 we currently have about 10,000 human data sets, about 230 in mouse and primates, and about 294 reference epigenomes (that is, with all of the assays I talked about earlier), and there are currently eight consortia with data sets available on the site so far. The portal offers tools for discovery, visualization and pre-analysis; we'll come back to that. I explained this already, but the difference between controlled-access data and public data: controlled-access data is the raw data from the sequencer, but it also includes things such as phenotypes and clinical or otherwise sensitive information that patients wouldn't want out in the open, archived in controlled-access repositories. The public data, on the other hand, you can just start visualizing in things such as the UCSC Genome Browser, Ensembl or IGV, as you'll see later in the workshop; in the lab part we will download some of these public data sets and still manage to do some interesting things with them. This also includes metadata that doesn't make participants directly identifiable: information like donor sex, maybe an age range, or a disease if it's not specific enough to be directly identifying, as with rare diseases — and it's all openly available. Next: assessing

the quality of online resources. As we've mentioned many times so far, data sets come at different levels of quality, so the quality of a data set you download should be assessed. There are different ways to do that. The one we've touched on most in this workshop is FastQC: if you download the FASTQs from a controlled-access repository, for instance, you can run FastQC on them and get an overall idea of the quality of the data sets. There are other things you can do. I'm showing at the bottom here an example of the same data producer and the same type of ChIP-seq assay but two different samples, where one has really clear peaks while the other has a lot of background noise, which makes you realize that something may be wrong there. There are other, perhaps simplistic, checks, like comparing the signal-to-noise ratio in your ChIP-seq data, for example the percentage of reads in your ChIP-seq experiment that fall within the identified peaks, and those kinds of things. There are also online resources that can help a bit with that. For example, on the data portal there's a Pearson correlation test that's been run over all of the tracks archived by the portal, and it can generate matrices with heat maps that cluster data sets by similarity. Also, in the IHEC assay standards working group, a set of QC metrics is gradually getting defined, and people are gradually starting to generate these QC metrics for the data sets they release; these metrics will then be available within the metadata, so people who download the data sets will be able to get an overall idea of the quality of the data
sets that they use. I just wanted to talk quickly about ChromImpute, because that's a much more involved way to assess the quality of a data set, but an existing one nonetheless. ChromImpute basically allows you to impute missing signal tracks. Say you have a whole bunch of samples on which you've run ChIP-seq experiments, and for some sample — I'm just inventing a story — you didn't have enough material to run the sixth histone mark ChIP-seq assay, so you're missing one data set for that sample. If you have other samples in similar conditions, say the same cell type, for which the other assays have been done, you can use ChromImpute to impute the missing signal track. That's interesting in itself, but it can also be interesting from a QC perspective: if you have many samples that are supposed to be similar, you can impute a track for a data set even though you have the original, then compare the original to the imputed track; if they're not similar at all, maybe something is wrong with the track you're looking at. This tool was created, I think, specifically for the Roadmap project and is now available, but it's computationally intensive and takes a while to run. Next: online visualization and analysis tools. I'll cover many online resources here. First, data discovery and download tools such as the IHEC Data Portal, the ENCODE portal and GTEx; then data visualization with the UCSC Genome Browser, Ensembl and the WashU browser; and then I'll talk about Galaxy. Are there any questions so far? [Audience question.] You mean, for the online tools
they have their own databases in the back, and you're asking what was used to create them. It's a good question — especially in the case of GREAT, I think it's built on the full set of gene ontology annotations, but I'm not sure; we can look into that after. OK, so going back to the IHEC Data Portal: this is basically what you get when you load it for the first time. You have this interface where you can select, for the whole of IHEC and for a desired reference genome, all of the data sets which are available, by consortium (the consortia being members of IHEC, one more time), by tissue category, or by type of assay. If you're looking specifically for ChIP-seq data sets, for instance, you might want to select from the histone pie chart. These charts are selectable: you click them to say what you want to see, then you click on your selection and it brings you to the grid, which looks like this. Basically it's a bi-dimensional grid presenting all of the available data sets within IHEC, per cell type and per assay, with filtering criteria on the side to help you decide what you want to see at any given time. There are also a lot of external tools linked to the IHEC portal, with more coming. Currently, for example, you can visualize tracks in the UCSC Genome Browser: if you want to visualize all of these data sets here, you can just click on them, click on "visualize in genome browser", and it will open the UCSC Genome Browser and let you browse through all of these tracks at once. There's also the correlation tool of the IHEC portal that I was talking about before: you select the data sets you want, and it gives you a clustering of your data, so you can see how similar the data sets you selected are to each other. So for example,
if you choose a group of samples of the same cell type, in similar conditions, and you look at different types of ChIP-seq histone marks, you should expect repressor marks to cluster together, enhancer and activator marks to cluster together, and so on; so this is one way to assess the quality of the data. There's also a way, on the side here, to navigate the available metadata for all of these samples: you can see the distribution of data sets by a whole lot of metadata terms; we'll see that later. This is an online tool only, so it runs from a browser, and at the moment it's only for IHEC data, but if you want to compare your own data next to the IHEC data, there's a feature coming for that in the IHEC Data Portal; hopefully by the end of the year this is something you'll be able to do. I'm giving an example here for Roadmap data, where I can see that, for the same samples, my enhancer histone marks cluster well with my other activator marks, and my repressor marks cluster well with my other repressors. You can also download tracks individually with the download tool; there's a button at the bottom of the grid, so you select the tracks that interest you, click download, and it gives you a directory listing where you can download the tracks of interest. The tracks are now hosted directly on the portal server, which protects against things like disappearing tracks and data sets over time: consortia come and go, and when a project is finished you don't always know what will happen to the data, so now the data is permanently archived on the IHEC Data Portal. It can also generate sessions: from the grid, you can create a session out of the things you selected, which generates a report of the samples with all of the metadata
that's available for each of the samples, so it's something that can be used for sharing, for citation purposes and so on. And finally there's a web API which gives you, in JSON format, documents with all of the available metadata for your samples. If you're familiar with the JSON format, it's made as a kind of machine-readable format, but it's also quite human-readable: you have a hierarchical tree of keys and values, so I can see that for this specific experiment (analysis attributes are removed here) I have the tracks for this data set, and then I have the experimental attributes, the sample attributes, the donor attributes and so on. Moving to the ENCODE portal: not that long ago, ENCODE revamped their website, and now they have this neat grid with a lot of filtering features based on all of the available metadata on the ENCODE portal. One thing to mention: ENCODE is also a member of IHEC, and some of ENCODE's data is available on the IHEC portal. However, one of the main challenges of IHEC itself is the fact that it's many different groups producing data in different ways and trying to harmonize things as much as possible, but we're still not at a point where metadata has been defined in the exact same way across all sites. Sometimes it's difficult to compare samples because you have a lot of missing information, little holes here and there: you're looking for samples with a specific donor sex and a specific disease, and things have not been entered in the exact same way, so a lot of manual curation is needed to make things fit together. There's a lot of work within IHEC right now to harmonize these things, and it's already much better than it was even a year ago, and it's gradually improving. But
the data from ENCODE, of course, is produced by a single data coordination center, so the metadata is really neatly organized, and there are a lot of filtering criteria on the portal to decide what you see and what you don't. You can select the data sets that interest you in the grid, and you can also visualize those tracks in the UCSC Genome Browser or the Ensembl browser. But again, this one is strictly for ENCODE data. There's also the GTEx portal that I talked about before; I like it because it has a lot of neat web visualization tools which give you all kinds of information on the available data sets, so I invite you to have a look if you're interested. And finally there's DeepBlue, made by the DEEP consortium in Germany, a member of the IHEC consortium. As opposed to the IHEC Data Portal, where the goal is to download whole data sets, DeepBlue offers you different tools to select the regions of the data sets that you want, so that you don't have to download the whole thing locally to your computer or server. It offers features such as filtering on metadata, region attributes, DNA sequence motifs and so on, and it allows binning, pattern matching, grouping operations and the like. If you're interested, I invite you to have a look. They even have an R interface which can connect to the portal, so if you prefer to access the data using R, there's a full-fledged API — a paper was published, I think a couple of months ago, describing how to use it — and you can bring in the data much more easily, without having to download whole data sets, just the regions and the subset of samples that interest you, and continue your analysis in R. OK, others have talked about it before, but for visualization there's the
UCSC Genome Browser, which displays all kinds of tracks. A lot of the public genomic data sets are expressed in formats that we call bigBed and bigWig. We've talked a lot about BED files so far; a bigBed is just a binary, indexed version of a BED file. If you open it in a text editor you won't be able to read it directly, but it's made to be opened much faster by online resources such as the UCSC Genome Browser: if a server hosts a specific track in bigBed format and I view it in the UCSC Genome Browser, the browser doesn't need to download the whole track in order to display it; it just looks at the region I'm browsing right now and downloads only the amount of information needed to show the tracks. That's what makes it possible to visualize a whole lot of tracks at once in the UCSC Genome Browser; otherwise it would take forever just to transfer the information to the browser. In case you're curious — this is maybe a bit more technical — the way data gets displayed in the UCSC Genome Browser works like this. Say a group wants to publish its data: you've prepared BED files and wiggle files, and created bigBeds and bigWigs from them. How can you have them integrated into the UCSC Genome Browser, for your paper publications or to share with collaborators? The way is UCSC Genome Browser track hubs. These are text documents that you create, and they tell the browser how to display your tracks: how large the track should be, what kind of metadata you have about the sample, and those things. I'm giving a small example here: for each track you define a style, and you specify
the URL where the track file is located, you give it a label, and so on. By defining these documents, you can just share your link with collaborators, who can load it in the UCSC Genome Browser and other tools and start visualizing. Another data visualization tool is the Ensembl browser, which has its own set of unique features; it has a lot in common with the UCSC Genome Browser, but this one is developed by the EBI in Europe. It also supports UCSC Genome Browser track hubs, so if you generate a track hub, or get one from the IHEC Data Portal, the ENCODE portal and so on, you can plug it into the Ensembl browser and use it there as well. UCSC Genome Browser track hubs are also supported by the WashU Epigenome Browser, which is a really cool website too, I think, offering all kinds of features very specifically tailored to epigenomics data sets; recently it started supporting bigBed files as well, so it's gradually becoming able to display everything that's in a UCSC Genome Browser track hub. The URL for this one is available at the bottom, too. OK, the last thing I want to talk about (I'll have a more detailed section after) is data analysis using Galaxy. Galaxy is a web-based framework which offers a user-friendly interface for a lot of the bioinformatics analyses that you would traditionally do from the command line. So far we've been using all kinds of command-line tools, typing things, waiting for results, and that works fine; but for people who are maybe a bit less keen on using the command line, or less used to that kind of interface, Galaxy offers all kinds of tools with clickable parameters you can select, and it makes things very easy to use. That's why their motto is "data-intensive biology for everyone". How
many people are familiar with Galaxy already? Maybe half of you, okay; well, hopefully you can still get something out of the last part of the presentation. Another thing I like a lot is that it allows for reproducible results: all of the steps of the bioinformatics analysis that you're doing are recorded in a history. When you're on the command line and you run an already-prepared pipeline, you know that your samples will go through all of the steps using the same parameters. But often, when you're working on your own on two or three or a bunch of samples, you're doing things on the side: you test this, then you tweak parameters and test something else, and in the end you get results which are interesting, and then you try to figure out what you did at each of the steps to arrive at this result. That's kind of problematic, because often you don't have a trace of everything, unless you were very meticulous and wrote down for each step "this file was generated using these parameters". If you didn't do that, sometimes you come to the end, you try to reproduce your results, and it takes a while to figure out what you did to reach that point. That's one good thing about Galaxy: it remembers everything that you did, so once you get your result you can just go into the history and see what you did, or you can even extract a workflow out of it and then reapply the same workflow to as many samples as you want. I think that's a notable advantage.
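On that reproducibility point: if you stay on the command line, a minimal way to keep the kind of trace Galaxy keeps for you is to record each command before running it. This is just a sketch, not part of any tool we covered; the `run` helper and the file names are mine.

```shell
#!/bin/sh
# DIY provenance sketch: note each command in a history file,
# then actually execute it. Galaxy's history does this for you.
LOG=analysis_history.log
run() {
    echo "$(date '+%F %T')  $*" >> "$LOG"   # record the step...
    "$@"                                    # ...then run it
}

# toy example: sort a small BED file of peaks
printf 'chr1\t300\t400\nchr1\t100\t200\n' > peaks.bed
run sort -k1,1 -k2,2n peaks.bed -o peaks.sorted.bed
run wc -l peaks.sorted.bed
```

Later you can open `analysis_history.log` and see every step with its parameters, which is roughly what Galaxy's history (or an extracted workflow) gives you automatically.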
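And to make the track hub text documents I mentioned earlier a bit more concrete, here is a minimal sketch of what those files can look like. All the names, labels, and URLs below are made up for illustration; see the UCSC track hub help pages for the full set of settings.

```shell
# Sketch of a minimal UCSC track hub: three small text documents.
# Track names and URLs are placeholders for illustration only.
mkdir -p myhub/hg38

cat > myhub/hub.txt <<'EOF'
hub myEpigenomeHub
shortLabel My epigenome hub
longLabel Example hub with one ChIP-seq peak track
genomesFile genomes.txt
email user@example.org
EOF

cat > myhub/genomes.txt <<'EOF'
genome hg38
trackDb hg38/trackDb.txt
EOF

cat > myhub/hg38/trackDb.txt <<'EOF'
track H3K4me3_peaks
bigDataUrl http://example.org/data/H3K4me3.bb
shortLabel H3K4me3 peaks
longLabel H3K4me3 ChIP-seq peaks, example sample
type bigBed
visibility dense
EOF
```

You then host the `myhub/` directory on a web server and paste the URL of `hub.txt` into the UCSC Genome Browser; the same hub URL can be loaded in the Ensembl or WashU browsers, as I said.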
So GenAP is a Canadian computing platform for life-science researchers. It basically leverages the Compute Canada HPCs and the CANARIE network to offer all kinds of tools tailored for bioinformatics, so if you have access to the Compute Canada network you can start using GenAP. Users can create their own private, fully configurable Galaxies; they can decide who they share a Galaxy with and invite collaborators, whether Canadian or international, and all these people can start using GenAP. As I said, this includes Galaxy, so you can run Galaxy jobs using your Compute Canada allocation rather than just sitting in the queue of the public Galaxy that's available online. It's free for Canadian academia, and all you need is a Compute Canada account. There are also the GenAP pipelines, which offer a set of pre-constructed bioinformatics analysis pipelines: there's one for RNA-seq, one for RNA-seq de novo, one for ChIP-seq, and in a couple of weeks we should have the methylation pipeline available as well. All the requirements for these pipelines are already installed through CVMFS on the Compute Canada HPCs, so if you have access to Compute Canada you're good to go and can start using these pipelines right away. There's also a Bitbucket site for them, so even if you don't have access to Compute Canada resources and you want to run this on Amazon or on other kinds of resources, you can still download the MUGQIC pipelines' source code; the only requirement is that you'll have to install the different tools and reference libraries on the server that you have, but in the end the result will be the same. As I was saying, there's the GenAP Galaxy, which uses your Compute Canada allocation to launch jobs. I forgot to mention, but there's also a main Galaxy website, called usegalaxy.org, and that's
completely open to anyone, so anyone can create an account. I forget how many gigabytes of space you get, but you can upload your tracks and launch analyses. Of course, this means there are a lot of people using it, so the waiting lines are sometimes a bit too long and you're limited in the amount of data you can upload, but it's free and it works. So usegalaxy.org is the main one; you can also search online (I forget the website) for a list of all the free Galaxies which are open and that you can start using. I think there's one in Florida and a whole bunch of others, so if you're not lucky with usegalaxy.org, if it's too slow at some point, you can always try another one. To answer the question: yes, you have two choices. Either you do it through the Galaxy interface, or you can upload it to your Compute Canada space; there's a specific folder for that, so you can just copy the data with rsync or scp into that folder, and then there's a tool that was specifically implemented in the GenAP Galaxy to take data from that directory and bring it into Galaxy. Yeah, exactly: you can save a lot of time that way if you have a lot of samples to process. I'll just go quickly over how to get started with GenAP Galaxy, if you're interested. First, yes, that's just what I was about to say: if you're already a member of Compute Canada, you can use your Compute Canada login and password to log into the portal, so you don't need to create a new account. If you're interested and you don't yet have a Compute Canada account, the first thing to do is to apply to Compute Canada to open an account. The next thing, once you get your Compute Canada account, is to apply to a regional consortium; that's what I was presenting yesterday morning: there's Calcul Québec, there's West
Grid, and there are about six of them, so you have to pick one, and usually you pick the one closest to your location. But specifically in this case, if you want to use GenAP Galaxy right now, you should apply to Calcul Québec, because we have everything located there. Gradually we're trying to roll it out to other computing sites, but right now, if you want to use GenAP Galaxy, you should apply to Calcul Québec; once you get access to that, you will have the space and the allocation to start working with Galaxy and the other services. So that's the GenAP portal: you enter your Compute Canada login and password, and once you're in you'll get a couple of options, but basically what you'll be interested in is accessing your Galaxy instance. GenAP has this concept of a project that you need to create; that's the place where you're going to put your data. Actually, a default project gets automatically assigned to you, so you don't even need to do that: all you need to do is instantiate a Galaxy instance for this project. You create it, you wait about 10 to maybe 30 minutes, and then the Galaxy is good to go. More details on Galaxy now, unless there are other questions so far. So, to go back to Galaxy more specifically: as I was saying, a lot of bioinformatics tools, including a lot of the tools that we've covered in this workshop so far, are available within Galaxy. When you open Galaxy you have this whole list of categories (these are just the categories; each category holds a lot of tools), and there's all kinds of stuff in there, from FastQC and ChIP-seq peak callers down to simpler tasks such as just unzipping a file or converting a bigBed file to a BED file; there are all kinds of operations you can do from the interface. All compute jobs are launched from a
web interface, and an input interface looks a bit like this. All the tools in Galaxy are tools that are actually executed from the command line; Galaxy just takes care of submitting the job for you. Normally all of the parameters for a tool are available in this web interface: there are very simple ones, and there are others with a lot of parameters, so you can customize what you want. The interface looks like this. I think I'll have enough time to do the live demo, but just to show you: there's a toolbar here listing all of the tools, and then you have a search feature to quickly find the tool you're likely to want to use. In this screenshot it's displaying the output of a FastQC run that's been applied to a FASTQ file; so instead of having to run FastQC on a server and then download the HTML locally to visualize it, you can just get it right away by clicking on your data set, the FastQC output. On the side here you have your history bar; that's what I was talking about: each step that you do gets recorded into Galaxy, and each tool that I run will create one or more of these history rows, if I may say, and then you can visualize the results of a specific job that you launched by clicking the eye icon. There's also this nice feature that you can design pipelines, or workflows, of steps, where the first step is just to select the data sets that interest you, and then all of the other steps are executed sequentially without you having to provide anything, because you already said "at this step, use the output from the previous step with these specific parameters". So you go from inputting the data all the way to the output data at the end of your workflow. Again, there's a friendly
GUI for that, where you can modify things by hand in your workflow if you want. Maybe one last detail: it used to be that Galaxy was not really tailored for a medium amount of samples, if I may say, because every time you wanted to launch a job you needed to select one sample, choose your tool, and execute, then select another sample and execute, and so on. But with the newer Galaxy interface you can run the exact same job on a large number of samples just by selecting them in the interface, so that eases things a bit. So yeah, conclusion: in this module we've covered some types of downstream analysis with epigenomic data, how to obtain publicly accessible data sets for your analyses, how to assess the quality of the data that you downloaded (although that was also covered by other modules before), how to visualize epigenomic data sets using online tools, and some ways to run some types of analysis from a web interface. It's not completely over: we'll do the promised live demo after this, and after that the lab will provide you an extra introduction to some of the tools that we've talked about. We'll navigate the IHEC data portal to download some data sets, we'll execute HOMER and GREAT to try to identify some interesting results, and there will be a Galaxy part as well, although I realize this lab is just one hour long, so I pretty much expect that we won't have time to do everything; but the Galaxy part is, I think, well detailed, so if we don't have time to reach it in the lab, you can do it on your side after the workshop if you want. As I wrote on the board, I'll try to make sure that the resources for this workshop, the server and the Galaxy server, stay available for a bit of time after the workshop. Of course, it's difficult to say that we'll keep them indefinitely, but at least until Wednesday of next week
things should be fine, so I would encourage you to go to your account and download what you produced during the workshop locally, if that's what you want to do. And if we don't have time to reach the Galaxy part, then after the workshop, maybe on your side, if you're interested, this weekend or Monday or Tuesday, you can connect to the Galaxy at the URL that's provided in the lab and try that part on your own. Okay, so for people who can't get access to Compute Canada resources, there's the main Galaxy, which is available at usegalaxy.org and which, again, includes most of the tools covered in this workshop over the last two days. And if you're in Canadian academia, I highly recommend that you get a Compute Canada and a GenAP account.
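If you do get a GenAP account and want to try the data-upload route I mentioned, copying files into your Compute Canada space with scp or rsync instead of going through the Galaxy web upload, the commands look roughly like this. The user name, host, and destination folder here are placeholders (the exact upload directory depends on your GenAP project), so this sketch only echoes the commands instead of running them against a real cluster.

```shell
#!/bin/sh
# Sketch of copying data to cluster space for pickup by GenAP Galaxy.
# USER, HOST, and DEST are placeholders -- substitute your own values.
USER=myuser
HOST=cluster.example.ca
DEST=/project/myproject/galaxy_uploads   # hypothetical upload folder

# scp: one-off copy of a single file
echo scp sample1.fastq.gz "$USER@$HOST:$DEST/"

# rsync: resumable and only transfers what changed -- handy for many samples
echo rsync -av --progress fastq/ "$USER@$HOST:$DEST/fastq/"
```

Once the files are in that folder, the GenAP Galaxy tool I mentioned can pull them from the directory into your Galaxy history.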