My name is David. I'm a developer for Compute Canada, and I'm also a bioinformatician for Compute Canada. One of the things I develop at Compute Canada is Galaxy, which was ported to Compute Canada very recently. So today we're going to learn a little bit about the Galaxy interface and why someone should be motivated to use Galaxy. This is the outline of the talk. We're going to talk about reproducible science. I'm going to give some examples of workflows we use in Galaxy, and also the different types of Galaxy server. Then GenAP, which is a Canadian project that includes Galaxy, and I will talk quite extensively about that. Then putting data into Galaxy, analyzing data inside of Galaxy, and processing new data. The basic motivation for using Galaxy is this: if you have to do something once, you probably don't need any sort of scripting or programming. You just go and do your analysis, like you did this morning. But if you have to do it several times, then you'd better write a script; you'd better automate that task. The problem with scripts, and you get very confused when you write a lot of them, is that a script normally doesn't have a version. When people write these little things to do an alignment or a mapping, they tend to do their own little thing on their own computer, and they don't care about sharing, or about what's going to happen when someone sees that little program down the line. So very often, and very soon, a script becomes something different from what was first intended, and you never know which version it is, who wrote it, or what it was intended for. Galaxy tries to prevent these things from happening, and it tries to enforce reproducibility in your data sets and in your experiments.
Galaxy is also based on an open source idea, so you have the whole code of Galaxy if you ever want to install Galaxy yourselves. It has a very large community of developers. It's very well supported. It's flexible, it's expandable, it's scalable, and it's also cloud-aware: you can put Galaxy inside of a cloud, you can put Galaxy on your own computer, so you can move Galaxy around. It's user-friendly. User-friendly is a relative concept, so some people may find Galaxy user-friendly and some people find the command line more user-friendly, but Galaxy is intended to make life easier for people who do not program. The basic idea behind Galaxy actually came from this paper about ten rules of reproducible science, and two of the authors of this paper are actually authors of Galaxy: Anton Nekrutenko and James Taylor. The basic rules are these. For any research that you do, you should keep track of how every result was produced. You should also avoid manual manipulation of your data: if you produce something and then you open your data and make a little modification by hand, that's going to be very hard for someone to reproduce, because you may not be able to put that in a manuscript or document it. You have to archive the exact versions of every piece of software that you use. That is very often a big problem. I'm pretty sure the majority of you here don't know the version of the software that you use, like the BWA that you use, or Picard, or whatever. So down the line, in two years, when you try to reproduce that, you're going to use the most up-to-date version, and the most up-to-date version may not give the same result as the version you're using right now.
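As an aside on the point about software versions, here is a minimal Python sketch of writing down which version of each tool was used at analysis time. The function names capture_version and record_versions are made up for illustration, and only the Python interpreter itself is queried, because the flag that prints a version differs from one tool to another and would have to be checked for each real tool.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone


def capture_version(command):
    """Run a version command and return the first line it prints."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        # The executable is not on this machine.
        return "not installed"
    # Some tools print their version on stderr rather than stdout.
    output = (result.stdout + result.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


def record_versions(tools):
    """Build a small provenance record: tool name -> reported version."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "versions": {name: capture_version(cmd) for name, cmd in tools.items()},
    }


# Hypothetical tool list: only the Python interpreter is assumed to exist.
tools = {"python": [sys.executable, "--version"]}
record = record_versions(tools)
print(json.dumps(record, indent=2))
```

Saving a record like this next to the results is one way to know, two years later, which versions produced them.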
You should also version control all custom scripts, so if you have something very custom, that belongs just to you, it also has to be version controlled. And you should record all the intermediate results that you have. You cannot just record your FASTQ and then show your WIG file at the end of your analysis and say, well, that's my result. People need to know what happened in between if they ever want to reproduce your result. You also have to record any random seed that you use when there is randomness in your research. Of course, you have to store the raw data behind your plots. This is very important. You have to generate hierarchical output, so your output should be incremental in terms of complexity. That's why you produce a SAM, then you sort your SAM, then you mark duplicates, et cetera. It has to be incremental in complexity so that you can always revisit an intermediate state of your result. You should also, of course, put textual statements behind your results. Like Guillaume showed, that big plot of translocations means nothing, it's just a beautiful picture, if you don't have textual statements. And you should provide public access to your scripts, to everything that you ran, and to your results. Galaxy was actually conceived based on those ten rules. So this is the main Galaxy interface. On the left side we have a panel with these little menus; when you click one, it expands and gives you a set of tools. What Galaxy actually does is create a visual interface for any bioinformatics tool, actually any sort of tool. So any tool that you have used today can be put inside of Galaxy with a graphical interface, and you never have to go to the command line. That's the original Galaxy paper, from 2010, although Galaxy is older than that. And that's the interface of Galaxy on the Amazon cloud, so you can go to Amazon and deploy your own Galaxy.
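Two of the rules listed earlier, recording random seeds and keeping intermediate results, can be illustrated with a short Python sketch. It is only an illustration under assumptions: the read names are made up, and subsample and checksum are hypothetical helper names, not part of Galaxy.

```python
import hashlib
import random


def subsample(reads, fraction, seed):
    """Return a random subset of reads that is reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, so the seed is explicit
    count = int(len(reads) * fraction)
    return rng.sample(reads, count)


def checksum(items):
    """SHA-256 digest of a list of strings, to fingerprint an intermediate result."""
    digest = hashlib.sha256()
    for item in items:
        digest.update(item.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


reads = [f"read_{i}" for i in range(100)]  # made-up read names
seed = 42

first = subsample(reads, 0.1, seed)
second = subsample(reads, 0.1, seed)

# Log the seed and a fingerprint of the intermediate result together.
log = {"step": "subsample", "seed": seed, "sha256": checksum(first)}
print(first == second)
print(log)
```

Because the seed is stored with the fingerprint of the output, someone rerunning the step can check that they obtained the same intermediate result.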
Galaxy on the cloud was written basically by Enis Afgan. So why should you use Galaxy? Galaxy integrates the input and the data sources, so from one place you can fetch your data. You never have to leave the Galaxy interface to fetch data: Galaxy has several ways of fetching data, or you can upload data from your computer. Galaxy allows users to use many tools that they don't need to install. The main Galaxy has more than 400 tools installed right now, and they are maintained and updated very frequently, so this takes a big burden off the user. And Galaxy allows you to run workflows. That is a very, very interesting concept in Galaxy: if you go through a set of steps, you can record those steps in a workflow, and the next time you just tell Galaxy, these are my input files, run the same set of steps I ran before. Galaxy has embraced next-gen sequencing fully, so basically all the tools that you have used today are already ported to Galaxy. And Galaxy works on the cloud. So if you don't have access to a local Galaxy, or if, for reasons that I'm going to show, you don't want to use the main Galaxy, you have the option of using Galaxy on the cloud. Galaxy believes, like I said, in reproducibility. You have the concept of a history in Galaxy, where every action that you take inside of Galaxy is recorded: not only the programs that you run, but the versions of the programs, the versions of the genomes that you upload, the size of your data sets, and the provenance of each data set. Everything is recorded. Galaxy also has the concept of sharing. Once you do your analysis, you can select an individual Galaxy user and share that data with that person. Or you can publish your data: you can make your data and your results public through the Galaxy interface.
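The idea of recording a set of steps once and replaying it on new input files can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Galaxy is implemented: the Workflow class and the trim and to_upper steps are hypothetical stand-ins that work on strings instead of sequencing files.

```python
class Workflow:
    """A recorded, ordered list of named steps that can be replayed."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, label, func):
        """Record one step and return the workflow so calls can be chained."""
        self.steps.append((label, func))
        return self

    def run(self, data):
        """Apply each step in order and keep every intermediate result."""
        history = [("input", data)]
        for label, func in self.steps:
            data = func(data)
            history.append((label, data))
        return history


# Stand-in steps; real steps would call analysis tools.
def trim(read):
    return read.rstrip("N")


def to_upper(read):
    return read.upper()


workflow = Workflow("toy-qc").add_step("trim", trim).add_step("upper", to_upper)

# The same recorded steps are replayed on each new input.
for sample in ["acgtNN", "ttgaN"]:
    print(workflow.run(sample))
```

Keeping the whole list of intermediate results, rather than only the final one, mirrors the history idea described above.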
Galaxy was made for biologists, and it was made thinking about people who are not comfortable with programming. And it has a very, very large developer community. I'm part of this community, and we often develop new tools for Galaxy and new ways of importing data and facilitating data analysis. But Galaxy comes in very different flavors. There is no single Galaxy. Galaxy is an extensible platform that people can install, and they do, in several places. These are the main resources that you have for Galaxy. First, you have galaxyproject.org, the main web page, which explains everything about the project itself, how to get Galaxy and how to use Galaxy. That's the starting point if you want to learn about Galaxy. The second link is usegalaxy.org, which is the main Galaxy server. This server is public, so it's open to the world: anyone can go there, create an account, upload data and do an analysis. It's free. You have Get Galaxy, for people who want to try to install Galaxy. I do not recommend doing that. It's actually quite hard to install Galaxy, so if you just want to use it, prefer a public server. And you have Galaxy on the cloud, as the name already says. We have a project called GenAP; it's a project that was developed here in Canada by me and a group at McGill, Guillaume Bourque's. And you have several other public Galaxies around the world that are more specialized. Some people package Galaxy just for ChIP-seq, some people for RNA-seq, and there are even people who have a Galaxy for astrophysics, because Galaxy is basically a wrapper around the command line. So which Galaxy should you use? That depends a lot on your processing power requirements and your security needs. Basically, if you have a moderate-size data set, then you can use the main Galaxy. You can use a local Galaxy, a Galaxy that you install on your own machine.
You can use the cloud, or you can use GenAP. If your computational requirements are moderate, you can still use all of them. If you want to share your data, then the main Galaxy is still a good one. Local is very limited, because people have to have access to your computer for sharing. The cloud, yes, but with some restrictions, because people have to know how to reach your cloud Galaxy; it might not be open to the world. And you can use GenAP for that. If you need all the tools that are on the main Galaxy, and there are many of them, then of course the main server is the way to go. Local, probably not, because installing all those 400 tools would be very complicated. The cloud is also a good option, although you have to know a little bit of Linux to get all the tools inside of the cloud. And on GenAP, the tools are right there for you. If you need absolute security, that's the big problem. If you have clinical data, you should never use the main Galaxy, and there are several reasons for that. One of them is that the main Galaxy server uses a protocol called FTP to transfer data, and anything that goes through FTP goes unencrypted. Basically, your data goes in plain text. So if you have a hacker attack, what we call a man in the middle, the attacker gets everything that you sent unencrypted and has all your data. So be careful what you put inside of the main Galaxy. Local, yes: it's inside your computer, so as long as your computer is safe, you should be safe. For the cloud, I put a yes with an exclamation mark, because you also need to know how to secure your cloud. If you're comfortable with that, managing your certificates, managing security groups, then you'll be fine. If you're not that comfortable with this aspect of the cloud, be careful how you put your data there.
And GenAP is actually a safe bet, because I modified the Galaxy on GenAP to use another protocol, called SCP, so all your data goes encrypted. Even if it is intercepted, your data would not be compromised. In terms of price, the main Galaxy is free. The local Galaxy is free; it's yours. The cloud is not free, so you're going to have a cost associated with firing up that image and running your computation. Be very careful with that, because when you look at the price of one hour of an Amazon instance, it looks like next to nothing. It's something like 68 cents for a good-size instance. But over time that adds up, and it gets pretty costly. And GenAP is free, so GenAP is a good option if you need security, scalability and no cost. So, the challenge of having so many Galaxies, so many options. The problem is that Galaxies are not all created equal. Like I said before, some people have Galaxies that are very niche, so you cannot assume that just because it's named Galaxy it's going to do what you want it to do. And the Galaxy team is trying to move to a kind of empty-shell installation, where you install just the Galaxy engine and you choose which tools to put inside. That's why not every Galaxy is made equal: some institutes might do just one type of analysis, and might prefer to install only the tools they are interested in rather than maintain every type of tool. Adding tools to Galaxy may also cause problems, so they try to avoid putting everything inside of Galaxy. And Galaxy has this concept, not new, it has been there for a while, of the Tool Shed, which is a repository of tools: developers do the development of a tool and put it in this repository, and the admin of a Galaxy can just fetch the tool from there and it gets installed locally.
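To make the earlier remark about cloud pricing concrete, here is a small Python calculation of how an hourly price adds up. The 68 cents per hour is the figure quoted in the talk; actual prices vary by instance type and over time, and storage and data transfer costs are left out of this sketch. The function name instance_cost is made up for illustration.

```python
HOURLY_RATE_USD = 0.68  # figure quoted in the talk; real prices vary


def instance_cost(hours, hourly_rate=HOURLY_RATE_USD):
    """Compute-only cost of keeping one instance running for `hours`."""
    return round(hours * hourly_rate, 2)


one_hour = instance_cost(1)
one_day = instance_cost(24)
one_month = instance_cost(24 * 30)  # an instance left running for 30 days

print(f"1 hour:  ${one_hour:.2f}")
print(f"1 day:   ${one_day:.2f}")
print(f"30 days: ${one_month:.2f}")
```

Under these assumptions, an instance that is never shut down costs several hundred dollars a month, which is the sense in which a small hourly price becomes costly.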
Just to show what the Galaxy project looks like: when you go to that website, you end up on the main Galaxy website, where you have the option to use Galaxy, and you're going to be redirected to the main server. You have Get Galaxy, if you want to try to install Galaxy locally or if you want to use it in the cloud, so there's a tutorial on how to use it on Amazon. There are also some screencasts. Actually, there are a lot of screencasts about Galaxy and how to use it, and a lot of pipelines. How to do RNA-seq, ChIP-seq, SNP calling, you name it, there will be a tutorial there. And there's Get Involved, where we have several types of mailing lists for users and developers. So the main Galaxy, like I showed before, that's what it looks like. We have CloudMan, which is the interface used to deploy your Galaxy on the Amazon cloud. And you have a wiki page that lists all the public Galaxies that we have in the world. People who have a public Galaxy put their Galaxy there. But like I said, not all of them have all the tools, so some of them may not have the tool that you want. Here's an example that a user would not actually see, but an administrator would. The Tool Shed has this list of analysis categories, such as assembly, ChIP-seq or SNP analysis. And if you do a search, for example for tools related to SAM files, you get a set of tools, then descriptions of the tools, and then a little more about who developed each tool, et cetera. And from there, an administrator can install the tool from the Tool Shed into Galaxy. This is Galaxy on the cloud. As you can see at the top, the URL is an Amazon address. It's an Amazon DNS name that maps to an IP address. But it's basically the same interface across projects. Any questions? So I'm going to talk just a little bit about this project called GenAP. It's a Canadian project.
It's developed by a team here, the team of Guillaume Bourque, by the team of Brian Caron in HPC at McGill, and by a team at the University of Sherbrooke. We developed this platform to facilitate the use of genetic and genomic data and tools for life science research. GenAP sits on top of the CANARIE network, which is a high-speed network that we have in Canada, and it is also supported by Compute Canada, which is the organization that oversees all the supercomputers in Canada. So GenAP sits on top of a very fast network and a network of supercomputers across the country. GenAP hosts tools that are available to Canadian researchers and their international collaborators through our web portal. And through GenAP you can instantiate your own Galaxy, so you can create a Galaxy that belongs only to you. That Galaxy is not shared with anyone but you and the people that you choose. When you go to GenAP, that's what you see at first. To make use of GenAP, you first need to create a Compute Canada account, so there's that link there. Once you register with Compute Canada, you are automatically registered with GenAP, so you don't need to come back and create another account in GenAP; the two things are done together. And once you have a Compute Canada account, you can log into GenAP, and you're going to see an interface similar to this, where you can create different types of applications and start using them. You can create your Galaxy. There are data hubs. There are track hubs for the UCSC browser. You have a private UCSC browser inside of GenAP. So, like I said, you first have to create an account with Compute Canada, then log into GenAP, and then you can create your Galaxy and control who is going to use that Galaxy with you, who you want to share it with. Currently those Galaxies can be instantiated on two supercomputers.
One is at the University of Sherbrooke, and the other is at McGill. So the way you create your Galaxy is super simple. You go to that page and there's a button here that says create an application. You choose Galaxy and you say which cluster you want to put that Galaxy on. Then you say create and, voilà, you have a Galaxy installed that's private to you. And that's as hard as it's going to get. As you can see, the interface is pretty much the same as the main Galaxy. We try to make the GenAP Galaxy a mirror of the main Galaxy, so everything that you see on the main Galaxy you're going to find in GenAP, plus some tools that I developed just for GenAP. Here's a small set of the tools that you may find inside of Galaxy. You have ways to get your data. You have ways to visualize your data. SAM tools. Tools for peak calling, et cetera, et cetera. VCF tools. Virtually everything that you did in this workshop in these two days could have been done through that interface. So basically, what you do in Galaxy: you start by logging in to Galaxy; then you get your data, either you upload it from your computer or you fetch it from an external source; you manipulate your data; you save your steps in a workflow, if that's some sort of analysis that you want to do more than once; and you have the option to publish through the interface. That's pretty much what you would do in the main Galaxy. You go to login or register, depending on whether or not you are a returning user. Creating an account is a very simple process: an email, a password and a public name. Galaxy also allows you to use a public OpenID, so if you have an account on Google, you just tell Galaxy to authenticate you through that Google account.
Actually it has Google, Yahoo and AOL, whoever uses AOL nowadays, but anyway. So that's the interface to get your data. If you go to Get Data and expand that little menu, this little box appears. That's something you're going to do in the practical. You have basically three options. You can choose a local file, a file that's on your computer. You can choose a file that's on the FTP, so something that you put on an FTP server and Galaxy fetches for you. Or you can give Galaxy a web address, and Galaxy goes and fetches the data for you, so you never have to download anything. In this example I'm downloading a fastqsanger file: I just paste the URL there and tell Galaxy to start. Galaxy also has other ways of fetching data. One of them is fetching data from the UCSC browser; these are actually two platforms that are very well integrated. I won't explain what the UCSC browser is, because I'm sure you all know. In the Table Browser you have a little button here that sends your output to Galaxy. Here I'm downloading from the human genome, chromosome Y, in BED format, and I'm sending that output to Galaxy. So from the UCSC browser interface I can import a data set into my Galaxy history. Galaxy, like I said, has hundreds of tools, and it can sometimes be complicated to navigate through its interface. One way of navigating the Galaxy interface is to search by keyword. You type the keyword that you want, the name of your tool, and it has to be a keyword of at least three letters. Galaxy will search through its menus and show you the potential matches for what you're looking for. That's what I hope you do today, because there are a lot of tools you have to use: instead of going through the menus, you just type the name. So, like I said, since you can expand those
little menus, you can actually look for a tool that way too. For example, here I'm looking for a tool that has SAM in its name, and I'm showing three results: what it looks like in the cloud, what it looks like in the main Galaxy, and what it looks like in GenAP. Just by looking for SAM, they are slightly different, because of the way the search is implemented. In the main Galaxy it is more permissive. For example, the first hit, where my mouse is, has nothing to do with SAM files, but because there is "sam" in "sample", it appears there. In GenAP I don't allow that to happen: I want only things that really have SAM in them to be there. But it's basically the same interface. Once you choose the tool that you want to run, you're presented with its graphical interface, where the first line here is your data set, the data set you're trying to analyze, and the other lines are pretty much the parameters that you give to the tool. We normally try to write tools that have a help section here, which tells you what the tool does and what every single parameter means. We try not to be too extensive, but to explain as much as we can. Once you import data into Galaxy, it appears on your right side as a green box, something like that. And you have these little icons here, and each one performs some action. If you poke the eye, it shows you the data set. If you use the pencil, you can set the metadata of your data, so you can say which build your data is associated with, which type of file it is, et cetera, et cetera. And the X is to delete the file. For example, if you poke the eye of a FASTQ file, this is what you would see. If you use the pencil, you can set all the metadata, so you could say, well, this data comes from hg19; you can put in information about the provenance of the data; you can rename your data. That is in agreement with those reproducibility rules that we
talked about before: someone trying to do the same analysis will have all the information about your analysis just by looking at your history. As an example, here I'm showing how to do a quality control. If I want to do quality control on these four data sets here, I choose a tool called FastQC, I set the parameters, and this is what Galaxy shows me. You are familiar with that; we saw it yesterday. It's the same type of result that you obtained using the command line yesterday with Mathieu. After that you can go about trimming the data sets for quality and then reanalyzing. But there is no command line visible in any of that. Behind the scenes, under the hood, Galaxy is running all the command lines for you. Everything happens as a command line; you as a user just don't see it happen. So Galaxy has the concept of a history: basically every step that you take is recorded. You can take that history, put it inside a file and give it to someone else, and they will be able to upload that history into their Galaxy, see everything and reproduce your result. In the history menu you have a little icon, a kind of gear icon, with the history actions. Some of the most important are these. You can extract a workflow, so you can tell Galaxy, take these steps and construct a workflow, so I never have to repeat them by hand again. You can share the history with someone, with one user, or you can make it public to anyone with the web address. And you can do things like delete the history, rename the history, or export it to and import it from a file. Here is an example of publishing: I click on my history, and I'm going to share it with a user, or I'm going to publish it and make it available to anyone. And if you extract a workflow, what Galaxy presents is this: you just have to give the workflow a name. You
give the workflow a name, and then you just create the workflow, and this is what the workflow looks like. It's basically all the steps that you have done. So the next time you want to rerun that analysis, you just need to tell Galaxy what your input files are, and Galaxy will do the rest for you. Galaxy is aware of the steps and of the dependencies between the steps. If you need to trim and then align, Galaxy knows to hold the alignment job until the trimming is done, and only then start the job for the alignment. There are a lot of tutorials for Galaxy. You can go to the videos: Galaxy has a page on Vimeo with several screencasts of how to do Galaxy analyses of several types of data. Here is an example of peak calling. Another thing is user resources. You have usegalaxy.org and Galaxy on the cloud, so those are good places to start. Galaxy has user support that is very, very active. It's called Biostar, and that's where everybody takes their questions about Galaxy. You can search, you can just type your problem, and normally the same type of problem may already have an answer. If it doesn't, people are very active in answering you. And finally we have GenAP. The GenAP web page is genap.ca. We have a wiki as well on how to use GenAP, and you have support if you ever have a problem. I strongly encourage you to get a Compute Canada account, whoever here doesn't have one. It's free. We provide an amazing amount of compute power for you, and you have resources such as GenAP and Galaxy, and also pipelines: pipelines on the command line and pipelines in Galaxy that are available to you. We also have a team of bioinformaticians who can support you. So if you have problems with your analysis, or your code is too slow, we want to help to parallelize your
code, we want to help run your code on supercomputers. We have a big team of people who can help you. I strongly encourage you to get an account with Compute Canada and to try out GenAP. So, questions? Locally? You can still ask questions if something goes wrong locally, yes. I may not always be able to tell what's going wrong, because I don't know which libraries or programs are installed on your computer, but I won't prevent you from asking questions. Yes. Yes, yes, you just have to go to Compute Canada, and there's a little form that you fill in. Yes, there are. I'm not sure if it's going to appear here, but there are actually several tools in GenAP for bacterial genomes. And if there's something that is not there and you need it, you can contact us and we can try to help you with that. Plants? Behind Galaxy we have basically three terabytes of data for different types of genomes. Right now we have more than 300 eukaryote genomes in Galaxy, and for bacteria we have almost everything that they have in the NCBI reference genomes installed. Like I said, just about 300 genomes, and we have indexes for BWA, GATK and all sorts of aligners, so with everything we have basically six terabytes of data that are reference genomes inside of GenAP, inside of Galaxy. Not your tool, but you can load your own data. Not only the references there; you can load your own reference, yes. Actually, today in the practical you're going to upload your own reference; you're not going to use the reference that's there. So if you have a genome and want to use that genome as a reference, there's no problem. Do you use tool to add our server after the siloing, so is it any possible?
Yes. So that's another thing about Galaxy: Galaxy never deletes anything. Even if you delete something from your history, Galaxy keeps that data somewhere; it's just not displayed to you. So how can we get... Locally, for you... but you're going to see, when you delete some data, there is a button in Galaxy now that says undelete my data, and you can actually just recall it. You cannot look anymore? So cyber intrusions are after that, so we just lost it, and then after the internet come back, but still cannot... Your Galaxy didn't come back? Who's maintaining your Galaxy? I mean, if they want to write to us, I'll be okay helping them to bring it back. Can you extract the code behind Galaxy? Let's say you work up your workflow, you really like it, and then you want to have the command lines behind it. Can you do that? No, unfortunately, no. You can look at every single job individually and see what its command line was. You can see that, but to extract it you would have to copy and paste into some sort of document. But it is possible to know everything that you ran, with all the commands, et cetera, et cetera. There are several sites. On which site, actually? So you should go to genap.ca; you should have an account with Compute Canada first; then you pick one site. Yes, right now we have Galaxy at two sites: one is at Sherbrooke and the other one is at McGill. And then you can start a Galaxy. Well, you have to have an account at that site first, because you want to put the data inside of that computer. You can do that through the web interface, so you don't actually have to use the command line: you just ask for an account, and then from Galaxy you put your data there. There are more advanced ways of putting data inside of Galaxy that involve the command line as well. But the only things you need are a Compute Canada account and an account at Sherbrooke or at McGill, and then you can use your Galaxy right away, no programming
required. Yes. So the process of applying to Compute Canada is a bit convoluted, because you have to apply for a Compute Canada account, and then you have to go and say, I need an account on a specific cluster. So you say, I have an account with Compute Canada and I need resources at Sherbrooke or at McGill. Then they give you a login and password for that computer, and that's what you use to log in to GenAP. Thank you very much.