Okay, so the Creative Commons license in front of all our presentations comes from my workshop; I added this one, which basically encourages you to share, copy, and reuse. In the context of a lecture, what remixing allows is that you can go to the slide deck — I haven't linked to it yet, but I'm also going to make my PowerPoint files available — and take the one slide you want, with the caveat that you need to share your slides as well. So this afternoon, to ease off into the end of this workshop, we're going to do Galaxy. How many of you have used Galaxy before? How many of you have never used Galaxy? Never even heard of Galaxy? Okay. How many of you have done it before? Oh, there you go. To the Galaxy developers in the room: if you have any suggestions for my slides, let me know — but this talk is not aimed at Galaxy developers. So this is who I am, and these are the Twitter handles; I should have shown these at the beginning of the workshop. #usegalaxy is the main hashtag to use whenever you tweet about Galaxy. The disclaimer is that I don't make any profit from any companies or products I may talk about. I am on the Galaxy Scientific Advisory Board, but I don't get any money for that — I just work for them for free. So we're going to talk about workflows, about reproducibility in science and how Galaxy can be used for that, and we're going to use it in the context of next-gen sequence analysis. But Galaxy itself as a tool predates next-gen sequencing, and part of the Galaxy toolkit is used for a lot of other things that predate next-gen sequencing. Really, if you're interested in Galaxy and you've already used it, you've probably used it outside the context of next-gen sequencing.
So what do biologists do? Well, we make observations, make hypotheses, challenge them, and describe what we find in papers. And more and more we're doing this in the context of RNA-seq, or protein mass spec, or interaction pathways, and so forth — there's a lot of information space that we use. The central dogma, as you know, is DNA makes DNA makes RNA makes protein. What I like to call the NCBI version of the central dogma is: DNA makes RNA makes protein, and then you write a paper about it. And therein, unfortunately, lies the challenge with reproducibility in science: a lot of the information about the process and the experiment is actually hidden, not described in the publication. So if you're doing a sort of bioinformatics archaeology — trying to reproduce a pipeline or method that was used in a paper — you have to read the paper in detail and try to extract the versions of the tools they used, if they even talk about versions, or even about the tools at all. There's a lot of hidden information that isn't always available, and that makes reproducibility in science very difficult. One of the things we do in the sciences is experiments, as I mentioned, and so we also do bioinformatics experiments. I think of a bioinformatics experiment the same way I would a wet-lab experiment: it's got reagents, you do controls, you do interpretation, and so forth. A classic bioinformatics experiment is using BLAST. How many of you have never heard of BLAST? A few — okay, good. So basically in BLAST you have reagents: your sequence and your databases. You have a method: you're doing either a protein-protein search, a nucleotide-protein search, or a translated nucleotide search, and so forth — those are the various types of BLAST you can do. And then you have an alignment, from which you get your interpretation.
You look at similarities, and you're doing hypothesis testing: you're testing whether or not there's a similar protein or a similar gene in your organism versus organism X. Or, if there's a gene that you've sequenced, you're trying to figure out what function it may have by its similarity to a gene in another organism where it's been better studied — for example, studied in mouse, where there are knockouts and phenotypes, versus your human disease gene. So you have to know your reagents, and when you do an experiment you have to know your methods and you have to do your controls. So what kind of control could one do in BLAST? Not all at once now. Conceptually, what's an idea for a control? Well, for example, if you have a cDNA and you have the full genome of the organism the cDNA came from, and you search that cDNA against that single genome, you should be able to find it, right? If you don't find it, there's something wrong with BLAST or something wrong with the parameters you're using. In the same way, if you're using a protein database and searching for your favorite protein, you know which proteins are in the database; if with a certain set of parameters you can't find your protein, you know you're using the wrong parameters. So those are a positive and a negative control you can think about. The concept of doing and redoing experiments is that if you do something once, you don't usually script it up, but if you do it a hundred times or a thousand times, you definitely will want to. And if you want to share the way you did something, rather than just telling somebody "I used this tool and that tool," it's better to actually share the script or the method you did the analysis with.
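That positive-control idea — a cDNA searched against its own genome must find itself — is easy to make mechanical. Here's a minimal sketch in Python over BLAST's tabular output (-outfmt 6); the query name and hit values are hypothetical, not from any real run:

```python
def passes_positive_control(tabular_lines, query_id, min_identity=98.0):
    """Scan BLAST tabular output (-outfmt 6) for a near-perfect hit of the
    control query, e.g. a cDNA searched against its own source genome."""
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        qseqid, pident = fields[0], float(fields[2])  # query id, % identity
        if qseqid == query_id and pident >= min_identity:
            return True
    return False

# Hypothetical tabular hits: qseqid, sseqid, pident, length (later columns omitted).
hits = ["myCDNA\tchr7\t99.8\t1200",
        "myCDNA\tchr2\t71.3\t340"]
print(passes_positive_control(hits, "myCDNA"))  # True: the cDNA finds its own locus
```

If a check like this fails, suspect your parameters (filtering, word size) before suspecting the biology — that's exactly the point of running the control.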
And sometimes scripts get too complicated and you have to think about something else. Some of the requirements that are fairly standard in computational biology — not 100% universal, but definitely quite common — are that the tools we work with are open source. It's been a standard in the bioinformatics.ca workshop series that all the tools we teach are open source tools. I would say that the bioinformatics community — as opposed to, say, the chemists, and maybe not quite as much as the physicists — has fully embraced the open source community in general. If you go to any large-scale genome informatics or biology of genomes conference, the tools that are talked about, and the tools used in the analyses that are talked about, are open source tools 99% of the time. The advantage of open source, of course, is that you have many more eyeballs on the source code and on what the code is actually doing. It's not a black-box phenomenon, which is always a concern, and it allows people to share things, provide different versions, and so forth. A good example of a community that uses a really large set of open source tools, and one that has been talked about already at this workshop, is R and Bioconductor. "R" is a little hard to tag in a tweet, so the usual Twitter tag for R is #rstats, as in statistics. And Robert Gentleman wrote a great paper a few years ago about reproducibility of research, using bioinformatics as a case study.
And Robert argues for the inclusion, in all your papers, of all the scripts used to generate all the figures presented in the paper, so that anybody can take the script you wrote to generate a figure, put in their own numbers or data and see what figure they get, or modify your script and make it better, and so forth. So there's a really strong argument from the R community for this, and for the use of Bioconductor, which itself has gained more and more modules over the years for next-gen analysis. Another, more recent paper that's quite good is from James and Anton and other colleagues in the Galaxy community — a PLOS paper on ten simple rules for reproducibility in computational research. As a quick summary of these rules: for every result, keep track of how it was produced; avoid manual data manipulation steps (i.e., going into a file and editing the things you don't like); archive the exact versions of all external programs used; version control all custom scripts; record intermediate results, when possible in a standard format; for analyses that include randomness, note the underlying random seeds — BWA is an example of that; always store the raw data behind plots, when that's possible. What's an example of where we don't keep raw data anymore? Wakey-wakey. No, in the life sciences — sort of related to this course. Yes: the image files from sequencers. We don't keep those anymore. So we really are moving away from keeping all raw data; we just don't do it. But in general, if you can and want to make things reproducible, you want to keep as much as is feasible. Feasibility comes in as a factor — basically cost and storage and things like that — which in some cases becomes very difficult.
Then: generate hierarchical analysis output, allowing layers of increasing detail to be inspected; connect textual statements to the underlying results; and, rule number 10, provide public access to scripts, runs, and results. There are different ways of doing that. One way we use at the OICR is a tool called SeqWare, which is a very command-line-rich and very powerful tool for running multiple pipelines. All the pipelines we run in next-gen sequence analysis at OICR, in the sequence production group, go through SeqWare. SeqWare, in a nutshell, writes down all arguments, metadata files, and so forth to a metadata database. It keeps track of the tools, the versions of the tools, the arguments used with each tool, where the input file came from, where the output file was written, and so forth, so that you can go back and have very detailed records of your experiments. You can also program it so that it knows when sequencing runs are finished and can then kick off the whole chain of events needed to run your pipelines. So it's a very powerful tool. Apart from one or two people in this class it's probably beyond what you need, but I just wanted to make you aware of its existence. Another tool where you can build pipelines and keep track of which tool versions, which parameters, and so forth were used is Galaxy, and that's the one we're going to be using in this class. There are lots of papers written about Galaxy. Galaxy is an NIH-funded project, and all of its publications, I would say, are open access. This one is a bit older but very relevant; this one is much more detailed on using Galaxy in the cloud and is also very relevant to this class. The first thing you become aware of when you start using Galaxy is the different versions of Galaxy that are available.
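To make concrete what a SeqWare-style metadata store records for a single pipeline step — the tool, its exact version, the arguments, and file provenance — here's a small sketch. The tool name, version, and file names below are made up for illustration, and real SeqWare records far more than this:

```python
import datetime
import hashlib
import json

def provenance_record(tool, version, arguments, input_name, input_bytes, output_name):
    """Build a minimal provenance entry for one pipeline step:
    which tool ran, at which version, with which arguments, on which input."""
    return {
        "tool": tool,
        "version": version,
        "arguments": arguments,
        "input": {"name": input_name,
                  "md5": hashlib.md5(input_bytes).hexdigest()},  # input checksum
        "output": output_name,
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

# Hypothetical step: aligning one FASTQ file with bwa mem.
rec = provenance_record("bwa", "0.7.17", ["mem", "-t", "4"],
                        "sample1.fastq", b"@read1\nACGT\n+\nIIII\n",
                        "sample1.bam")
print(json.dumps(rec, indent=2))
```

Keeping records like this per step is what lets you answer, months later, exactly which version of which tool produced a given file.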
So if you want to see all the various pieces, the homepage is galaxyproject.org, which has information about all the things I'm going to talk about today. The main public Galaxy server is usegalaxy.org — that's actually an entry point to another server, and the server has recently moved, but the URL is still the same — and that's the main Galaxy public server: you can go there for free and use all the richness of Galaxy, and we'll be doing some of that later today. Another one is getgalaxy.org, which is the source code for Galaxy: if you want to download Galaxy and install it locally on your own machine, in your department or your institution, that's where you get the source code. If you're at a university that has good compute infrastructure and good systems people, I would recommend this avenue, where you can customize things and make sure you have the versions of the tools you want, locally. At OICR we have a version of Galaxy that we run internally behind our firewall, all secure, with our own data access rules and so forth. Yes — do you have to what? Update? Ask the guy sitting next to you. Xybin does all the Galaxy updates. The major updates are when people want a certain tool updated: the Galaxy box itself doesn't have to be updated that often, but adding new versions of TopHat2 and so forth — that's most of the updating. They do make a Tool Shed available, which I'll talk about a bit later, and which makes updating tools a bit easier. But that's the main maintenance activity, I would say.
So a third version of Galaxy — the first being the main public server, the second the one you install on your own machine — is the cloud version, which is what we'll also be using today. That version runs on Amazon. You could install Galaxy on other clouds should you be so inclined, but the Amazon image with Galaxy on it is publicly available to anybody who wants to use it, so you can start using Galaxy on Amazon directly. Yes? Can you have Galaxy installed locally and, when a certain job needs more computational power, go to the cloud? That's a good question. The big caveat there is the data: you have to transfer the data to wherever you want to do the compute. On Amazon, in general, they don't charge for uploads. So you upload the bigger files, you do a bunch of analysis, you usually end up with a smaller file at the end, and you download that file — you're charged for the compute and you're charged for the download. I think using the cloud as "I need more horsepower three times a year" is actually a very rational use, and I think that's the market niche that suits bioinformatics best.
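The economics just described — free uploads, pay for compute and for downloading the (usually much smaller) results — can be put into a back-of-the-envelope calculation. The rates below are hypothetical placeholders, not actual Amazon prices:

```python
def cloud_run_cost(cpu_hours, download_gb,
                   rate_per_cpu_hour=0.10, rate_per_gb_out=0.09):
    """Cost model for a burst job on a cloud that charges for compute
    and outbound transfer only; uploads are free in this model.
    Both rates are hypothetical placeholders."""
    return cpu_hours * rate_per_cpu_hour + download_gb * rate_per_gb_out

# A burst job: 200 CPU-hours of alignment, upload 100 GB of FASTQ (free),
# download a 5 GB BAM at the end.
print(round(cloud_run_cost(200, 5), 2))  # 20.45
```

The asymmetry is the point: because only the small result file comes back out, occasional big jobs can be much cheaper in the cloud than maintaining idle local hardware.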
Large HPC groups — people like ourselves with large compute infrastructure — would actually want to use Amazon or someone like it on a spot-instance basis as well, but we have challenges with respect to consent forms and access, because it's human data we're dealing with, so it's a bit complicated for us to use Amazon — even though Amazon is, I think, more secure than most academic clouds or academic compute infrastructures. So apart from the data transfer: you can use the public version until you have big jobs, do the big jobs on Amazon, and use the public version for the rest — that's another way of thinking about it. The nice thing about your own private version is that you have all the controlled-access rules that apply at your institution, on compute infrastructure that you control, which is usually easier to deal with. Yes — like from a company? Okay, yes, and there are other companies too. And then a fourth or fifth option — commercial entities fall into this category as well — is other versions of Galaxy hosted elsewhere, either public or private. These are often specialized: I know there are some specialized in transcriptomic analysis, for example, or in mass spec analysis, and things like that. There are more than 50 of them right now that are public; you can go upload your own data at university X, Y, or Z and do your analysis there. So this is the homepage for the Galaxy project; it points to all the things I just talked about, and it also points to a lot of tutorials, videos, and so forth for learning about Galaxy — I highly recommend it. And this is the usegalaxy.org homepage: if you go to Galaxy now, this is what it looks like. That said, I don't want you to do that right now, but if you did
do it, that's what it would look like. On the left side, generally, are all the tools and the actions you can use to get started with Galaxy; on the right side is the history of where things are at and what you're doing; and in the middle is the activity area, where you set parameters, do things, and look at results. Galaxy has a very active user community that will help you install and run it. I remember when we installed Galaxy here: they provided all sorts of help — and this was before I was on their Scientific Advisory Board — they even offered to send a developer to help us install it locally if we needed it; they were really keen to get it going. On their help desk they now have two people. One deals with all the email — probably on the order of 50 to 100 emails a day — responding to questions and watching Biostars, the various other question portals, Twitter, and so forth. The second is basically the traveling show person, who goes to all the various conferences, does a lot of the presentations, and also monitors the various newsgroups. The cloud pages likewise have lots of detail about the various ways of using the cloud, and I'll show you that a bit later. And like I mentioned, there are 50-plus Galaxy servers providing various specialized or more generic Galaxy services. So who is the target audience for Galaxy? Galaxy talks about integrating input and data at the source. It gives you many tools that you don't need to install and maintain — a one-stop shop for a lot of bioinformatics analysis, simple and complicated. It also allows you to maintain workflows, reuse them, and share them, so you can develop
a workflow and then share it with your colleagues, and so forth. It makes it very easy to share and publish experiments. You can even think about building a workflow, publishing it within Galaxy, making it publicly available, and putting it in your paper, so that you have a very detailed description of what you did in your analysis through a workflow that accompanies the publication. With the publication you'd have all the metadata about your samples and experiments, but in the workflow readers could see the versions of the tools you used and the parameters you used, and they could plug in their own data and repeat your experiments. That would be a very positive way of sharing published bioinformatics protocols. And like I mentioned earlier, Galaxy is fully in the next-gen space, but it predates it — it's been around a lot longer — so there's a lot of other analysis in there: genetic analysis, GWAS-type analysis, R stats packages integrated into Galaxy, and a number of other things. And, as I mentioned, it also works in the cloud. One of the central core dogmas of Galaxy is reproducibility — being able to reproduce an experiment. It keeps a history of what you did, so you don't have to write it down yourself; it automatically tracks these things, and you can save the histories and share them with others, which makes it very easy to work with collaborators down the hall or across the globe. Galaxy is really designed with the biologist in mind: it thinks like a biologist more than it thinks like a programmer, it's really meant for biologists to keep track of their experiments, and its graphical user interface is made for biologists — it's really a tool for biologists. Programmers may find Galaxy very frustrating because it's not as easily scriptable — although, I'm seeing heads nodding in the back there, I know there are people developing scriptable
versions of Galaxy, which is counterintuitive when you think about it, but there's actually a big demand for it. There's this growing question of: I've built my pipeline, I've run it once, but now I want to run it a hundred times — how do I make that happen? And there are ways being developed to make that happen. The Galaxy team is very responsive — it's about a dozen people, 8 to 12 in the core team, on two sites — and they really listen to their community. But there's also a larger community, non-Galaxy staff, who are developing things for the Tool Shed. For people developing tools, a good way to get your tool used by the community is to wrap it in Galaxy and put it up in the Tool Shed; if you make new versions of your latest tool available in Galaxy, that's a good way of propagating the tool to the community. So, like I mentioned, Galaxy helps biologists deal with all the tools and data. It's an NIH-, NSF-, Penn State-, and now Johns Hopkins University-funded initiative — my slide is out of date: it's not Emory anymore, James Taylor just moved from Emory to Johns Hopkins last January — and there's lots of learning material available at the URL here. Some of the challenges with Galaxy: not all Galaxies are created equal. You can go to two different Galaxy public servers, or to the Amazon one versus usegalaxy.org, and you won't find the same tools — each has its own — and that can be a little frustrating sometimes; we might even share some of that frustration with you today. One of the things the Galaxy team is thinking about is distributing Galaxy as an empty shell, basically. Question: is anything guaranteed? No, there are no guarantees, because the administrator of a given Galaxy instance can change the tool versions. What they can do is have
multiple versions of a tool available, so that at one place you use one version, and at another place you'll see two different versions — you can use the old one and the new one. usegalaxy.org, the public server, probably has the most up-to-date generic tools, the ones used the most; but then you'll have, say, a specialized transcriptomic Galaxy server, which will have the latest transcriptomic analysis tools. Yes — so actually, if you have a URL for a page that has all the things you're tracking, that URL will have everything: the versions of the tools and so forth. You can take that page and that saved pipeline and run it on any server — except it might break if they don't have that version of the tool, and then you have to ask the maintainer of that server to install it to be able to reproduce the run exactly. But what you did is accurately tracked in your document. So, like I mentioned, Galaxy is moving toward an empty shell with a sort of cafeteria model, where you go pick the tools you want and the versions of the tools you want, and that's referred to as the Tool Shed. That's turning out to be, I think, a great solution, because the Tool Shed not only offers you the full menu, so to speak, of all the things that are available, it also provides star ratings — you can rate things and give comments and so forth — so it's a really useful way of doing things. So how does the general workflow in Galaxy go? First you log in. You don't have to log in, but if you do, it will keep track of your history for you, so it's a really useful thing to do, whether you use the public Galaxy or the cloud version. Then you get data or you upload your data — there are lots of data sources within Galaxy, or you can upload your own. Then you manipulate your data: you do experiments on
your data, and you can repeat that multiple times. Then you save your output, and you can turn that into a workflow: this part — manipulating the data, saving the output — you can make a workflow out of, and you can publish a page which includes the data plus the workflow as a document, so that other people can put their data into your pipeline. It's relatively easy; it's a bandwidth issue, so if you're uploading large FASTQ files it may take a while, and there are very soft quotas right now on the public Galaxy server. You can do it the same way as with IGV, for example: with URLs, by reading files from your directory, and so forth, so from that point of view it's quite straightforward. There are also different upload clients available that handle large files better than FTP and things like that. Now, time for a sponsor announcement: Xybin has been a great Galaxy administrator. He used to work at OICR; he installed Galaxy at OICR and installed tools on it. We still take some of his time on contract and enjoy working with him very much, and he's the one who's been administering the Galaxy server for us for these two days. I don't want you to do this right now, but note this page — maybe fold a corner on it — because we'll come back to it at the end of my lecture: it's basically how to log in to Galaxy and what we're going to do in the class later. We're going to log in to the cloud, which will kick off a Galaxy instance on Amazon. You hit the cloud, you get this page, you log in, you put in your CBW number and the credentials — we'll get those to you later; you don't have them yet, and we're purposefully not showing you the passwords now because we don't want you to log in now. Then you get this Launch Galaxy page; you can
click on the link, enter your login and password from the previous page, keep everything in the default state, choose the platform type, and then you get this CloudMan console, which shows you one instance of Galaxy running here. If you wanted to — we're not going to; we're each only going to run one — but if you were doing this at home and wanted more boxes working on your project, more CPU, this is where you would select more instances and scale up. After things are all warmed up, there's a Start Galaxy button; you press that and you get an instance of Galaxy. This is Galaxy in the cloud: it's very similar to the other Galaxy I showed you before — all the tools on the left, your history on the right, and in the middle the work in progress. So what I'm going to do right now is show you a pipeline from an exercise that you will not do, just to show you the mechanics of using Galaxy. I'm going to do a transcriptomic experiment, which is not part of this workshop, but for those of you interested in mRNA analysis it's a taste of it. We actually have a separate transcriptomic next-gen sequencing workshop that is sold out for this year, so if you want it you'll have to come back next year, if you haven't done it already. This year we didn't offer it in Toronto — we only offered it in Montreal and Vancouver — and next year it will be in Toronto, next summer. So if you're interested in the transcriptomic workshop — a little marketing here. So before you start, as I mentioned, on usegalaxy the good thing to do is to log in, so you register. On this public instance you'll be a new user, so you'll have to register; even if you registered on the cloud before, it's as if you
were there for the first time, because this is a new machine that's never seen anybody before, so you'll have to register. Once you've registered and logged in, you really should look at your history, save things, and so forth — it's how you get back to your work. Galaxy in the cloud, as I mentioned, has the same layout: tools on the left, history on the right, work in the middle. This is an example, from this page of Galaxy in the cloud, of the things that are in one but not the other: I took a list of tools from each and did a diff between the two — that's what the less-than and greater-than signs are, a diff between the two lists of tools — just to give you a taste of the differences between the public usegalaxy.org and Galaxy in the cloud. And what's going to happen in the lab today is that some tools are only available on the public version and not on the cloud version, so we're going to do the first part of the lab in the cloud, then take our saved file and move it to the public version. If we have time — and you can do this on your own, since the last part is done on the public version — we're going to redo module 2 and part of module 3, the same things you did from the command line with Michael yesterday, but in Galaxy. So it's the other way of doing it: very similar, but with a different version of BWA and so forth — all the caveats representing the challenges of different versions of things being done in different places. All the items on the left panel, if you click on them, will expand into multiple choices, so the left panel is actually quite rich in tools, and sometimes it's a little hard to find the one you're
looking for. A common way I go about it: say I'm thinking of an NGS tool — I'll just type "NGS" into the search box in the top panel, and it will filter the tools for that string. You can do the same if you know you're looking for BWA or something like that: use the top search box to find it. For example, if you type in "SAM" you'll find all the tools related to SAM — and if you do that on the cloud versus on usegalaxy.org, you'll find different things. Now, one of the big data sources — I only have a couple of slides on this, but it's an important data source apart from your own data — is reference gene model sets, i.e., the known gene models. One place to get those is the UCSC Genome Browser. The browser makes the data available in the browser view, as we saw over the last couple of days, but it also makes the data available in table format, and that table format is what Galaxy uses and can incorporate into the various outputs it generates. Galaxy and the UCSC Genome Browser know about each other, and they allow you, for example, to send jobs one way or the other: results from UCSC can end up in Galaxy, and vice versa, Galaxy results can end up in the UCSC Genome Browser. So this is your standard UCSC homepage for the human genome. Other examples: I mentioned the standard genome view with all the tracks, but you can also get separate files — sequence in FASTA format, BED format (Browser Extensible Data), GFF (General Feature Format), and GTF (Gene Transfer Format). All these formats are generated by UCSC and can be imported into Galaxy. So this is a FASTA file. Has anybody ever used the FASTA program?
You have? And here's an older one: FASTN and FASTP? No? Okay, I'm the only one who's used those. So FASTA is the Lipman and Pearson search algorithm; it predates BLAST, and one of the major contributions of that program was to propose a format for protein and nucleotide files. The file format is relatively simple: a greater-than sign, a string, and then the sequence. That's it; that's the whole definition of the format. NCBI, EBI, Swiss-Prot, and various other places have added more structure to that first line, with their own codes and so forth, but basically FASTA is a greater-than sign, any string that describes what's in the file, and then the sequence. At NCBI, where I worked for a few years, we had programmers who came from physics rather than the life sciences, and we used to explain to them that the difference between a protein and a nucleotide file is that if it's less than 85% ACGT, it's probably a protein file. They got it right most of the time. Then there's the BED format. One of the great things about UCSC is that it actually has a very good description of all these various formats that are used by IGV, by UCSC, by Galaxy, and so forth. One of my go-to moves is to just search "UCSC file formats", and you'll get descriptions of all the file formats used by these various tools: GFF, GTF, and so forth. So this is a general workflow for Galaxy, as I mentioned. And there are pages in Galaxy, so you can actually go and look at workflows that have been made public, either by Galaxy staff or by other people who want to share their work. There's a place that keeps all the published pages, and they have ratings. So, for example, there's an RNA-seq page where they make the data and the pipelines available.
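As a quick aside, that 85% ACGT rule of thumb from a moment ago is easy to sketch in Python. This is just a toy classifier for illustration, my own code, not anything Galaxy or NCBI actually ships:

```python
def looks_like_protein(seq, acgt_cutoff=0.85):
    """Heuristic: a sequence that is under 85% ACGT is probably protein."""
    seq = seq.upper().replace("\n", "").replace("-", "")
    if not seq:
        return False
    acgt = sum(seq.count(base) for base in "ACGT")
    return acgt / len(seq) < acgt_cutoff

def classify_fasta(text):
    """Walk FASTA records ('>' header line, then sequence lines) and
    return (header, 'protein' or 'nucleotide') for each record."""
    records = []
    header, chunks = None, []

    def flush():
        if header is not None:
            seq = "".join(chunks)
            kind = "protein" if looks_like_protein(seq) else "nucleotide"
            records.append((header, kind))

    for line in text.strip().splitlines():
        if line.startswith(">"):
            flush()
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())
    flush()
    return records
```

So a record whose sequence is pure ACGT comes out as nucleotide, while a typical amino-acid string falls well below the cutoff and comes out as protein.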
So you can just download the data, download the pipelines, and run the workflows; it's quite useful. Galaxy team member Jeremy has done a lot of RNA-seq analysis pipelines, workflows, and pages, and he makes them all available on this website. This is the RNA-seq analysis exercise, and all the data is publicly available data. What they've done, the same trick we've used in many of the workshops here, is build the transcriptome data from a very small region of the genome, which makes the files smaller and easier to process. In this case they have brain, thymus, pancreas, and ovary RNA. These are the places to get the data sets; they're all publicly available, so you can copy and paste or type these URLs and load the files. The first thing you do in Galaxy is load files, and often you leave the parameters at their defaults. In this case you copy and paste the URL from the previous page, hit Execute, and it will auto-detect the file format, so it recognizes it as a FASTQ file, and so forth. In the right panel you have the history, and when you first launch a command there are different colors: gray means it's waiting to get started, then it switches to yellow, which means it's actually running, and green means it's done. If it messed up for any reason it turns red instead of green, so you know you have to go redo it: either you gave the wrong arguments or the command failed for whatever reason, and you have to go investigate. If you do this exercise you'll have different numbers here in the history; this is the first command, the second command, I skipped one, then the fourth, and so forth. Once each item is finished, you can delete it by hitting the X, and you can edit its attributes, so you can edit the summary of that step if you want, and add more information about your analysis and why you did it this way.
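Galaxy's format auto-detection is more thorough than this, but the basic idea of sniffing FASTQ versus FASTA can be sketched in a few lines. This is a toy version of my own for illustration; the function name is mine, not Galaxy's:

```python
def sniff_format(text):
    """Very rough format sniffing: FASTA records start with '>';
    a FASTQ record is 4 lines (@header, sequence, '+', qualities)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    if lines[0].startswith(">"):
        return "fasta"
    if lines[0].startswith("@") and len(lines) >= 4 and lines[2].startswith("+"):
        return "fastq"
    return "unknown"
```

The real sniffer also checks line counts, quality-string lengths, and so on, but the first characters of the records carry most of the signal.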
Then you have what I call "poke the eye": you can look at the file. I call it that because there's a little eye icon to view the file; you poke the eye with your mouse and then you can look at the data. These are the uploaded FASTQ files, and you can actually look at what the FASTQ files contain. Edit Attributes also lets you change the default name: often it will be something like "sample one of experiment one", a very cryptic description of that file, and Edit Attributes allows you to give it a much more descriptive, and often shorter, name. What I often do is take the default name that Galaxy generates, copy and paste it into the notes, so the Galaxy-generated name is preserved, and then put my own text in the name field, which makes it easier to navigate; brain1.fastq, brain2.fastq, and so forth are file names I added to the files this way. Then, for example, there is the FASTQ Trimmer, so you can trim things up, and the FASTQ Groomer, to convert things into standard FASTQ. In this example we do a QC manipulation: FastQC is a program you run over all the FASTQ files that we loaded. These are the brain and adrenal RNA files in FASTQ format, and what they look like: reads with bad bases. We have the cutoff score that Michael mentioned, so we say if it's below 20 we remove them, and you can do that for all of these reads; again, the exact numbers will vary. Then they do the RNA-seq analysis: the TopHat tool runs TopHat, and you map the RNA-seq reads against hg19. Because the reads are paired, you also want the mate inner distance, which is known; it's set to 110 base pairs for this library. As we mentioned, there are two ways to know this: you can sometimes work it out by looking at your insert sizes.
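A quick aside on that quality cutoff of 20: it refers to Phred scores, which in Sanger-style FASTQ are stored as ASCII characters offset by 33. Here's a small sketch of that decoding and of a simple 3' trim; this is my own illustration, not the FASTQ Trimmer's actual code:

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def trim_low_quality_tail(seq, qual, cutoff=20):
    """Trim the read from the 3' end while base quality is below the cutoff."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < cutoff:
        end -= 1
    return seq[:end], qual[:end]
```

So a base written as "I" decodes to Phred 40 and is kept, while "#" decodes to Phred 2 and falls below the cutoff.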
In this case it's documented with the library that was provided to us, so we know what it is. So these are the parameters you'll put in. (Thanks to Michael. Thank you very much, guys; safe travels.) So you run that: you put in the name of the file, and since you have multiple files and it knows all the files you've uploaded, you want to make sure you pick the right one. You confirm the grooming, make sure you have the right reference build, set paired-end reads, pick which FASTQ file you're going to map, and the length, 110, as I told you, and then you just hit Execute. Initially all the files are yellow, and then, this takes a while, TopHat was about 30 minutes for all the files, you end up with green files and you can look at them. There are different ways of looking at them: Galaxy has its own genome browser, yet another genome browser, this one called Trackster, which is actually quite nice and quite fast; it has advantages and disadvantages compared to the other ones we've talked about, and I'm glad to talk a little more about it later. An important thing is that now that you've done your experiment in Galaxy, it's easy to share with your colleagues: you can share the history, and you can also extract a workflow from the top-right option. Basically it extracts all the steps you've done, you can select which ones to include or not include in the workflow, and then you can further edit your workflow in a GUI interface where you can connect things, delete steps, and so forth. And I want to add that there are lots of tutorials, videos, mailing lists, Twitter accounts, and so forth available for Galaxy. There's a Vimeo channel for the Galaxy project which has good, high-quality videos; there's, for example, one on ChIP-seq data analysis, and there are the other RNA-seq-with-Trackster pages that Jeremy did, which are really good. And there are still other ways to use Galaxy. Another way is a project called GenomeSpace. GenomeSpace is another NIH-funded project, still funded today, though it might not last forever, but I've told you about it so you can go take advantage of it. It provides you not only Galaxy but a number of tools, and it connects all these tools together. So, for example, you can take an output of Galaxy into GenePattern, for a sort of gene-set analysis, or it could go into a bigger statistical analysis project, or into Cytoscape, or into IGV. What all these tools within GenomeSpace do, once you're registered, is take the outputs of one tool and convert them into the inputs of the next tool, so within GenomeSpace you can use Galaxy and these other tools, and the connectivity between them is ensured. There are all sorts of worked examples of why you would want to use this, and so forth. The big secret right now, and it's not really advertised or described that well here, is that the back end that holds these pieces together is actually AWS: it's Amazon, and it's free of charge. It's free Amazon for now; it might not last forever, but right now it is. So you can actually do next-gen analysis on Amazon compute infrastructure and storage, paid for by GenomeSpace.org. I don't know what the limits are exactly, but I imagine if you reach them you can ask for more. So it's a very well-kept secret right now. (Audience: do you know what the user agreement says, like whether whoever hosts it owns your data?)
No, no, no. I don't think Amazon owns any of the data that you upload to Amazon, and we have some expertise on that in the room. Other companies may do that, but Galaxy certainly doesn't, and none of these GenomeSpace projects do either. Most of them are NIH-funded sorts of initiatives, so they wouldn't have any rights to any of the data that you generate or use on their platform. From that point of view, I think it's quite generous. Okay, so I'm almost done, a bit early again. Some useful resources: Galaxy at usegalaxy.org, the Galaxy Cloud, the Twitter account I mentioned, and the mailing lists; there's a mailing list for developers, one for users, and actually a third mailing list for the sysadmin people. Yes, actually, that's correct; I knew that and forgot to mention it. So Michelle mentioned Biostar: the Biostar engine was developed at Penn State, and they're actually going to have one for Galaxy as well, so all the Galaxy support is moving away from the mailing lists. I still subscribe to them, I still get mail from the user list, the developer list, and the sysadmin list, but all of these are going to be moving to Biostar. There's also OpenHelix, which is another interesting sort of project. OpenHelix is a commercial help desk, so to speak: they make many of their materials openly available, like the UCSC Genome Browser material, so they have lots of tutorials and things like that for the UCSC Genome Browser that they make available, but they also sell their packages commercially to institutions and so forth. So there's a lot that's not publicly available, but some that is; UCSC actually got a grant to pay OpenHelix to make their material publicly available, and that's one of the ways they make their material available. SEQanswers is another public question-and-answer repository, much more directed toward next-gen sequencing analysis; Biostar is, I would say, more general bioinformatics than next-gen analysis, while SEQanswers definitely has a lot more next-gen material, papers of interest that I've mentioned, and so forth. So before we go to the coffee break, what we're going to do now is go back to page 20 or so, which has the instructions on how to log into the cloud, and with Zeven's permission, he's nodding positively and crossing his fingers, we're going to get started and actually log in; after the break we'll do the lab. So right now we're going to log in and make sure that everybody is logged in before we break. I just want to acknowledge Florence and Zeven for all the work and help they've done with me on making these lectures possible and making the labs work. They've done a really great job and I'm really thankful; but if there are any errors or mistakes, those are all my fault, not theirs. And I'm going to be around during the coffee break.