Good afternoon, everybody. My name is Francis Ouellette. I'm Associate Director of Informatics and Biocomputing here at the OICR and also one of the founding faculty of the Canadian Bioinformatics Workshops series, which we started 15 years ago, 16 years ago, in 1999, with the first workshop in Calgary. I've been involved with the workshops since then. There's actually one other founding faculty member who's still involved with the workshops as well: Dave Wishart at the University of Alberta, and he's still involved quite heavily. As you know, we're doing new workshops every year, and this year one of the new ones is the Metagenomics Workshop, which I think is full now. So join it next year if you want to, and the videos will be online at the end of the summer.

Some of you have been here for four days, so the week is getting long, and I totally appreciate how you guys are sticking it out. That's really good. This last lecture and lab is going to be a bit of a mind shift for all of you. We're actually going to leave the command line and go to the graphical user interface, and learn a quite powerful tool that's heavily used by many, many biologists worldwide. And so without further ado, I'll get started.

As you know, every presentation has a Creative Commons license slide at the front, and that's a very important thing. We started about eight or nine years ago to make our material freely available to the community, so you can download the slides and so forth. But there's a little catch that the other faculty probably haven't told you about. The Creative Commons license, or CC, comes in many, many flavors, and you don't really need a law degree to understand all the various flavors. 
But I'm just going to point out the two types of licensing we use in the CBW. One is CC BY, which means that if you're going to use this material you have to acknowledge where you got it from. The other is SA, which is share-alike. That's a really sneaky thing we stuck in there, which means that if you're going to use these slides in your presentations, you have to share your slides. You have to pay it forward, sort of thing. And so that's what the CC SA means. On top of that, I encourage people to blog and tweet and what have you, as long as you acknowledge where you got it from.

So today I'm going to be doing a lecture on Galaxy, and after that we'll have a lab on Galaxy where you'll be using it. How many of you have ever used Galaxy? How many of you have heard of it but have never used it? So you'll get to use it quite extensively today.

This is my email address, so if you have questions later on after the workshop, or later tonight as you're going home, feel free to ask. You can also follow me on Twitter; I've been posting some things on Twitter under the hashtag for this workshop, informatics of high-throughput sequence data '15, and I put up the class picture and stuff like that.

And for those of you who are ever going to invent or name a software package: using a name that's very, very common on the internet is not a really good thing. If you Google "Galaxy" you will not find Galaxy; you'll find other galaxies, but not this one. If you Google "Galaxy bioinformatics" you'll probably find it, but the hashtag that the Galaxy community actually uses all the time is "usegalaxy". So if you Google "usegalaxy", you'll find lots of resources about the Galaxy bioinformatics software package. 
So before I get started, I also want to put in a disclaimer: I don't profit in any way, shape or form from any of the brand products I may mention. I may mention Amazon, I may mention Oxford Nanopore, whatever; I don't profit from any of these companies. I am on the scientific advisory board for Galaxy, but they don't pay me, so I do that for free.

So the outline today: I'm going to talk to you about workflows and how putting things together is important for the reproducibility of science and so forth. I'm also going to show you the various variants of Galaxy and how we can put things together. I'm going to give you a working example, actually quite appropriate for this class for some of you, of an RNA-seq pipeline. We're not going to run the RNA-seq pipeline, but I'll show you how to use it and how you could do some of the things you did earlier this week in Galaxy, and then we'll have the lab.

So now I'm going to take a step back and put this in an appropriate context. What do biologists do? Biologists like to make observations; they like to make hypotheses, test them, challenge them, conclude things. And, well, they don't necessarily like to write papers, but they have to write papers. I don't like writing papers, but I write papers. 
So obviously we do things on RNA-seq, we do protein mass spec, we do interaction and pathway analysis (and we have a whole separate workshop on that), and basically we're trying to understand biology. As you all know, the central dogma of biology is DNA makes RNA makes protein, and then you write a paper about it. I did spend five years at the NCBI, so I do have a biased view of the world and of how biomedical information should be represented. This is one of the central things that NCBI does: they try to link the publications with the biomolecular data. It's really central to how you can explain how things are, and a lot of the metadata, the data about the DNA, the RNA and the proteins, is actually in the papers. So if you have an easy way of going from the sequences to the papers, then you have a chance of understanding what was done and how it was done. This is really at the core of many of the things about reproducibility in science.

For example, if we're trying to understand how things go in the cell, some of the things we may have to do are wet lab bench experiments. But I like to think of bioinformatics as a dry lab experiment: the same way you think about doing an experiment in the lab, you should think about doing an experiment on your computer. I'm sure all of you who do wet lab work try to make sure that whatever notes you take make it possible to reproduce the experiments you do in the lab. Likewise, we want to be able to reproduce the experiments we do on the computer, and that's really an important thing. For example, a classic bioinformatics experiment would be to take a sequence, do a BLAST search and look at the alignment. And so you have
reagents: you have your sequence that you're querying with, and you have the database you're searching against. You have a method: maybe you're doing a protein-protein comparison, or a translation of a nucleotide sequence against a protein database, or vice versa. You get results, from which you have similarity scores and so forth, and you're testing a hypothesis. So it has all the makings of an experiment. And so it's important to know your reagents, it's important to know your methods, and it's important to do your controls.

So what's an example for the BLAST search? Who has not done a BLAST search here? Which is okay; those of you who haven't done one don't have to answer this next question, but the rest of you do. What's an example of a control you could do for a BLAST search? No, no, actually give me an example of what you would do as a control. What's a control of an experiment in a BLAST search? Yes? Yeah, so what's a control? What control are you doing? Why do you do controls?

Let me take a step back here; I think maybe my question is too complicated. It's a very simple question. When you do an experiment, you get results, and you want to make sure that the results you've got are because you're interrogating the right thing, that what you're seeing is not happening by chance, and that it is a result of the perturbation, of the thing you did in your experiment. Right? So let's say you grow cells and then you treat them, and you're trying to see which RNAs are coming up. What's an example of a control in that experiment? Scramble the sequences? Scramble the sequences, or you could just do the same treatment but omit the drug: you have the solvent that your drug is in, for example, but you omit the drug, and so you expect not to see anything, right?
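
As a rough illustration of the "scramble the sequences" idea applied to a sequence search, here is a minimal Python sketch; it is not something shown in the lecture. It shuffles a query so that its composition is preserved but its order is destroyed. Searching the database with the shuffled copy should return no strong hit if the hit for the real query is meaningful. The function names, the example sequence and the seed value are all made up for illustration.

```python
import random


def shuffled_control(sequence: str, seed: int = 0) -> str:
    """Return a shuffled copy of `sequence` with the same residue composition."""
    rng = random.Random(seed)  # fixed seed so the control itself can be reproduced
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def to_fasta(name: str, sequence: str) -> str:
    """Format one record as FASTA: a '>' header line, then the sequence."""
    return f">{name}\n{sequence}\n"


# Hypothetical query sequence, used only for this example.
query = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
control = shuffled_control(query, seed=42)

# Both records would be searched against the same database with the same settings.
print(to_fasta("query", query), end="")
print(to_fasta("query_shuffled_control", control), end="")
```

Recording the seed alongside the results means the same control sequence can be regenerated later.
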
So that's an example of a control. So I'll repeat my question now: in a bioinformatics experiment, what kind of control can you do? Let's say in a BLAST experiment: if you're doing BLAST, what's an example of a control? Sorry, a reference sequence? I'm not sure I understand, but one example could be that you go look for a sequence that you know is in the database, and if you don't see it, then that's a problem, right? Likewise, if you know something is not in the database and you do a search and you do find a hit, you know that's not it, because it doesn't exist. If you've got some random sequences and you search a database and you find a 100% match, that's probably a false positive, right? So that's an example of thinking about controls in a bioinformatics experiment. It's the same as what you have to do in a wet lab: you have to take into account your assumptions and test them, basically.

So one of the big common themes between wet lab and dry lab is the ability to make things reproducible. That's really critical: when you do an experiment, if I were to repeat the experiment, first of all I would know how to do it, and second of all I would get the same answer, right? If every time you do a search you find a different sequence, there's something weird going on.

Likewise, I'm involved in a large-scale analysis of cancer genome data right now, and we're actually doing lots of alignments of sequences using BWA across many computers around the world, and these are all separate, different instances of these computers. One control is to make sure that the computer doesn't modify your result. So you want to take the same data and put it in, say, six different clouds around the world, and you want to make sure you get the same alignment. We actually did that control and it worked,
but initially it didn't, because BWA actually has some randomness: if an alignment can go in one of two places with the same quality score, one time it puts it here and the second time it puts it there. Then when you check to see if the two alignments are the same byte by byte, you actually get different answers. There's a way of pushing BWA to always go to the first one it finds, and then it does generate exactly the same answer. But these are things you have to think about: okay, if I'm doing this twice, or if I'm using two different computers that have different amounts of memory or whatnot, is that going to affect my result? So it's really important, when you do a bioinformatics experiment, to actually track everything about your experiment: the kind of computer you're doing it on, the version of the software tool you're using, which parameters you're using and so forth. Those are some of the things that people have to keep in mind here.

So some of the important things for many of these pipeline tools: it should be open source, which takes the black box component out of it. If it's got a large user base, it's more likely to have things figured out, and that makes the code base more robust. If there's a big user community supporting it and helping each other, that's very useful. Flexibility, being expandable, is really useful for a pipeline-type tool. If it can work on the cloud, that's good; I mean, you've seen the advantages of working in the cloud this week. And ideally, especially for biologists, if it's user friendly, so much the better.

There are several tools like that. One that you've used this week, for example, would have been R and Bioconductor. R at its core has the ability to trace and capture every step, every parameter and so forth that you do in an
experiment, and Robert Gentleman in this paper argues that every figure of every paper you've ever done should come with an R script that generates the same figure. So you should have the script included with the paper, so that if you use that script with the data that's in the paper, you'll be able to reproduce a figure that's in the paper exactly. If that's available, that makes the whole thing reproducible. So it's a really core thing about this paper.

Another recent paper, actually from some of the folks that did Galaxy, is about 10 simple rules on reproducibility, and with those rules sometimes you have to think, you know, am I going to be able to do that?

So, for every result, keep track of how it was produced. I mean, it's sort of basic. Bench scientists have no qualms about that, they take notes; likewise, for bioinformatics experiments you should do the same thing.

Avoid that step in the middle where you have to go do something by hand to make it go to the next step. So script everything if you can, or if you can't, write detailed notes of what it is you did at each step so that you can reproduce it.

Archive the exact version of all external programs used. If you need a special version, or whatever version, of Python or Perl or whatever it is you're using to run a script, you should make sure you have a copy of it. So when somebody a year from now says, "Well, I'm trying to do your experiment and it's not quite working like you did in your paper," you can say, "Well, which version of Python were you using? I was using blah blah, so you should use this one, and here's a copy of it." If you have that available, then you can make that possible.

Record all intermediate results, when possible in a standardized format. That's pretty hard, actually, because bioinformatics loves to invent
formats and to tweak formats and to modify things and so forth. So that's the goal, but sometimes it's a bit difficult. GFF has I don't know how many versions, and people say, okay, well, write a GFF file, but they actually don't know which version of the GFF file format you've written, and then you go and use it and it's not quite working because it's got an extra column and so forth.

As I mentioned earlier, for analyses that include randomness, note the underlying random seed. Usually a program like BWA does have a random seed, so you can fix it, which then makes it possible to reproduce the data exactly.

Always store raw data behind plots, like the example I gave you with R.

Generate hierarchical analysis outputs, allowing layers of increasing detail to be inspected. I'll show you an example of that in Galaxy. Basically, as you may have done this week, you could have things in directories within directories, so you keep the hierarchy of how you obtained the data as the experiment moves along.

Connect textual statements with the underlying results. That's another feature of Galaxy: throughout all your work in Galaxy, you can actually write a note and say, "In this part I modified this parameter because it works better, blah blah blah," and so forth. Often Galaxy invents names for a step or for an output file; it's probably always a good idea to go back and put in not the automated version of the output file name but one that makes more sense to you, so you'll be able to understand it later on.

And the most important one is to provide public access to scripts, runs and results. That means that if somebody reads your paper and, for example, you've detailed all this information, they will be able to reproduce your experiment, because they will have all
the reagents, all the things you used to do your experiments. Okay? So the same way you would describe which vendor you got this restriction enzyme from, or where you got the cell line you're using in your experiment, you have to give those kinds of details about software things as well.

So I'm going to talk about Galaxy, but Galaxy is only one player in the world of pathways and pipelines and pipeline-organizing tools. Another one, which is less user friendly but more robust for larger-scale work, is called SeqWare, which is the one we use here at the OICR in our software group. It's not for the faint of heart, but it is much more robust and much more scalable than Galaxy can be, although Galaxy has made a lot of improvements in the last year, so that you can, for example, run the same Galaxy pipeline against a thousand files if you want to. That's possible now in Galaxy.

So the solution I'm going to talk to you about today is Galaxy, and there are lots of papers about it. There's this one in Genome Biology and this one in Current Protocols. All of these are open access documents; even though Current Protocols is more of a commercial product, this specific chapter was actually made open access.

There are several ways to start with Galaxy, and there are several versions of Galaxy out there. The homepage for the Galaxy project is galaxyproject.org. The homepage for the Galaxy public server, which is the one referred to as Main, is at usegalaxy.org. You can also download it onto your own server, there's a cloud version, and there are actually a number of other public versions as well. So which one's the best for you? (I'll just stay by the microphone.) If you have moderate-size datasets, you can use the public version. The public version actually is
quite good, in that it has a few hundred CPU cores behind it and it has an up-to-date list of tools. You can think of a Galaxy instance as a place where tools have been installed, and Galaxy Main, usegalaxy.org, the public one, has most of the most recent tools installed in it, so that makes it a very convenient one.

One challenge with the public server is that it's public, and that means there are lots of people using it. Galaxy right now has about 50,000-plus registered users. Fortunately they aren't all there right now this afternoon, but there are 50,000 people who have registered to use Galaxy. They give you, if I remember correctly, a couple hundred gigs of free space, so you can have 200 gigs of files up there if you want. It is considered your private space, but it's not impossible for people to break into it, so if you have confidential data, especially, let's say, human data, it's probably not the best place to put it.

One solution there that people have come up with: they want to use Galaxy, so they install it on their own personal server. As I'm sure you've been told a few times this week, personal servers are not necessarily the safest computers. I consider Amazon more secure than most university servers, by a landslide. That said, as we saw this week, you can configure things in a way that makes them very open and very publicly accessible. We did that this week just to make it easy for you guys to use the resources, but if you're worried about security and about protecting your data, then you have to be mindful of all the security documents. I think Malachi and Obi earlier this week pointed to a document, in the RNA-seq workshop, for those of you that were there; they pointed to a
document that highlights how to configure your server to make it more secure. So, for example, if you have really large datasets, Main is probably not the best idea.

As I mentioned, galaxyproject.org is the home page. From there you have access to using Galaxy, or getting Galaxy and installing it on your own server, plus lots of links to tutorials, and there are lots of videos and so forth. There's also a wiki page that has lots of ways of getting involved and of understanding Galaxy.

Malachi and Obi, and I think Michelle as well, also talked about Biostar. One of the Galaxy PIs is at the institution in the US where the Biostar software was started, so the Galaxy folks have their own private instance of Biostar with all the Galaxy material on it. You're not searching all of Biostar; you're searching the Galaxy version of Biostar. So they've limited it to that dataset, but they have all of their material available there, and it's quite rich, with lots of information. Likewise, if you go there, you should look for the answer to the question you're thinking about before asking it yourself.

So this is usegalaxy.org; this is where we'll be going later today, and I'll spend some time describing the various parts. This is Get Galaxy, the page to get the software to download and install it. There are a couple of different ways of installing it. The two main ones: on a single box, so you can actually install it on a laptop or on a more robust server; or you can install it in front of a cluster, so that it can be configured to use all the power of the cluster. For an institution like a university, it actually makes quite a lot of sense for them to install an
instance of Galaxy for their user base and their bioinformatics community within the university, and many of them have done that. I'll talk about those a bit later.

CloudMan is the software infrastructure to use Galaxy in the cloud. You can get an AMI, an Amazon Machine Image, that has Galaxy in it, so you can use Galaxy through its graphical user interface, but it's actually running on Amazon. There are ways of doing that, and there are other cloud providers, non-commercial and academic cloud providers, that also offer Galaxy.

As I mentioned, there are 60-plus Galaxy servers out there, at various institutions that are making their data and their software available through a Galaxy instance. The challenge there is that each instance will be put together to address the needs of a certain community, so there might be some weird organism or some specialized software. There's a mass spec Galaxy instance, for example: people that do mass spec analysis can go onto that server and they have mass spec tools. Most of the tools there are specialized for that area of bioinformatics. But that's a good place to go and see if there's one, or more than one, that has the kinds of tools you need. I know there are a few that do RNA-seq analysis, for example.

One of the great things that Galaxy does is it integrates inputs from various data sources. On the left panel you have all the various inputs, from a number of sources, in addition, of course, to things coming in from your own computer. Likewise, usegalaxy.org, the main one, has a number of tools that are pre-installed, so you don't have to worry about installing the latest version or what have you. They're there now, and you can just select and use them
and go about doing your experiments. Although, as you saw this week, installing tools is pretty straightforward, sometimes, if you don't have the environment right, you can spend many hours figuring out which library you're missing and so forth. The Galaxy folks are taking that concern away from you.

One of the big things that Galaxy does, and it's really the main reason why you should use Galaxy even if you're a command line pro and you love the command line and you never want to touch a graphical user interface: at the end of your experiment, it's actually a really good way to share your parameters and your tools with the community. If you do write a paper, you can put the whole workflow in a Galaxy page-type unit, publish that, make the URL public and include it in your publication, and you tell people, "This is how I did my experiment." Your Galaxy workflow can take into account the data, the tools, the parameters, the versions, all those things, and how you did the various steps. So if you capture that in a workflow, then you're able to share it with the world, and it makes your paper much more reproducible.

There are even journals that will store that for you. Journals that specialize in big data, like GigaScience and whatnot, will host your Galaxy workflows: they have a Galaxy server and they'll put your workflow there. But there are lots of different ways; for example, using the main server, you can make your workflows publicly available there.

So, like I said, you can publish an experiment on Galaxy. Galaxy is about, I'd say, 10-plus years old, so it was actually started before next-gen sequencing was on the landscape, and a lot of the tools in Galaxy have nothing to do
with next-gen sequencing. But it was powerful enough and robust enough, and they did the tweaking they needed to do to make it work, and now the bulk of the work that Galaxy does is actually NGS data. So the things you've been doing for the last four days are basically all possible within the Galaxy Main format, and it's also available in the cloud, because if you want to have more horsepower behind it, that's a really convenient way of doing it.

I was going to say I'm not a Galaxy evangelist, but I think I am. I think the guys are doing a great job, so yes, I am an evangelist. I don't like the term "evangelist", but anyway, that's another discussion; you get what I mean.

One of the core reasons why things developed the way they did in Galaxy is that they believe in reproducibility. For the two PIs, James and Anton, that's what drives them every day: making a better tool that makes things reproducible and makes it possible for people to come back several years later and repeat that experiment. So it's very good at keeping history, to the point where, if you repeat a step two or three times because you got it wrong or something, it keeps all those steps. So sometimes you have to go clean things up eventually, because it keeps a lot of things for you.

It's also very easy to do something and share it with one colleague, or share it with the world, or share it with 10 colleagues or whatever, and notify them that things have changed and so forth. So if you're working in, say, New York and you have a colleague in Vancouver, or vice versa, then you can email each other: "Okay, I updated the pipeline; do you want to have a look at it?" Then they can log in and see the pipeline that you worked on, and so you can work on it that way. If it wasn't
clear yet, and I think this one's not much of a surprise: Galaxy was designed with the biologist in mind. It's not for computer scientists, it's not for software developers, except if you're developing modules to put in Galaxy. It's really meant as an easy user interface to go do things. It's a little frustrating, I would say, especially for people like yourselves this week, because now you know how to quickly go grep and count how many lines have such-and-such a string in them. There are ways to do that in Galaxy: you have to pull down some menus and write down some words and tell it to count the second column and so forth. That is all doable through Galaxy, but today, now, not on Monday but today, you can do it much more quickly at the command line. Yes, Michelle? Wednesday? For this class, but some of these people were here on Monday too. So yes, it's Monday or Wednesday, but yes, thank you, Michelle.

Galaxy also has a healthy developer community. It's an open source package, so there are lots of people working on it, and one of the things they encourage is for tool developers to actually develop a Galaxy version of their tool. Say I have a new aligner for short reads: one of the things I could do to have more people in the community use it is to wrap it in Galaxy, so that people would then see it in a Galaxy menu: "Oh, there's my favorite tool." And this tool could be a C program, it could be a Perl script, it could be anything. R, for example, is part of Galaxy, so it could be an R script and so forth.

It's an NIH-funded project and it's well supported. It's actually now an NIH U41, I think it is, so it's a sort of community tool support, and it's going to be around for a while. So one of the challenges
with the multiple sites I mentioned, where they have variants, is that not all Galaxies are created equal. What they're going towards, and it's happening more and more often now, is to basically ship Galaxy as an empty shell. usegalaxy.org is not an empty shell; it's got all the tools and then some. But you can download a Galaxy that's basically tool free, and what you then need to do is go to the Tool Shed and get your favorite tools. There you have this menu of Galaxy-wrapped tools. This is what I was just referring to: a place for developers to put tools so that others can install them in their own version.

So this is not when you're running the usegalaxy.org version, and not the AMI that's on Amazon, although you could do it there as well with a little administration. This is if you're going to install your institutional version of Galaxy: you'll then go to the Tool Shed and get your favorite tools. You can get all the tools, or you can get specific tools, for example for dealing with NGS data or with RNA-seq data. If you look up "sam", then you'll see a bunch of tools related to SAMtools and whatnot that read SAM files and so forth. So it gives you an idea of how that could work.

The general workflow in Galaxy is that you log into the system, and when you log in, your history of what you've done is remembered. If you log back in, it remembers you and keeps track of the things you've done, until you clear it. You can use Galaxy without logging in, but if you log in then you get a bunch of perks, one of which is the history and memory of what you've done, but also the ability to share with other people, because it has to know who you are; it's a peer-to-peer sort of exchange of information. You then get
data, or you upload your own data. So you can get data from UCSC, for example, you can get data from various sites, or you can have data, like I said, on your own computer. You then manipulate your data: you do experiments, you repeat the experiments a few times, and then you save your output. Then all of that becomes a pipeline, a workflow, and then you can publish the whole pipeline, which includes the data, your own data or data from a server, and you can put that into a Galaxy page. So Galaxy in the cloud actually looks exactly the same, except that, as with any other Galaxy, you'll have different tools present on the server. So this is an example of some of the tools that are present in both the public server and the cloud version, and some of the tools that are present in one but not in the other: only in Galaxy Cloud versus only in usegalaxy.org. So I find that the Galaxy Cloud version, and I haven't started it recently, seems to be a bit behind, not having the same up-to-date tools as usegalaxy.org has. And so all the tools in Galaxy have a short description of what it is they do and what they're used for. The left panel in the Galaxy page is where your tools are, and your data, and that list is so long it's actually collapsed, so it's sort of compressed a little bit when you look at it first. When you know where to go to get whatever it is you need, that's fine, but often the best way is to just search for it. So you know you're looking for, what was the one you did today, FastQC, which does the QC on FASTQ files, and that tool is there, so you look that up and you find it right away. So, like I said, there's another example of tools that are different on the different servers. So one of the great things about Galaxy is that it's fully
integrated with the UCSC Genome Browser, and the UCSC Genome Browser is fully integrated with Galaxy, so they can send things to each other. So you can do a search, let's say, for all the genes in a certain version of the human genome, and you can output that as a GFF or GTF file to Galaxy, and vice versa. So it's really sort of useful, and that makes it simple to look things up and get data sets. So if you want a part of a chromosome, that's very easy. UCSC outputs things, as you know, in a graphical output, but it also likes to output things in table format, and that's actually the format that Galaxy likes to consume. So if you want to generate, let's say, all the SNPs from chromosome 22, and you want that in a format that can then be brought into Galaxy, that's possible. Another thing that's possible is that Galaxy talks to IGV, so you can have things in Galaxy and then see the output in IGV, and I'll talk a little bit more about that later. So these are just some examples of the file formats that UCSC likes to output. FASTA you know very well: it's basically just a greater-than sign, some descriptor, and then a sequence, nucleotide or protein. BED format, which is the Browser Extensible Data format, is used by the browser to represent various things in various ways. So again, if you output chromosome, start position, stop position, some ID, some score and so forth, then you can output it in BED format. GFF has, as you can see, I can't even see my computer, so it has the sequence name, the source, the feature, the start, end, score, strand, and frame. So that's sort of the standard GFF format, with the specific versioning and so forth explained on that page. And GTF is basically a simplified GFF that only deals with coding sequence, so gene features, and has an extra field, which
is the name of the identifier; an example would be an Ensembl ID, so that would be part of the GTF file format. So, as I mentioned, one of the big things that's quite special, even if you do your whole experiment from the command line, is that at the end, to be able to reproduce it yourself in Galaxy would be quite the accomplishment, and that becomes a great way of sharing it in the publication. And so if you go through Galaxy, you can actually look and see published pages there. So obviously the Galaxy staff have put up many pages there and made them available for you to reproduce the things that they've done; they've been involved with a number of papers, and so they've included those in there. But there are also people that want to make things public from their own publications and make them available on these published pages within Galaxy. And what you get is that you can crowdsource: if there are certain pipelines and certain workflows that you like, then, if you're logged in, you can star them and say, okay, yeah, that one's really good, I'm going to give it five stars; that one is not so great, I'm going to give it a couple of stars. Then you can see other people's ratings, and you can see, oh, there appears to be this great RNA-seq workflow, for example, that the community likes, and so maybe that's a good one to look at. And of course you then get certain people that are doing great workflows, and they get badges and, you know, they become superstars and whatnot. Jeremy is actually one of these people. He's part of the Galaxy team as well, but he's really keen, and he's done a lot of work on RNA-seq. I don't want you guys to do this right now, but I'm just going to give you an example of an RNA-seq analysis. And so what they've
done here is that Illumina has actually put out this RNA-seq data set, you know, a good sort of data set for teaching and whatnot, from many body parts from the human body. And so they have things from a 500 kb region of chromosome 19 that has a few million RNA-seq reads. So those are the body parts where it comes from: it comes from brain and from the adrenal gland. So for those of you who are not biologists and need to know what the brain is, or the adrenal gland, this is your map. So what you first do is you log in. If you've never logged in before, then you have to register. And actually, for this part, put a note in your notebook right now; if you're using the paper version, just bend the corner of the page. I'll ask you to do that before we go on coffee break. So after my lecture we'll go on break before we start the lab, but if you've never logged into Galaxy, I'll ask you to do that before the coffee break, and then when you come back from coffee break you'll be logged in. So if you're a new user, then you have to put in your name and email address and so forth, or if you're a returning user, then you just log in. Once you're logged in, if you click the User tab it will look like this, and you'll see things like saved histories and saved pages and so forth, and also the logout button. So that's an indicator that you are logged in and you have access to all that information. So in this sample I'm explaining to you right now, it's RNA, so paired-end reads: one set of paired reads from the adrenal gland, the other from the brain. And so the first thing you want to do is get data. Get Data is the first item on the left panel, and you
basically copy a URL, and that's one of the ways to get data into Galaxy. And then Galaxy has this very convenient sort of shading. So when you enter a command, initially it's gray, which means it still hasn't gone to the computer yet, to the server; it's on its way. Yellow means that it's there but it's queued up to be done, so it hasn't done anything yet. And then green means the job is finished. Red is not good, as you can imagine: red means you gave the wrong argument, the file didn't work, the process in question didn't work; there are lots of different reasons why that could happen. And blue usually points to a data file. And each of these commands always has three things next to it: there's an eye, a pencil, and an X. I refer to this as "poke the eye": if you click on the eye, that's a way to go look at the data. With the pencil you edit the attributes; that's a way of taking notes, where you can add notes about that step, or you can rename the files that are going to be the output or the input and so forth. And with the X you can delete a step. And here I put an arrow just to say that if you did a step three times, the numbering shows it: so if this goes one, two, four, seven, it means that I'm not showing you the steps in between that failed, or whatever. And so these numbers on the left-hand side are just the chronological order of your steps, but they may be different for you; these may vary with usage, basically. So if you poke the eye while looking at a FASTQ file, then you'll get to see the FASTQ. So in the middle pane you'll see the data file that you're looking at. If it's a text file, it'll show it to you; if it's not a text file, it'll either encourage you to use a graphical user interface or not show it to you because
it's binary code or something like that. The Edit Attributes button: so here I renamed this file to sort of brain 1, FASTQ Sanger, and this is where the original file name was, and so I'm keeping that there and so forth. So this is really to sort of reorganize. I don't delete anything that's usually here; I usually move it to the notes or whatnot. But this is where I make sure the file name is going to be something that I'm going to recognize later. It won't be like "results of experiment of step two", which would be an example of how the machine would automatically name that file. So it's not going to be "results of step two"; it's going to be what you know it to be. That way, later on when you need that file, you're not going to look for "results of step two"; you're going to look for the file of the body part, or the sample, or the experiments you've done, or whatnot. So in general with tools, you'll read in the file, and it will usually show you the files that are compatible with that tool. So it'll protect you a little bit, so that you only do things that you can do. It's not perfect logic, but it works most of the time. So if you can't see a file, that means there might be some formatting problem. FastQC we saw earlier today, so it's available through Galaxy. There's a text version of the output, there's the graphical output version that we saw, and you can also delete it if you used the wrong file name or whatever. And then here's an example of a file name: it's "FastQC on data 8, raw data". So it's an example of the file name quickly getting sort of confusing and not very informative, so then you would use the edit button and you would rename that file. So these are all the files that have been groomed. The grooming step in Galaxy is often about changing from the old Illumina standards to the new Illumina
standards, and it's like a byte shift, because they're using an offset of 33 versus 64 for the quality codes, and so they just shifted the letters up. And so some tools only work with the old ones and some with the new ones, so you have to groom your files to allow the tools to be used in a specific pipeline. For example, removing bases, because you want to remove things that are of lower quality, is possible. Here's an example of using TopHat. TopHat is one of the tools that's on the server, and for this experiment we know that there are 110 base pairs between the reads, and so that's the parameter that you would need to put in. All of these things are sort of clickable or enterable in these forms, and then you run this. This actually takes about 30 minutes to run on these four files on the public server. So it's yellow, then it goes green when it's done, and then you have an example here of all the output from the TopHat experiments. And then in Galaxy you actually have two sort of graphical viewers. You have a sort of standard one; well, it's not standard in a sense, it's Galaxy-specific, but it's standard in the way that you're used to seeing the data, which is basically based on tracks and linear coordinates from left to right, and so you can see RNA or whatever experiments you're doing. Galaxy also has a circular plot viewer, like for the things that we did earlier with some of Guillaume's data. So that's the sort of standard viewer. And then it has the ability to share, so, like I mentioned, you can publish and share with your colleagues or with the world. So you can share with one or more colleagues, or keep it private, share with a group of people, or share with the world, so it's sort of the classic Unix three types of groups. And so
you can share a history, and you can also extract a workflow. So, like I mentioned, when you do experiments, sometimes some steps work and some don't and so forth. So when you extract a workflow, it extracts all the steps, and then you go and select which ones you want to keep in your workflow: these are the ones that I want to keep, these are the ones I want to rerun on the next data set, and these are the ones I want to share with the world or with my colleagues or what have you. It also has a workflow graphical user interface, so you can go edit workflows this way. Basically, the curves that link the boxes together are data sets, and each box is a step you're doing on them, and so forth. All of these things come out once you extract the workflow, but you can also then go edit this, and you can change file names and so forth, so it's quite flexible from that point of view. So as you start getting familiar with Galaxy, and you think it's a cool project and so forth, and you want to share with your colleagues, remember there are lots of videos, tutorials, mailing lists, Twitter feeds and so forth that allow you to look at stuff. For example, there's the Galaxy Project Vimeo channel. So this week, the last two days, you looked at variant calling, structural variant calling and so forth. There's a Galaxy tutorial on ChIP-seq, and we're actually going to introduce a workshop on ChIP-seq and epigenomic analysis, probably starting next year, so a new workshop that's in the plans right now. But if you already have some data sets you want to work with, maybe for ChIP-seq, you can have a look at the Galaxy video and then you can get started right away. Jeremy's also got a page on how to use Trackster, which is this graphical tool which makes it quite
simple to use. This at the end is actually another advisory board I'm on. It's GenomeSpace, which is another tool out of the Broad that links up many tools together. So I mentioned UCSC, Galaxy, Cytoscape, and IGV, which is another tool from the Broad, and many others are basically all hooked up. So you can go to GenomeSpace, and in there they've got tutorials on how to hook up that tool to that tool. Basically, what GenomeSpace does is it doesn't tell you how to use a tool; it assumes you know how to use tool X. But what it shows you is how to get data from tool X into tool Y, and the steps. And basically they've added each tool through the GenomeSpace interface or through the tool, so now the output of one tool becomes the input of the next tool, and it makes that very convenient. So, other useful resources. For Galaxy there's a Twitter account, and there's support: the Biostar I mentioned for usegalaxy is available. There's Biostar itself, which of course has lots of things as well, and if you follow them on Twitter, there's a pretty heavy feed of all the things that come in on Biostar. I did not talk about OpenHelix, but I think they're still around. They're a commercial company that makes support material to help with bioinformatics, and they actually had a contract from the UCSC browser to make all the material for training on how to use the UCSC material. So they make all of that free online, but they have other material that they sell. UCSC itself also has a number of tutorials online. We mentioned Biostar; the other big one, and it's of particular interest to this class, is SEQanswers, which is basically a discussion board for everything related to next-gen sequencing. So it's a really good place to go and look if you have questions about next-gen technologies. That said, I think
the best one right now is Biostar, and it's the one I encourage you to go check first. So these are some of the papers I mentioned throughout my talk; I encourage you to look at them. And now we're going to go on break.
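The grooming step described in the talk, where the quality letters of a FASTQ file are shifted from the old Illumina encoding to the newer one, can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: it assumes the old files store each quality score as an ASCII character with an offset of 64 and that the target (Sanger-style) encoding uses an offset of 33. The function names are hypothetical, and this is not Galaxy's own groomer.

```python
OLD_OFFSET = 64  # assumed offset of the older Illumina quality encoding
NEW_OFFSET = 33  # assumed offset of the Sanger-style quality encoding


def phred64_to_phred33(quality: str) -> str:
    """Re-encode one FASTQ quality string from offset 64 to offset 33."""
    converted = []
    for ch in quality:
        score = ord(ch) - OLD_OFFSET  # recover the numeric quality score
        if score < 0:
            # A negative score suggests the input was not offset-64 data.
            raise ValueError(f"character {ch!r} is not valid for offset 64")
        converted.append(chr(score + NEW_OFFSET))
    return "".join(converted)


def groom_record(record):
    """Groom one FASTQ record given as (header, sequence, plus, quality)."""
    header, sequence, plus, quality = record
    return (header, sequence, plus, phred64_to_phred33(quality))


if __name__ == "__main__":
    # Hypothetical four-line record, for illustration only.
    old = ("@read1", "ACGT", "+", "hhh@")
    print(groom_record(old))
```

The sequence line is left untouched; only the quality line changes, which is why a tool that expects one encoding can misread files written in the other.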