So, the usual slide, which I explained at the beginning, so I don't have to explain it again. You're free to copy, share, take pictures, blog, blah, blah, blah, as long as you give attribution. And I have a disclaimer, which actually all faculty should have, especially Mike Stromberg, and he will next year: I will not profit in any way, shape or form from any of the products, brands, and whatnot that I may mention. That's my email address, that's my handle on Twitter, that's the hashtag for the workshop. I should have given this on day one, but I didn't, sorry. And if you want to follow what Galaxy is doing, this is the hashtag for Galaxy: usegalaxy, so pound-usegalaxy on Twitter. You can follow what all the people that use Galaxy are saying about Galaxy, so that's a good tag to follow.
So one of the big things behind the philosophy and the design of Galaxy is reproducible science. The idea is that it's really important for science to be reproducible: when scientist one does experiment one, he or she can hand it off to scientist number two, and they should be able to reproduce it and get the same data. So when you're talking about bioinformatics, what would be the main reasons why an experiment would not be reproducible? Software? That's one, David. No more from Michelle, but that's a good one. Yes, yes, so that's a good one. Can't find the data. Can't find the data. Yeah, that's a good one. The methods, the exact methods, yeah. The sample, the data, the sample is different. Another one would be that the software, the so-called code that was made available, doesn't compile. That's actually a very common one. You know, you have software, oh yeah, I use this package in Belva. And so you have the same version, but you didn't use the same parameters.
So there's little flags, you know, command-line arguments, to use. Galaxy really tries to capture all these things. There are ways in Galaxy to capture the data, the tools, the versions, and everything, all the metadata, as we call it, around experiments. And so it really facilitates reproducibility of science. Galaxy hasn't figured out how to do it with the biological samples yet, but they're working on that.
So I'm going to talk a little bit about that. I'm going to talk about Galaxy, the various forms of Galaxy, the interface, getting data in and out of Galaxy, processing data. And in the lab, Emily is going to run a next-gen sequence analysis on Galaxy, which is actually going to be the same lab you just did, basically. So we're going to do the same RNA-seq analysis in Galaxy, and you'll see the pros and cons of Galaxy versus, and so forth. One of the questions that may be on the survey, and we're going to ask you after you do this next lecture, is: would a next-gen sequence analysis workshop done only in Galaxy, the whole workshop in Galaxy, be something that would be useful? And another question that might be on the survey is: should we do a separate workshop just on RNA-seq analysis, for example, and not have that as part of this workshop? Anyway, Michelle's writing the survey right now, so I'm planting some seeds.
So Galaxy is also considered a pipeline tool. There are many, many pipeline tools. We actually published one many years ago, which nobody uses anymore, which was very useful at the time. Basically, pipeline tools are developed to fill a gap: things that need to be reproduced, need to be reproduced in very similar ways, and so forth.
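To make the point about capturing tools, versions, and parameters concrete, here is a minimal sketch of the kind of metadata record a pipeline tool might keep for each step. It is not how Galaxy stores its histories; the class name `StepRecord`, the function `same_run`, and all example values are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class StepRecord:
    """One analysis step: which tool, which version, which arguments, which inputs."""
    tool: str
    version: str
    parameters: dict
    inputs: list = field(default_factory=list)


def same_run(a: StepRecord, b: StepRecord) -> bool:
    """Two steps are only comparable if tool, version, parameters and inputs all match."""
    return asdict(a) == asdict(b)


# Hypothetical example values, not taken from the lecture.
first = StepRecord("aligner", "1.0", {"max_mismatches": 2}, ["sample1.fastq"])
second = StepRecord("aligner", "1.0", {"max_mismatches": 3}, ["sample1.fastq"])

# Same tool and same version, but a different flag: not the same experiment.
print(same_run(first, second))

# Serialising the record is what lets scientist number two rerun the step.
record_json = json.dumps(asdict(first), sort_keys=True)
print(record_json)
```

The comparison fails on the second record even though the tool and version agree, which is the "same version, different parameters" case mentioned above.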
At the OICR, we use a pipeline tool called SeqWare, which is a less user-friendly, more command-line-driven type of tool, but it is a pipeline tool in the sense that it writes a metadata file and keeps track of all our pipelines and the arguments and so forth, so that we can reproduce things. But more importantly for us, it's so we can automate: if we've got a thousand samples to rerun, we're able to do it. And actually one of the things that the SeqWare group, and Ben and myself, are doing is trying to put SeqWare, which is a command-line tool, into a module, like a tool, in Galaxy. So we're going to put a pipeline within a pipeline, but that should be technically doable. And if you have a pipeline that works and you want to distribute it to people, that should be possible.
So, as you may have gathered so far, I'm a really strong advocate, and the CBW in general, we're really strong advocates, of open access, open source and open data, and so is Galaxy. Making things open from a literature, data and software point of view is really key to science. And I do bug our friends at Galaxy. I know the development team at Galaxy. Not all their publications are open access, and so I give them a hard time. But this one is, and it's in your binder, and it's a really good paper about the overview of Galaxy, the philosophy, and reproducible science.
So there's different ways of working with Galaxy. One of them is to use Galaxy itself, which is what we're going to do today. You can also get Galaxy from their website and run it on your own server at your own institution.
At the OICR, we actually have two versions of Galaxy. So there's the public one, and then in-house we have two versions. We have one as a standalone machine, and then we have another one that's behind the cluster, which is a different configuration and knows how to send jobs to the cluster. And it's one of Ben's jobs to make sure that that puppy is happy and working well. It's not quite the same configuration. There's some tweaking involved. But the website is really helpful, and the folks at Galaxy are very helpful. They're a well-funded group. They have quite a team of help desk and support and so forth, and they were willing to send somebody to OICR to help us install on the cluster. So they're really, really keen. On the website there are also lots and lots of tutorials, screencasts and so forth, and mailing groups as well.
So the main site for Galaxy, the main one you use, is basically usegalaxy.org. This is the public one. It has a cluster of about 300 cores behind it, and there's very generous disk space. It's not really intended for next-gen sequence analysis, but there's a lot of other things that Galaxy does. And as part of the lab today, we're going to go on to this website as well, not just the public instance of it. It's really very powerful and rich. Basically it's a Swiss Army knife of bioinformatics. On the left panel, you have all the tools and the different types of things you can use to analyze your data. On the right panel is your history: all the steps you've done, where you're at, the various datasets you get and so forth, they all show up on the right-hand panel. And the middle panel is what's happening.
So basically you're getting stuff or doing things from the left; you click on things, and that shows up in the middle panel. You run an application and then it goes to the right column. We'll do that in the lab. So how many of you have ever used Galaxy? Actually, let me put it the other way: how many have never used Galaxy? Okay, good. Should be fun. Huh? Sorry? That's why they're here. Yes.
So, which Galaxy? There's the main Galaxy, which is the one at usegalaxy.org. That's the main one. There's getGalaxy, so you can install your local one. And then you can use Galaxy on the cloud, which is also what we're going to do today. And which one is the right one for you? Well, it really depends what it is you want to do. And there are actually others as well. The "others" here represents other groups that have installed a version of Galaxy, which they have made public, and which have separate kinds of tools in it as well. That's what that last column is.
But basically, if you have moderate size files, so if you're in that gig size and meg size type of files, that's moderate. If you're in the multiple tens of gigs, or hundreds of gigs, or larger, that's starting to be more the traditional sequencing type of project, and that's more the cloud or your local version. And with the local version, of course, you can throw whatever you want at it. If you want to share objects with others, any platform is able to share things with others. If your computational needs are moderate, again, so you don't need 1000 nodes or what have you, the main one is fine. If you have large needs and large file sizes and so forth, the best version may be the cloud. If you have a really absolute data security requirement, the cloud is actually very secure. And of course, your local version will be very secure as well.
So when you download Galaxy right now, it actually comes preloaded with, like I said, a Swiss Army knife of bioinformatics tools. The Galaxy team is actually moving to a new mode of operation: when you download Galaxy, it's more or less an empty Galaxy from a tools perspective, and then you go to the tool shed and you get the tools you need. They've made it easy for people to contribute to the tool shed, or for their own developers to put tools in the tool shed, and that makes it easier for you, the administrator installing Galaxy, to just get the tools you need and not download the whole package. So this is like the whole page. This is the number of tools, in the right column here, for each type of tool. And this is what you see if you look at next-gen sequence analysis: along the page, it gives you the types of packages and what they do. So that's the idea of the tool shed, and you would get one of these tools.
So the way it works in a workflow is: you get data, you upload your data, you manipulate the data somehow, you save your output, and then you save your workflow. So you have the whole thing, which saves all of this. And then you can publish it into what they call a page, a Galaxy page, which basically captures the data and the workflow in one page, which you can then either share with your colleagues in-house or share with the world or whoever. You can explicitly name the people you want to share it with, or you can just make the whole page public. And Galaxy staff have made a lot of these pages public on their website, so you can see the data, the tools, and so forth that they've used to do things. So Galaxy is really an integrator of data sources.
It means you don't need to install and maintain a lot of tools, unless you're doing a local install; then, of course, you have to get things from the tool shed. You can maintain workflows, reuse them and share them, and publish experiments. And it's definitely in the next-gen space and also in the cloud space. The next-gen space was, I don't want to call it an afterthought, but Galaxy would do a lot of analyses that were basically: talk to the UCSC browser, get some annotations, get some sequences, do some analysis, manipulate the data and save that. Those were a lot of the things people were doing at first with Galaxy. Now the Galaxy team is paying much more attention to the next-gen space. Like I mentioned: reproducibility, keeping history, what you did and what you didn't do, what you forgot to do. Galaxy also makes it easier to collaborate with people down the hall or across the ocean.
And this is really at the core of Galaxy: it's really meant for biologists. If you're a tool developer and you want to put things in Galaxy, that's good. But it's not meant for the genome centers, for example, that want to process a thousand samples. It doesn't have that kind of automation. It's really meant for the biologists that are handling one, two, ten, twenty-plus samples. It's a clicking environment where you have to go get things, input files, process and so forth. With all the steps you'll see, if you put them into a pipeline you can automate a lot of things and save a lot of clicking, but you're still pushing one sample at a time. So it's really for small to medium scale types of analysis. And Galaxy, like I mentioned, is well supported.
All their stuff is open source and they've got really active mailing lists and so forth, so they're really doing well on that point. So they're helping biologists, with funding from lots of places, and actually I think this URL is out of date. I'll update that on the wiki. The one earlier on in my lecture notes is more up to date.
So, as I mentioned, on the left side panel are all the various tools, and the first one is Get Data. From within Galaxy it has links out to many data sources. Of course the main one is that you can upload a file yourself: you have it on your computer, or you have a URL, and so forth. That's one way of doing it. But here are basically the fly people and all the various model organisms, human, UCSC of course, which has 15 or 20 different organisms, ENCODE, modENCODE and so forth, WormBase and all of them.
And then a lot of the file handling, the Unix-y stuff we've been doing the last couple of days, is available through basically a click-through: how to cut a column from a file, how to cut a row from a file, how to merge two columns into a third file and so forth. All that kind of stuff we do with very nifty one-line Perl scripts, Galaxy can do for you with a push-button operation. How to change case: you don't care if they're repeats, or you want to make them all the same case. That kind of stuff is relatively easy to do in Unix, but is very, very easy to do with Galaxy. Joining files, base coverage, statistics and all that kind of stuff is also available. Like I mentioned, a lot of next-gen tools are now available, and there are more tools than are actually shown here: FASTQ manipulation and so forth. So more next-gen sequencing tools: mapping, BWA, Cufflinks and so forth.
All these things are now entering Galaxy. So one important thing to do whenever you start an instance of Galaxy, be it on the web, on the cloud, at UPenn: you should always log in. The main advantage you get when you log in is that you're able to have your histories and share them, along with the datasets and so forth that you have access to. So that's a thing to remember all the time: once you start Galaxy, log in and then do things.
As I mentioned, one of the main things is the UCSC browser, and I'm sure many of you have used the UCSC browser. It has many eukaryotic genomes, from yeast to human. All the annotations we know about are usually there. There's a lot of variation data and evolutionary relationships, so it's really powerful. The main use of the UCSC genome browser is to have a graphical representation of the genome. The main use that Galaxy makes of the UCSC browser is actually to get data and tables. So GTF format, which we talked about, GFF, FASTA files and so forth are all kinds of files that the UCSC genome browser generates and that Galaxy likes to get from it. And you can also upload your data to UCSC and show it on a track and share it and so forth. So all of that's possible.
So this is the UCSC genome browser. This is like the home page, which has access to all the various types of things that they have available. When you go to the UCSC genome browser, by default it shows you the human genome, the latest release, which in this case is hg19, or Genome Reference Consortium human build 37 (GRCh37), which, by the way, they've announced is going to increment to hg20 in a year. So they've warned you. Don't be upset next year when the release has changed, saying "I didn't know". I told you. They've given something like a year and three months' notice of this change. So it's a new build, and a new build of the human genome has repercussions for a lot of gene predictions and a lot of coordinates. All the coordinate systems have to be remapped.
So knowing which build you're working from is critical. If there's one type of metadata that you always, always, always have to keep track of, it's: what's the reference genome I'm working from? Be it human or mouse or deer or wolf or whatever, it's always critical to know what version of the reference genome you're using. For the heavily used ones like human and mouse, there's a standard nomenclature, but for the more obscure ones, check to see what the standard nomenclature is for your community. There is one, I'm sure, and it's going to be critical to know which one it is you're working with. Because if there's a change in the coordinate system, and I say my gene is from 2 million to 2 million and 500, that's no longer true as soon as you have a new version of your genome. And so everything goes out the window. All the RNA mapping, the gene mapping, everything is useless. Affymetrix coordinates, the whole thing: if you don't know which genome it's based on, you're screwed. So it's very, very critical to know which version you're at.
If you do a simple query, in this case it was RAS, K-RAS, you get the UCSC genes that are represented, you click on one, and then you get the graphical representation: RefSeq, non-human RefSeq genes and so forth. This would be K-RAS in the UCSC genome browser with different mRNAs, all the SNPs and so forth, lots and lots of data. And if you look at the whole page from the UCSC browser, basically all these represent tracks which are shown or not shown, and most of them are hidden, that are available for you to look at. So there's lots of configuration possible with the UCSC browser. And of course, all that data is also available in table format.
Yes. No, no, no, no. So UCSC is not a BAM file viewer. It doesn't read BAM files per se. There are other browsers. I mean, we've seen a few here. GBrowse also reads BAM files.
There's a few other browsers that, yes. Yeah, so you have to have your files. The UCSC browser itself doesn't host the BAM files. If you have the BAM files on the web somewhere, and they're readable by anybody in the world, then UCSC can show them, but it's not the best tool to look at BAM files. So a custom track could be your interpretation: let's say where your variants are, where your SNPs are and so forth. That's how you would have a custom track. Yeah. And BED files, yes. So we'll talk a bit about that later. Actually, in a couple of slides, like this slide.
So, many of the file types that you can output and read: you can read BED files, which is basically the UCSC-formatted file. But then there are other file formats that you can generate and that have become a currency, as we saw for example in the RNA lab, for many other tools in the world of bioinformatics. So, tab-separated files. I'm not going to show an example of that, but you can imagine values separated by tabs.
A FASTA file, which I'm sure you all know. The requirements for FASTA are quite interesting from a historical point of view. The FASTA tool, which is the predecessor of BLAST, had a file format requirement for a query, and actually for the database it searched against as well: you had a greater-than sign and then one line. So these first two lines here are actually one line, of anything, basically. So no format. And then your sequence, and your sequence could be nucleotides or it could be amino acids. That was the only requirement for the FASTA file format, and it stayed that way for many, many years. Now it's FASTQ and so forth, and there's lots more differences. But this was the standard file; most tools read FASTA file formats. The input for any query was FASTA.
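The FASTA layout described above (a greater-than sign, one free-text line, then the sequence) is simple enough to read with a few lines of code. The following is a minimal sketch, assuming the input is plain text already loaded into a string; the function name `parse_fasta` and the two example records are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from FASTA text."""
    records = []
    header = None   # definition line of the record being read, without the ">"
    chunks = []     # sequence lines collected for that record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence may be wrapped over several lines; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records: one nucleotide, one amino acid.
example = """>seq1 any free text can go on this line
ACGTACGT
ACGT
>seq2 a protein this time
MKTAYIAK
"""

records = parse_fasta(example)
for definition, sequence in records:
    print(definition, len(sequence))
```

Nothing in the parser looks inside the definition line, which reflects the point that the format itself places no structure on it.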
This is a FASTA record from NCBI, which has its own version of the definition line, so this greater-than sign followed by this. Here the information is structured in a standard way for NCBI. But that's only enforced by NCBI and used by NCBI, and everybody else does it differently. So they have GI numbers, reference numbers, accession-dot-version numbers, and then a short description of what that record is.
This is an example of the BED format. There are three required fields in BED, and then there are nine additional ones, and you can go to this page here for all the rest of the requirements. But basically, you have the chromosome, and then values for chromosome start and chromosome end. And this is not in the BED file, but you actually know which version of which genome we're talking about, and so you know exactly where your coordinates are. And then you have some values, which could be all sorts of information: it could be gene structure, it could be annotations of other types, it could be the color of the track, it could be "I did an experiment at this position, and this is where my transcription factor binds". So it could be ChIP-seq data, it could be all sorts of different things. And there are different ways of coloring things and so forth, which are obviously going to be used in the graphical view.
As we saw earlier today, there's also GFF, which is very poorly, I mean, it's defined well here, but there's GFF3, there's GFF, there's all sorts of variations. Everybody makes their own version of GFF. Unfortunately, it's not a very well maintained and standard file format, except that when people talk about GFF format, they usually talk about these types of annotation that are going to be represented basically as one line per interval.
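As a small illustration of the three required BED fields, here is a sketch that reads one tab-separated BED line and keeps the genome build alongside it, since the build is not stored in the file. BED start positions are 0-based and the end is exclusive, so the length is simply end minus start. The names `BedInterval` and `parse_bed_line`, the example line, and the `build` variable are hypothetical.

```python
from typing import NamedTuple, Tuple


class BedInterval(NamedTuple):
    chrom: str
    start: int               # 0-based, inclusive
    end: int                 # exclusive
    extra: Tuple[str, ...]   # any of the optional fields, left uninterpreted


def parse_bed_line(line: str) -> BedInterval:
    """Parse one tab-separated BED line; only the first three fields are required."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, chromStart and chromEnd")
    start, end = int(fields[1]), int(fields[2])
    if start > end:
        raise ValueError("chromStart must not be greater than chromEnd")
    return BedInterval(fields[0], start, end, tuple(fields[3:]))


# The build is not part of the BED file, so it has to be recorded separately.
build = "hg19"
interval = parse_bed_line("chr12\t100\t250\tpeak1")
print(build, interval.chrom, interval.end - interval.start)
```

Keeping `build` next to the interval is the design point here: the same three numbers mean something different on a different release of the genome.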
So it could be an exon, it could be a single-nucleotide SNP, it could be all sorts of different things. It could be from one nucleotide to full chromosomes and everything in between. And GTF format is like GFF, plus it has these two extra fields, the gene ID value and the transcript ID value, which is how this is represented here. Basically it allows specifically for annotation of mRNAs and coding sequences. So, expressed sequence. Which, of course, we know isn't the whole story; the gene is bigger than that, right?
That's a good question. What is a gene? So where does the gene start? Not all at once. Of what? Good answer. No. A bit further. What's more five prime of that? Even more, what's more five prime of that? Promoter. Yes, so the five prime end of the promoter. And maybe the attenuator, and maybe anything that will affect gene expression, right? So what's the definition of a gene from a Mendelian point of view? If we take the classic definition, a gene is a change in the DNA that causes a mutation that's observable, right? So if you can change a piece of DNA and you can observe the impact, it changes the color of the peas or changes their shape and whatnot, that nucleotide is part of that gene, right? So the gene, yes sir. So are you saying if you have an enhancer far upstream? Yes. So, yes I am, yeah. That is part of the definition of the gene, because it's testable. From a DNA, GenBank point of view, very rarely do you see that. Very rarely do you see a gene annotated five megabases upstream. But technically, from a Mendelian point of view, it should be. They may get around it, and they may annotate that region of the genome as being part of that gene. But if they do, a gene is a single interval.
That's the other constraint there is about the definition of a gene. Because we haven't done the experiment, of course, of removing everything between two intervals to see if it affects the gene expression. But it probably would, and it would probably be part of that gene even though it doesn't have a known function. So the gene interval is the whole region that affects phenotype X. And so in the GTF definition of a gene, it's really only the mRNA and the exons, right? The exons, including the non-coding ones, the UTRs and so forth. But a gene is much bigger.
So Galaxy on AWS, Amazon Web Services, there's a whole, yes. So it could be all one line, but basically you'll have multiple lines describing the intervals that belong to the exons that are part of that gene. Those will be part of the mRNA and will have the gene feature, the gene name, associated. So multiple alternative splice variants will all have the same gene name. That's the glue that keeps the parts together: the fact that they're all part of gene X, right? Okay. It's a very important biological question. You have to remember the biology.
So, Galaxy on AWS. On Twitter, if you do a pound, so, everybody knows what AWS is. It's Amazon Web Services. And there's a whole page on how CloudMan, or Galaxy on the cloud, is run and so forth. That's some of the things we're going to do. So we did this already. All this part we did. You have all the instructions: you're supposed to get this screen and then this screen. This is what you see if you select four instances: they all light up and so forth. And Access Galaxy: if it becomes black, then you click on it, and then you get this. And then this is a coffee break. So it's the end of this lecture.
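The point made earlier about GTF, that every exon line carries a gene ID and a transcript ID and that the shared gene ID is the glue holding splice variants together, can be sketched in code. This is a minimal example assuming nine tab-separated columns with `gene_id "..."; transcript_id "...";` in the last one; the function names and the three example lines (gene `geneX`, source `demo`) are invented for illustration.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf_line(line):
    """Parse one GTF line into a small dict with the fields used below."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    attrs = dict(ATTRIBUTE.findall(cols[8]))
    return {
        "chrom": cols[0],
        "feature": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "gene_id": attrs["gene_id"],
        "transcript_id": attrs["transcript_id"],
    }


def transcripts_by_gene(lines):
    """Group transcript IDs under the gene ID that every one of their lines shares."""
    genes = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        record = parse_gtf_line(line)
        genes[record["gene_id"]].add(record["transcript_id"])
    return {gene: sorted(ids) for gene, ids in genes.items()}


# Invented example: two splice variants of the same hypothetical gene.
example = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "geneX"; transcript_id "geneX.1";',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "geneX"; transcript_id "geneX.1";',
    'chr1\tdemo\texon\t100\t250\t.\t+\t.\tgene_id "geneX"; transcript_id "geneX.2";',
]

grouped = transcripts_by_gene(example)
print(grouped)
```

Each exon is its own line, and only the repeated `gene_id` value ties the lines of both variants back to the same gene.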