OK, so I added this slide, which lets you live-blog me if you want, although Michelle disapproves of doing too much other stuff and not paying attention to what I'm saying, and I agree with that too. So today we're going to use Galaxy in the workshop. How many of you have ever used Galaxy? One, two, back rows, OK.
Some disclaimers. I don't profit in any way, shape, or form from any brands, but I am on the Galaxy advisory board, so if they do well, I scientifically do well. So I guess I have some interest, but I don't get any money from them whatsoever. This is my email, contact info, Twitter account, the hashtag for this workshop, and #usegalaxy if you want to follow Galaxy.
One word of advice for those of you who want to invent a new software package: don't use a very common word that means something else. If you Google Galaxy, it takes a lot of tweaking to find the actual Galaxy software, because there are a lot of other galaxies out there. And so, yes? Which one? Oh, great? Yes. Yeah, there are a few like that. Try to find a "great" package, yes. Good luck. Anyway, #usegalaxy is a unique hashtag for the Galaxy software, and people tweet about it.
So, what we're going to talk about today. One of the really big forces behind Galaxy, really at the heart of the developers and the PIs in that group, is reproducible science. The idea is that if you're going to report on something, a sequence assembly or what have you, it's very useful to report it so that somebody else can reproduce it. Some PIs at OICR were doing structural variation comparisons, so they went to download all the structural variation packages; there's a dozen of them or something like that. Half of them they couldn't find.
The other half they did find, and half of those didn't compile, and so forth. So you can reference a software package in your bibliography, but if it doesn't actually work, it's not very useful, even though it was working when you used it. So one of the things at the core of Galaxy is to make science reproducible: if people describe a pipeline, a process, a workflow in Galaxy and then publish that workflow, anybody should be able to reproduce it.
There are different ways of installing and using Galaxy, and there are lots of different Galaxies out there, so I'll go over that. It's a very intuitive user interface, but if you have never used it, there are a few things that are good to learn about. We're going to cover getting things into and out of Galaxy, processing data in Galaxy, and Galaxy in next-gen sequencing. Next-gen sequencing, in the lifespan of Galaxy, was almost an afterthought. It wasn't designed to deal with next-gen data, but that's what a lot of people are doing with it. So they have now fully embraced next-gen, with a few quirks in the process, and we'll go over some of these.
This is just a plug for a software package on pipelines that's developed at OICR. It's open source. We're not going to talk about it this week, but if you're interested in pipelines, workflows and things like that, I invite you to have a look at it. Galaxy, like the SeqWare I just told you about, is an open source project. So it's freely available, there are no fees whether or not you're a company, you can download it, and you can contribute to it, you can make the code better and so forth.
And as you may know from what Michelle just mentioned, this is in the context of open access, open source, open data. These are all things that we rely on, and if we didn't have them, bioinformatics would crawl and break down and not succeed. So it's really an important thing to keep in mind. There are a couple of papers; I think I referred to one of them, but there have actually been quite a few papers on Galaxy, and they're all open access.
So the first concept I want to tell you about is that there are multiple flavors of Galaxy, and that creates some problems and some opportunities. The home page for the project is not galaxy.org, because somebody else already has that; it's galaxyproject.org. If you go to that page, and there are links to it from the Wiki, you'll see all the various flavors of Galaxy.
The main one, the public server where you can go and do things right away today, is usegalaxy.org. It's at the University of Pennsylvania, oh sorry, Penn State, I get all those guys mixed up. It has two or three CPUs behind it and a few terabytes of storage, so I think there's a 10, 20, 30 gig limit per user, but they're basically saturating. It's a public server; you don't even have to log in to use it. The advantage of logging in is that it will remember you from the last time you were there, so you can go see the files you uploaded again and so forth. So that's a good thing.
A second way of using Galaxy is to get the source code and install it locally. That's getgalaxy.org, the source code for installing it. And I would advise any institution, any university, to have a local instance of Galaxy and to serve its own users.
At OICR, we actually have two instances. One is on a single-CPU machine; it's not a big workhorse, but if you just want to look things up and do simple manipulations, that's quite good. Then we have another one connected to the cluster, so that you can launch multiple jobs on the cluster. So we're a relatively small organization, and we have Galaxy. Most universities don't have it, but you'll see a lot of them do, and a lot of them actually make it publicly available.
Then there is a cloud version: on Amazon there's a version of Galaxy you can use, which is what we're going to use this afternoon. That one is useful because you can get more CPU if you need it, you can get larger clusters, and if you've got large next-gen sequencing projects to do, you can do them there. But it costs. Michael talked about it on the first day. I sort of compare this to giving crack to babies, so this week we're giving you crack. You're going to say, wow, that was good, and then you're going to go home, and then you have to pay for it. So I hope that coming down isn't too hard, but it's the reality of life. It costs.
On the positive side, there is a link here to all the public Galaxy servers. These are people who have installed Galaxy at their institution with their own package of tools, and they make it open to the world. Some of them have quotas on how many files you can upload, and some are specific to metabolomics, some to proteomics, some to RNA-seq, some to microbiome work. So you should go look at that page and see if there's one that already has the toolkit you need for the things you need to do. And that's very nice.
So this is the usegalaxy homepage. It goes over all these things. And what it also has is a number of screencasts.
These are short little videos of screen captures, the same way that we're doing our screen capture here for the workshop: tutorials on using Galaxy, where to click and push, and how to do such and such. The problem with Galaxy is that it can do lots and lots of different things. We're going to talk about one very small slice, and we're only going to touch upon it today. So if you want to use Galaxy, go look at the screencasts, find the one that does the things you want to do, and find the server. And usegalaxy.org is always available and will be available for a while to come.
There are also lots of user groups. There's one for developers, but there's also one for users, with mailing lists, and they're very active. The Galaxy team has a couple of people dedicated to helping their users, so they are very active and very responsive to the user community. If you go to this page here, which is basically a Google search engine restricted to their material, you can search for Galaxy and you'll get the right Galaxy. You can search for screencasts on pipelines for RNA-seq, or for workflows to do metagenomic analysis, and so forth. You will find lots and lots of information there.
So this is what it looks like if you log in to usegalaxy.org, the one at Penn State. We'll come back to this later. And these are all different screencasts. If you click on them, you can scroll these things around, and then you'll have short videos on what to do.
All Galaxy pages, all servers, are laid out the same. They basically have the tools on the right. Sorry, the tools on the left. Left, right, this is right, yes. On the right is the history, so all the steps of where you are and the things you've done.
And in the middle is basically your output, your working space: what things look like, which files you generated, what graphs you looked at, and so forth. That's the work area, where you enter things. And this is the home page for getting Galaxy, the software. Like I mentioned, there's lots of software development activity. And this is a whole page on how to use it on the cloud. Unfortunately for you, you just have to listen to me today and you don't have to read all this.
And this is a public Galaxy server, just an example of one of the Galaxy servers. Some of them will say there are quotas. This one says that if you're registered, it's 110 gigs; if you're not registered, it's 10 megs. So there's an advantage to getting registered.
So Galaxy really integrates all sorts of different data types into one space. It gives you many tools that you don't need to install and maintain: if they're part of your Galaxy instance, you don't have to maintain them. If you're maintaining Galaxy yourself at your own site, then there is some maintenance, because if you want a new version of a tool, or you want to add a tool that you don't currently have, there's a bit of work involved. But that's also very well documented.
Also, you build workflows in Galaxy, and then you can share them. You can share them with your colleagues, you can make them public, or you can share them with just your best friend. So it's a very useful way of showing your collaborators, for example, how you did a certain type of analysis. You can reuse them, edit them, publish them. And like I said, they have fully entered the next-gen space.
Yeah? Like smartphones these days, where we can sort of say "update all my apps"? Yes. They don't have that, definitely.
So I'll talk a little bit about that. The way around it right now has basically been to use a tool shed. A tool shed is a little bit akin to an app store, in the sense that you can go to the tool shed, say I want this tool, that tool, that tool, and install them onto your instance of Galaxy. So it's a not-too-painful way of doing it. There's a lot of engineering involved, and the update-automatically part is not quite working. I think it's on their to-do list probably, but there are many, many other things in front of that item. Yeah.
So again, reproducibility is a really good thing. Keeping the history of which steps you did is what Galaxy is really good at. And not only which steps, but which parameters for which steps: I used tool X with this parameter, that parameter, this file, and so forth, and the output file is named this, and here are some comments and some notes. And you can obviously share all of that also.
Yes, that's a good question. There are servers that have both Bowtie 1 and 2, for example. So you can have multiple versions of the same tool. You just have to explain to the user community that's using that version of Galaxy what all the various versions are. People who have never used Bowtie say, well, I'll take number two, because I guess it's got a bigger number, you know, and they don't necessarily know what they gain or lose by using one version or the other. But there's actually a place in Galaxy for documentation also, so you can explain what certain tools do. We'll look at that. And Galaxy was really meant for biologists.
There are a lot of software developers who develop for it, but the user community is not the command-line experts that all of you have become. It's for people to be able to use tools through a web user interface. So there are things that I find a little painful in Galaxy. If I had a command line and the file, I could do things like cutting a column and sorting it and so forth. You can do that in Galaxy by pointing and clicking, but if you know how to do it at the Unix command line, like all of you do now, it's a lot faster. With that said, there are a lot of biologists out there who really enjoy the Galaxy user interface.
But can you do an entire analysis in Galaxy, with all of this? Yeah, well, I'll talk about that. This is not the RNA-seq workshop, but I actually have an RNA-seq example at the end. Yes, yes. I actually just got recruited, quote-unquote recruited, to give a Galaxy session at an RNA-seq workshop in the fall in Europe. It's a five-day RNA-seq workshop, and we're doing about half a day on Galaxy. So yes, it can be done.
On funding for Galaxy, I should probably point out that the big funder now is NIH. It's developed at two universities, Emory University and Penn State. There's a Wiki page as well, and there's lots of user information on the Wiki. The Wiki has greatly improved in the last couple of years.
So one problem with having all these various versions of Galaxy is this. It's quite easy to go to a Galaxy. The main one, the one at usegalaxy.org, has the most things. But that one is a little bit limited, because there's only so much CPU and so forth and a lot of people are on there; it's actually getting close to being fully saturated, because it is really popular. But then if you go to another one, you get different tools, because each one is different.
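The "cut a column and sort it" operation mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Galaxy implements its text-manipulation tools; the function name `cut_and_sort` and the three-row table are made up for the example.

```python
import csv
import io


def cut_and_sort(tsv_text: str, column: int) -> list[str]:
    """Return the values of one tab-separated column (0-based), sorted as text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Skip rows that are too short to contain the requested column.
    values = [row[column] for row in reader if len(row) > column]
    return sorted(values)


# Hypothetical BED-like table, used only for illustration.
example = "chr2\t100\t200\nchr1\t50\t80\nchr22\t10\t20\n"
print(cut_and_sort(example, 0))
```

Note that the sort is a plain text sort, so numeric columns would need a numeric key to be ordered by value.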
And updating the tools is relatively simple, but it's not that easy. For example, the one on the cloud is not as up to date as the one at usegalaxy.org. We're going to use the one on the cloud, because it's the best one. When we did this workshop, I think it was last year or two years ago, Amazon wasn't working and we got stuck, and we ended up going to the usegalaxy.org version. And it worked. It was a little slow, but it did work. But 30 or 40 people jumping onto the Penn State version at once would slow things down for them a little bit. So that's a challenge.
One solution that they're working on right now, and really supporting, is the tool shed. The tool shed means that what they're shipping now is basically an empty box, and then you go to the tool shed and get the tools you need for the kinds of analysis you want to do. So if you're interested in phylogenetics, there are like three tools there. For another category of analysis there are 118 tools. You can go to those pages and install the tools on your version of Galaxy. For example, these are all the tools in the SAM category, all the various SAM tools, BEDTools here, and so forth. That's the SAM page on the tool shed. So if you want those tools, this is where you come and get them.
So, oops. Oh, yeah. A general workflow for Galaxy is: you log in. It's really important to log in, and I'll explain that again later. You get data, or you upload data. You manipulate your data; you can do this over and over again. Then you save your output to a file. Then you save all this as a workflow. And you can publish it if you want, or share it with your colleagues, or keep it for yourself. That's the way things work in Galaxy. So what we're going to do today is use the Galaxy cloud version, so go to usegalaxy.org/cloud.
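The general workflow described above (get data, manipulate it in repeated steps, save the result, save the steps as a workflow to share) amounts to recording which tool ran with which parameters on which files. Here is a minimal sketch of such a record in Python. It is an assumption-laden illustration, not Galaxy's actual history or workflow format: the `Step` and `History` classes, their fields, and the tool names are all hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str          # hypothetical tool name
    version: str       # which version of the tool was run
    parameters: dict   # every parameter value that was used
    inputs: list       # names of input datasets
    outputs: list      # names of output datasets
    note: str = ""     # free-text comment


@dataclass
class History:
    name: str
    steps: list = field(default_factory=list)

    def add(self, step: Step) -> None:
        self.steps.append(step)

    def to_workflow(self) -> str:
        """Serialise the steps without the data, so someone else can rerun them."""
        return json.dumps(
            [{"tool": s.tool, "version": s.version, "parameters": s.parameters}
             for s in self.steps],
            indent=2, sort_keys=True)


# Made-up steps, for illustration only.
history = History("demo")
history.add(Step("quality_check", "1.0", {}, ["reads.fastq"], ["qc_report"]))
history.add(Step("mapper", "2.0", {"max_mismatches": 2},
                 ["reads.fastq"], ["aligned.sam"], note="default index"))
print(history.to_workflow())
```

The point of separating the workflow from the data is the one made in the talk: a colleague can apply the same steps and parameters to their own files.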
And then you have this information there. We're going to start that right now. It's 2 o'clock now, so we've been at it only 30 minutes. I hope that in the next hour we can get everybody logged in. Goodbye, Michael. It's good seeing you again. Yeah, thanks a lot. It was a pleasure meeting all of you. Thank you very much. OK, goodbye. Thank you so much. Yeah, you're welcome. Take care.
So if you have your Wiki page up, in a browser where you can click, the Wiki has this same URL here, the second one. This is day five, sorry, day two, lab five, or module five. And it's this cwnextgen.signin.aws.amazon.com/console link. You click on that. And there's another file here, Amazon Credentials. OK, I mentioned this at the beginning, and I'll mention it again. The way we're doing this, I have a file here that has everybody's name, everybody's password, and everything. Don't do this at home, right? So keep your password as if it was physical. All of these are attached to my credit card number, actually. I kid you not. So this is why we're going to shut it down at the end of the workshop. But this is also the same account that receives a grant from Amazon, so we're OK. Don't worry too much.
Anyway, the Amazon console should look like this. Your username and password are the ones in this file, the Excel file. So you have to open the browser with Amazon, and then this file, and look for your name. Your login name is in this third column here, and your password is in this last column here. Everybody's doing this now? And if you click and it works, you should get this page. Does everybody have this page? Don't go any further yet. Don't race ahead, and we'll all do it together. Yeah. So Amazon is not the only cloud provider out there.
But I would say it's the best one, in the sense that it has the best tools, the best services, basically. And that's really why we like to use it. Yes. Yes, that's actually a very good question, and Canadians should be worried about that. We're worried about the Patriot Act and so forth, because these things are in the US, so they are under US governance. Amazon, partly for that reason but also for others, also has sites in Ireland, Singapore, Australia, Hong Kong, so all over the world, but not in Canada. I think, though, that Amazon is more secure than most universities. It's maybe even more secure than a lot of companies. You can encrypt; you can transfer encrypted data. You can have multiple fobs and IDs and so forth, so there are multiple points of security to get access to your data and your services. And nobody else over there can have access unless they have the same credentials. So it's quite secure.
That said, at OICR we don't upload any human data to Amazon, because of the kinds of concerns you're thinking about. We would like to, and we'd like to change the policies so that we could, but currently we cannot. And that applies even to a private cloud. There are such things as private clouds, the same type of infrastructure that can be accessed worldwide. We're working on a project right now to install lots of International Cancer Genome Consortium data into a private cloud in Chicago. It hasn't happened yet. We have to get the ethical review boards to approve it and so forth. There are all sorts of challenges.
OK, so everybody has this page? OK, good. So click on EC2, and then you should get this page. It may have different numbers, but it should look like this. OK, everybody there? No? OK. EC2. Is there something like three running? Should be three running. OK, good. We're going to get close to 50. Yeah, we only have 60 total.
There are only 40 students in the class, so if we get above 50, somebody's clicked twice. And there are a few instructors, so it should stay below 50. But yesterday we were at 50. So maybe I had two or three running. Right now I only have one running. OK, are we all there?
So click on Launch Instance. We're going to launch an instance at Amazon from what's called an Amazon web instance. No, sorry, it's an AMI, an Amazon Machine Image, and what we launch is an instance of it. So we get this screen. Everybody's got this one? OK, and then you click Continue. I forgot to put a circle there. And then you should get this page. OK, everybody there? I don't want to lose anybody.
So, Community AMIs, Amazon Machine Images. This part will be slow. It was slow when I was doing it by myself. There are several things you can do here. One is to type galaxy there, OK, in lower case. What it's going to do is look at all the images at Amazon that are publicly available from the community and look for the string galaxy. Once it's done, it will show you the list of Amazon Machine Images that have the string galaxy in them. Please note that this is also how you would search for CBW. Yeah, we're going to do that later. Let's not confuse things right now.
It doesn't seem to be working. So it's not working? What do you get? Not even an error? OK, so are you in the Community AMIs? Yes, yes. They're public images, so they should be there. Can I make a suggestion? I think maybe too many of us were trying to log in, because the first time I tried I got an error, and now it's actually loading. So we'll see what happens. Yeah. Did you get an error first? Yeah, there's a JSON error. That's definitely it, yeah. So go back to the screen, type galaxy, and just hit Enter once.
So how many of you got this page? If you get to this page, what you have to do is click on the instance type, OK? As you can see, there are many different kinds of instances that come at many different prices. As you can imagine, the ones down here are more expensive than the ones up there. Michael had a price chart on his slide; you can look at that. The one we want is the extra large. It's sort of like "supersize me". It's got a processor with four cores and 15 gigs of memory. So we're going to take that one. You select that one, M1, yes. It's an M1 extra large.
Now you're going to go over to the Wiki. On the Wiki there's this information here. You copy this, and you're going to paste it. Or did I put that slide in the wrong place? No. Yes, I did. OK, keep this in mind for now. I think I forgot an image. Yes. Have you copied this? No? OK, so go to the Wiki. It's on the Wiki, in module five. Copy these three lines. Just copy them, you know, to your... To the clipboard. To the clipboard, thank you, that's the word I was looking for. Copy them to the clipboard. I'm missing an image here. It's the one before this, right? Replace the class name with the other. Yeah, but I need to get to the right screen, and I don't have the right screen. I'm missing it. It's the one after this. I think this is the one where you paste this, right? Is it the box? Yes, OK. So you use the user data box, yeah? You paste this, and then you change it: you put your initials, underscore, your number. This is important so that if I look at the back end and see two of these, I can kill one. Whether your initials are one letter or two, it doesn't matter. It's a string.
It's a string that's going to be unique to you in this class. Is it your CBW number? Yes. Or the number on the Excel sheet? Your CBW number that you've been using so far, yes. I'm sorry, I'm so scared of getting kicked out of the system. Yeah, yeah. I just had the previous screen, and we had to pick the extra large and then Continue. Oh, OK. So we don't select any preferences? No, everything else is default. I was like, oh, did I do something? Yeah, yeah. And then here, paste.
OK, so the important thing about pasting into the user data box, and I'm sorry, I forgot to say it: there might be a few leading spaces here. Make sure you delete them so that the text starts at position one in that box. You don't want any leading spaces here. A space here, inside the line, is OK. And this part stays exactly as it is, of course. Yes. But make sure there's no space at the end here either. So just put your cursor there and, without deleting any characters, obviously, delete any spaces at the beginning and at the end of the line. Yes, indeed.
After you paste this, you hit Continue, and then you should get to this page, I think. I hope. By default it will be, I think, quick start. So you select Galaxy. What that does is basically set a lot of parameters for keeping the web portal open. It's all security-related issues, who can talk to whom and so forth. Galaxy has already figured this out for us, so we just select Galaxy and move forward. Then you get to this page, so you just hit Continue. And this is the important one. This is a summary. If it says micro or something other than M1 large, that means you picked the wrong instance.
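The clean-up rule for the user data box given above (no spaces at the start or end of a line, spaces inside a line are fine, and a unique initials-underscore-number string) can be sketched as a small Python helper. This is only an illustration: the function names and the setting names in the example are hypothetical, the real three lines come from the workshop Wiki, and applying the rule to every pasted line is an assumption.

```python
def clean_user_data(pasted: str) -> str:
    """Strip leading/trailing whitespace from every line and drop blank lines.

    Spaces inside a line are left untouched.
    """
    lines = [line.strip() for line in pasted.splitlines()]
    return "\n".join(line for line in lines if line)


def unique_tag(initials: str, number: int) -> str:
    """Build the initials_number string used to tell instances apart."""
    return f"{initials.strip()}_{number}"


# Hypothetical pasted text with stray spaces; the setting names are made up.
pasted = "   first_setting: " + unique_tag("fo", 7) + "  \n second_setting: some value \n"
print(clean_user_data(pasted))
```

Doing this by hand in the box, as in the class, is equivalent; the sketch just makes the rule explicit.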
So if the instance type is wrong, go back and fix that. Extra large? Yes, that's an old slide. Very good. Good catch. Everybody's following me. Very good. Extra large. And then you hit Launch. If you hit Launch, you should get to a screen. It's not this screen either, but it's a screen that has a third item on the page. Select it and copy. Then you go to another tab in your browser, and you should get something that looks like this. The public DNS, yes. It's there at the bottom. If you click on your EC2 instance in the table, the URL should be right there below the table. And if you scroll down, it's down there as well, next to Public DNS, yeah. That's the same number. You need to have the instance running for that to work. This has to be green; basically, the instance has to be alive. And it may take a few minutes for that number to be registered with the DNS at Amazon so that the rest of the world can see it. If it doesn't work, just wait and then try again. Then highlight it, copy it, and paste it into a fresh tab in your browser. You want to keep this tab up as well, and you want the other one that looks like this. It looks like we're there, but we're not quite there yet.
Then you get this screen. We're going to give ourselves 10 gigabytes of space. Enter 10 over here. OK, 10. And then you choose CloudMan. And then here, 10, yeah. When you get here, this should be green, and these should be yellow and gray. You should have one green box that's churning away and starting to rev up the engine. We're only going to use one core. If we were ambitious, we could launch up to 20, but we'd be spending 20 times as much, and we're not going to do that. That's just so you know that if you want to launch more cores and do larger projects, this is where you would do it from.
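The "20 times as much" remark above is simple linear scaling: cost grows with the number of machines and the hours they run. A small sketch in Python, where the hourly price is a made-up figure for illustration only (check Amazon's current price list for real numbers):

```python
def cluster_cost(machines: int, hours: float, price_per_machine_hour: float) -> float:
    """Total cost when every machine is billed at the same hourly price."""
    return machines * hours * price_per_machine_hour


# Hypothetical price of 0.50 per machine-hour, for a three-hour lab.
one = cluster_cost(1, 3, 0.50)
twenty = cluster_cost(20, 3, 0.50)
print(one, twenty, twenty / one)
```

The ratio stays at 20 whatever the actual price is, which is the point being made in the talk.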
After a while, these two things will become green, and this button, which was grayed out, will become dark. Once it's dark, you can click on it. And once you click on it, you should see this. And now we're there. After that, we go on a coffee break. Actually, I lie. There are a few other slides, but you don't have to do anything. You just have to listen to me.
So, as Michelle alluded to, what we've done is make an AMI for the workshop, and we're making it publicly available on Amazon. So if you go to Amazon and you want the workshop environment you've been working in, the command line, the console you've been working on, which has all these tools pre-installed, you can go find it. You have the name of the AMI. It's a public AMI; anybody in the world can use it. It's got that number on it, and it's called CBW, so you can search for CBW and you'll find it. It has all of these tools installed. So that's available for you if you need it. We're paying to keep it online. If you go use it, it will be on your dime, just so you know. It's not free to use; it's going to be an Amazon instance. But it's relatively cheap. For that, you have to sign up for your own AWS account. Then it's the same thing: EC2, Launch Instance, go to the Classic Wizard, use CBW as a search term here, find the one I was just talking about, and then do the same thing as we did in class.
Another thing that Amazon has is the 1000 Genomes data. It has GenBank. It has the human reference, hg19, hg18, things like that. It has Ensembl data. It has UniGene data. So it has a few data sets from NCBI and a few from EBI that are already there, so you don't have to upload them. This URL here at the bottom has all the public data sets that are available. You can go look at that and see what's there.
And it tells you exactly where they are, and so forth. In the lab after the coffee break, we're actually going to use a file from the 1000 Genomes set as well, very similar to what Michael was using. And that's it. If you got to this page, you've been successful, and you're allowed to go for coffee break.
So in the first part of the lecture, I was missing some images and things like that. I will update the PowerPoint with the missing images and put it back on the Wiki, so that you can go look at them and download them, and you'll have the full day. I never stopped it, so it's been running throughout. Leave well enough alone is my motto.
So, same slide, same intro. At this point, you should all have this. And what we're going to do now is not touch it. It's running. It's humming. It's not going to disappear, I hope. What I'm going to do is go through some lecture on the mechanics and on how you can get files onto Galaxy. I will do everything myself, so I don't want you to do it yet. Then we're going to run through it again. We're going to load a FASTQ file, do some QC on it, and do some mapping on it. We'll talk about some limitations of this current image, but you'll get the gist of it. And that will be about all the time we have for the rest of the day.
As I mentioned earlier on, there are lots and lots of things on Amazon, and lots of ways of configuring things. There are a lot of public servers. There's the one at Penn State, usegalaxy.org, but there are lots of other ones. Depending on which community you're in, definitely have a look at those. There's lots of documentation and lots of user help. It's a well-funded group. The whole Galaxy team is about 10 or 12 people, and there are two people whose full-time job is the help desk. Basically, it's helping the user community.
They man the mailing lists, the discussion groups, and so forth. So they're a very active, very supportive community. And of course, the community itself, whenever they can, they will help each other as well. So if you're interested in using Galaxy, I encourage you to be part of those mailing lists and so forth. So we're going to do things, like I mentioned. So this part, actually, is a mistake. I will fix that in the end. I wanted you to do something right now. The thing is, you can't do this right now because you don't have your instance anymore. I was going to have you copy a file to your space on your shell, and then from Amazon, we'd go pick it up from there. That's how you would do it normally. But what we're going to do later on in the lab is that we're going to copy it straight from Amazon to Galaxy. So we're not going to go through this middle step of you holding onto the file. I mean, you could copy the file to your home directory if you want, but what we're going to do is copy it straight from Amazon, where it lives right now, to Galaxy itself. So one of the things that you do in Galaxy, and we're not going to do it now, but once we start in the lab we'll do that, is to log in. And the main reason for logging in is that it remembers who you are, and then that way you can share files, you can make them publicly available to other users, and so forth. If you've never logged in, then, and this is true for the cloud version, the usegalaxy.org version, any instance of Galaxy, you always have this prompt here on the right-hand side, log in and register. So if you've never logged in before, you register. And once you've registered and logged in, this is what it looks like, and you have all sorts of other options. So as I mentioned, the panels, this is true for every Galaxy instance. There's tools on the left-hand side. There's the history on the right, and in the middle is your working part. 
So here, what I did is I looked at some of the tools that are available. This is the tools list, to give you just an idea of the flavor of what's available in the cloud instance right now. So you have all these, and these are the headers of sections that have more tools below them. So for example, Operate on Genomic Intervals, so you can reverse complement, you can copy, delete, get parts, and so forth. All sorts of tricks are in that section. Extract Features, so get all the coding sequence out of a piece of DNA. Filter and Sort, so you can filter for chromosome 22, if you want to get rid of everything else and so forth. Lift-Over, so when the genome changes version, you want to do a liftover from one version to another. And so all of these things are in there. So I did this at the top level; at the level below, the difference would probably be bigger. At the top level, I did a diff, this is a Unix command trick, of the section headers that are available on the Galaxy cloud versus usegalaxy.org. So usegalaxy.org has a lot more tools. And even for GATK, it actually has more GATK tools; what's available on the Galaxy cloud is actually quite limiting. There is BEDTools, SNAP, SnpEff, and so forth. All the sorts of things we've talked about, this is what's different. Everything else is shared by both of them. I think that was in the previous slide. And this is Get Data. If you click on Get Data, then this expands. This is just a part of Get Data. So you can get data from lots of different places that offer data. So the one that's the most commonly used is the first one, where you just upload your own file. But a very, very common one would be to go to the UCSC Genome Browser, which has lots of data in table format. So you can get a GFF-formatted file, a list of all the genes and all the features and where their coordinates are on chromosomes and things like that. You can get that from UCSC. And we'll do a bit of that later. 
But there's a bunch of other servers where you can get data from. So all these places basically talk to Galaxy, whether it's on the cloud or at Penn State. Also, at the top of the tools panel is the search bar. And because the list is so long, that's usually what I use. So I just type a string to match. So you can type SAM. And you'll get sample, you'll get SAMtools, and so forth. So you can just type a short string that you know, like mapper or map. And then you'll get all the mapper-type tools. So that's really the fast way of going through this long list without having to go click, open up sections to see is it in there or not in there? Oh yeah, I remember where it was, and so forth. What I do is just type the name. And so if you type SAM, then you get all these things. But if you type SAM on the cloud version versus SAM on usegalaxy.org, there is quite a bit of difference. And so there's different tool sets that are present at different sites. So it's just something to keep in mind. But we're going to do SAM-to-BAM, for example, which is a SAMtools tool. And I think that's the only SAM one. We're going to do some other things. So I'm also going to talk a little bit about the UCSC Genome Browser. Who's used the UCSC Genome Browser here? A lot of people. So if you work on eukaryotic genomes that have been sequenced, it's probably on this site. They do a lot of evolutionary comparison at the high level, so they'll compare fugu against chicken, against zebrafish, against all the vertebrates that have been sequenced, and so forth. There's lots of information. There are a lot of tracks at UCSC. And it has both the graphical and the table view. And this table view of the data is what Galaxy uses. And so it actually knows to go get that kind of information and then incorporates it. And you can also in Galaxy, sorry, in the UCSC Genome Browser, you can actually load your own tracks and have it show them to you. 
So you can see them or compare them to what your colleagues see, make them public, or share a URL where they exist, and so forth. And it's a client-server sort of architecture. And so this is the UCSC Genome Browser home page. And basically, if you hit Genome Browser at the top, there's lots of information about various data sets and so forth. A problem, I think, in general with all genome browsers, be it Savant or others, is that we're dealing with very complex data in a sort of linear, one-sequence dimension. And so there are different tricks and different groups of views to show that information. But UCSC is probably, in a way, the worst culprit in being stuck in this sort of linear mode and just adding tracks. And it just becomes a little overwhelming, I'd say. And that's their data model, that's the way they operate. But things like Savant, and even Cytoscape, that sort of bring things into a different dimension are really important. How do you represent protein-protein interaction? How do you represent clinical information? So IGV does a bit of that too. So there have been ways, and groups, that have tried to get out of that sort of constraint that 2D imposes, but it's definitely a challenge. So if you hit Genome Browser, this is what you get, and you can enter right away a gene name, a specific position, and so forth, and then you hit Submit. And for example, if I put in KRAS in this example, it does string matching on KRAS and it shows me all the places in its database where KRAS has showed up. And so there'll be RefSeq references, there'll be different alternate splice variants of KRAS, and so forth. So you can click on one of them. And I think I hit the first one, so it gives me the coordinate system, where it is on the chromosome, which chromosome it's on, and usually by default, it shows alternate RNA splice variants. And then also by default, it shows you similarity to other organisms. 
And so you'll see that there's very strong similarity to mouse, dog, and Xenopus, and so forth. So there's Xenopus, there's one gene here that seems similar to Xenopus, so where conservation is present, you'll be able to see it in that track. What it also has, if you scroll down, is basically every track that it actually holds. And so it has ways of representing them compressed, expanded, and so forth, and so you can customize: say you just want to see alternate splice variants, you can shut off all the other tracks. Other examples of data types that are available from UCSC: tab-separated, in general; FASTA sequences, which are just a greater-than sign with an identifier and then a nucleotide or protein sequence. There's BED format, there is GFF format, which is often used for gene features, and GTF, the gene transfer format. So this is FASTA format, pretty straightforward. This is BED format, so it's chromosome, start, and some description and some values. And they can sort of annotate and do various things. The GFF format will have sequence name, source, feature, start, end, score, strand, and so forth. GTF, sorry, GTF, yes, it's like GFF, but it's more specific to exons and coding sequences, and it has a couple of extra fields for gene name and transcript ID. So again, this part I'm going to do only on the screen, so you don't need to do this. It's very straightforward, just to show you the mechanics of how Galaxy works, and then we'll do an example together. So I'm going to do an example, which we could do on usegalaxy.org, because it actually doesn't cost much in CPU and file size. We're going to go get the 50 base pairs around SNPs, all the SNPs that are on the BRCA1 gene. That's the operation we're going to do. We're going to do that in Galaxy. So I go to Get Data, go to the UCSC main server, and get the Table Browser. 
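As a rough illustration of the BED layout just described, here is a minimal Python sketch that splits one tab-separated BED line into named fields. The function name `parse_bed_line` and the example values are hypothetical, made up for illustration; it only handles the first six BED columns (chromosome, start, end, name, score, strand) and is not how Galaxy or UCSC parse the format internally.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str                    # chromosome name, e.g. "chr17"
    start: int                    # 0-based start, inclusive
    end: int                      # 0-based end, exclusive
    name: Optional[str] = None    # feature name, e.g. a SNP identifier
    score: Optional[str] = None   # kept as text; often just "0"
    strand: Optional[str] = None  # "+" or "-"


def parse_bed_line(line: str) -> BedRecord:
    """Parse one BED line; only the first three columns are required."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Pad the optional columns so shorter BED lines still parse.
    optional = fields[3:6]
    optional += [None] * (3 - len(optional))
    return BedRecord(chrom, start, end, *optional)


# Hypothetical one-nucleotide SNP record (values invented for the example).
record = parse_bed_line("chr17\t1000\t1001\tsnp_example\t0\t+\n")
print(record.chrom, record.start, record.end, record.name, record.strand)
```

Because BED intervals are half-open, a single-nucleotide SNP has `end - start == 1`.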
I'm going to select variation and repeats, the track, the version of the track. I'm going to look up BRCA1, and I'm going to hit look up, and it's going to give me all the various BRCA1s. So I get the position by doing a lookup. And then I set the output to BED format, and I'm going to send the output to Galaxy. So by default, since I came in through Galaxy, it knows it's going back to Galaxy afterwards when it generates data. So here on the left-hand side, I'm in Galaxy, and then I'm going to give it a name if I want. And I want the whole gene, the whole BRCA1 gene, so I'm just getting one gene. Then I click send query to Galaxy. So it goes to Galaxy. So Galaxy is now thinking about it, so it's got a yellow box and there's a little thing turning away. And it turns green, so you've done well. It's figured it out. So I have the file name here. If I click on that, it expands and shows me a short snippet of what it is I retrieved. If I click on the eye or, the expression I like to use, if I poke it in the eye, it shows it in the middle panel, and it shows me what it retrieved. So it's got the chromosome number, start position, end position, the name of the SNP, and no strand information. Sorry, there is a strand in the last column, and the zero doesn't mean anything in this case. In this small window, you also have the information on what the columns are, so it actually has the names of the columns. So you poke it in the eye, you can see that. You can also hit the pencil on the right-hand side, right here. So that was poking it in the eye, I did that before. Now, if I hit the pencil, I get this page in the middle, and that allows me to actually rename it to a better name than whatever the machine guessed it was. So I can give it whatever name I like. And I can also, and I strongly encourage you to do that, put in annotation notes and so forth, so that you can remember later on when you look at it. 
And this is also important, very important: which build it was done on, hg19. As some of you that were here for a previous lecture know, hg20 is coming out in a few months, like a month or two. And so all the coordinates, whatever it is you did on version 19, are going to get shifted around a little bit. So you save that, and then it's got the new name on the right. And now, while we're at it, we're also going to rename the history. We can delete things if we want, and they're gone. But then we can share and publish and so forth. That's all available from that panel on the right. So now we're going to Operate on Genomic Intervals, and we're going to get the flanking sequence. So the SNP is a single nucleotide position, and we want to get 50 nucleotides off of that. So we're going to go hit this one, and then we get this panel showing up in the middle. And so that's the name of the file. This one has got the right format, so it guesses that that's the right one. If you have more files in your history, all the files that have the right format will be available there from a radio button. And then all you need to do is execute. And so one way to get the flank tool is to actually type flank, that's a quick way of looking through all the various tools, and then you execute. Then again, so now instead of being a one-nucleotide interval, it goes from, let's say, 422 to 472, so you've got 50 nucleotides. So we used to have a one-nucleotide coordinate, now we have 50 nucleotides, because we told it to get 50 nucleotides. And so you can look at it again, poke it in the eye, you see the thing on the bottom, and so forth. We're going to work on this workflow. So you can extract a workflow from what you've done, and it basically was get flank, extract genomic intervals, and so forth. 
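The flank arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the coordinate math, assuming half-open intervals as in BED and ignoring strand; `get_flanks` is a hypothetical helper, not the code behind Galaxy's flank tool.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end exclusive


def get_flanks(start: int, end: int, size: int = 50) -> Tuple[Interval, Interval]:
    """Return the (upstream, downstream) flanks of an interval.

    The upstream flank is clamped at 0 so it never runs off the
    beginning of the chromosome. Strand is ignored in this sketch.
    """
    upstream = (max(0, start - size), start)
    downstream = (end, end + size)
    return upstream, downstream


# A one-nucleotide SNP at position 472 gives a 50-nucleotide upstream
# flank from 422 to 472, matching the example in the lecture.
up, down = get_flanks(472, 473, size=50)
print(up, down)
```

A fuller version would also clamp the downstream flank at the chromosome length, which needs a table of chromosome sizes that this sketch does not assume.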
So what you can do now is you can save this workflow and say, OK, get me the flank for a different gene. So this was BRCA1, get it for BRCA2, get it for KRAS, get it for CLOCK, get it for whatever gene is available. And so you just run the workflow again, and you don't have to go through and click all the various parts again. So we named the thing get flank, and so now we're going to get the KRAS SNPs, and it just runs all at once again. So now we're going to start using our instance. So go to your Galaxy instance, and don't type this in. Instead, on the Lab Wiki, there's a Lab 5 tab. You click on that, and let me bring mine up. So Module 5, there's this Module 5 Lab. So there's things I told you to do here, and that's not what we need to do, because you don't have this instance anymore, so you can't go there. But you go down to this URL here, not the wget part. So this command, if you run this command from your shell on Amazon, or any UNIX prompt that has the wget command, it will actually go over the web and get this file that you told it to get. But now what I want you to do is just copy the URL. So select, cancel that. Don't click on it. Sorry. I want to sit down. Yeah, so one way to do it is you control-click, and you copy link address. That's one way of doing it. So you do that, and then you go back to your Amazon. OK. So here's something, just so you know. So let me make this full screen. So on Amazon, if you click on this little bar here, that disappears. Click on this one. That disappears. So I want this one here on the left. So I'm going to click back on it. And now I'm going to Get Data. And I'm going to upload a file. Then here I'm going to paste, Command-V. So I'm going to paste that URL that I just copied. Don't hit Enter yet, because there's something else you need to do. So I'm going to go back to the PowerPoint. So the other thing you need to do is up there. 
Oh, yeah, because it's at the wrong place. So for file format, let me try to see where I have that. Oh, yeah. So for file format, you get to select fastqsanger. So there's different FASTQ formats, and this is an older one. I wish Michael was still here. He could explain some more. But basically, so you typed the URL I just told you, which is the S3 Amazon URL. You don't type this one, but the one that I told you. And what we're doing is we're getting data and we're uploading a file. Yeah, FastQC, yes. FastQC, FASTQ. fastqsanger, yes. Not FastQC. fastqcssanger? No, no, no. That's not correct. It's incorrect. OK, yes, I typed it. OK, anyway, it's a pop-up menu. Yeah, fastqsanger. Yeah, you select fastqsanger from the menu. What I wrote right now, what I wrote in the box. What I wrote myself, not what it says there. I'll fix that later. Thanks for catching that, Sabin. OK, so this is going to take a bit of time, I think. And then you hit Execute, and then basically, Galaxy magic happens. At first, you'll get a gray box. It's really sort of going mm-hmm, mm-hmm, mm-hmm. And then it's waiting. And then it's going to turn yellow. And this little thing is turning about, and it's actually doing its thing. And then when it's done, it's going to be green. Beautiful green. And then there's the evil red, which you don't want to see. But that happens sometimes, too. Red is not good. Yeah, that's the thing about the red box. You don't want to see the red box. Anyway, so if you do that, it won't say CBW, it'll say S3, Amazon, blah, blah, blah. So everybody have a yellow box right now? Yes. Green, you have a green box. Even better. OK, we're doing good. Who has a gray box? That's good. No gray boxes. Who has a red box? No red boxes. Yellow, who's yellow? And green. OK. 
So Galaxy is really keen about helping you keep track of things, adding tags, adding hints, so that when you come and look at this a month from now, you remember what it is you did, and so forth. And so in addition to the actual which tool and which data, and so forth, you can add names to your history, you can add names to your workflows, and so on. All these things are possible. And so one thing you can do while you're waiting: it says unnamed history, so you can click on it to rename the history. And so you can call it whatever you like. I forgot to name it on some of the slides there. But basically you can name it Mapping Chromosome 22 Reads. So this file that we're downloading is a million reads, 75-base-pair reads from human chromosome 22, which is the smallest chromosome. And green, we like green. Green is a very calming color. So if you click on the box once it's green, then you'll see what it looks like, and you know it well by now. It looks like a FASTQ file. And as the name indicates, it's got that. So you can save the file. So this is an icon of a floppy disk. Some of you may not know what a floppy disk is, but that used to be a device onto which we used to write files. Anyway, that's used now as an icon for saving, although nobody uses disks anymore. You can get more information, or you can rerun it also. And you can add tags and so forth. So one thing you can do right now is that you can poke it in the eye. If you poke it in the eye, once it's green, then you'll see the whole file. And usually with next-gen sequence data, because the files are so large, and this one is a million reads, it only shows you the top part. So from poking it in the eye, you see this. If you use the pencil, then you get this edit window where you can add notes. And something I do sometimes is I put in annotation notes. I put in the old name that was machine generated. 
I stick it here so I remember what it used to be called. And in this case, I called it exome underscore chromosome 22 dot fastq. But you can call it whatever you like. Having an extension that represents the file type is usually useful. Galaxy doesn't use that, but you as a user may find that very useful. I find it very useful. Anybody still in yellow? A few yellows? Yeah? OK. So you don't keep the first part of the name? It's up to you, I mean, it doesn't matter. It's whatever is convenient. If you like the long name, keep the long name. But you're going to see Galaxy is going to make up names pretty soon. So if it makes up names that reflect parts that are useful, they start being very long strings, because it just adds to the name, basically. And then when you finish with the name changing, you click Save. And now I have, that was the history name, which actually we're not going to be able to do because of some other problem. But I'll tell you about that. But now that's the new file name that it has. And it's 210 megs, or thereabouts. OK, so who is still in yellow? One, two, three. OK. OK, you can look at the screen next to you. So the first thing we're going to do is quality control, to see how good this FASTQ file is. We got it from the 1000 Genomes Project. It's probably in pretty good shape. But the one coming out of your lab may not be so nice. So it's a good thing to do some QC. And we do that on all our files. We do some QC of the BAM files and of the FASTQ. So where do I find the FastQC tool? I just type FASTQC, or just FAST, or whatever. And then it matches that string. And so then that's the one I want to click on here, FastQC: Read QC, reports using FastQC. So I'll click on that. And basically, by default, it already knows what my file name is, and I just hit Execute. And then, this one's pretty fast. I now have a second item in my history panel. That was my first file. 
And now I have my second one, which is basically the FastQC of that file, named FASTQ.HTML. So the FastQC program output is actually an HTML file. And then you click on that. You poke it in the eye. And then you see this. You scroll down. And you get this graph. And so you're lost? Did I hear somebody say they were lost? OK. You get this file, right? You have this. Then you go type FASTQC in the tool search. And you look for this program here. It's under NGS: QC and manipulation, FastQC. You click on that. You get this. Execute. And this box turns gray, yellow, and green very fast. So I don't think I had time to go catch the gray part or the yellow part. And so you click on that. Or sorry, you poke it in the eye. You get this. And then you can click on every one of these. And I only have one image from the QC, which is basically quality score on the left-hand side here, so the Phred score that Michael talked about, up to 40, against the base position. So 1, 2, 3, up to 75 base pairs. That says 76. So we now have 76 base pairs. And that worked. So all of this we're doing right now, it's working well in the cloud. This works well. Apart from loading the file, which to Penn State would have been a bit slower. So one thing about this FASTQ file: it's actually in an older FASTQ format than some of the newer programs expect. It's something to do with the quality score encoding and the range of numbers. And so in Galaxy, there's a program called Groomer, and there's an explanation of what Groomer does. If you type groom, for example, then you get FASTQ Groomer, which is the program we're going to use. So you go type groom here to find all the programs. Click on this one. You get this page showing up. It knows the FASTQ file. It's in Sanger. So you've got to change the input FASTQ quality score type to Sanger. That's correct. And then you run execute. Here you have instructions or documentation on that. Click that. This one actually takes a while. 
We even have time to go get a glass of water. It takes a few minutes. So once you get this one started, you can ask questions so far. Talk to your neighbors. Yes. Yes, the Tool Shed. So I'm not sure. The question is, the version we have at OICR doesn't have everything. When you go download it, does it come with everything? They're actually migrating to shipping an empty shell, basically. But I think we did it, what, two years ago now? And it had pretty much most of these tools in it, right? You had to get a few. How many tools did you have to get? It's a big install challenge, yeah. And we also have the UCSC Genome Browser that it talks to. We also have a local version of that. So that's a bit of a headache too. And we also have an Ensembl. Yes. So it makes it easier for you, but it's the source code, to get all the things installed on your platform. Because some of it you have to compile on your machine, so it works with the libraries and so forth that you have. So you need a sysadmin-type person to help you put it together, yeah. It's not quite an app store yet. But it's well documented. And like I mentioned before, they realize that getting a user to install their own version is a big plus for them, because then they're not using the Penn State one, which is getting saturated and so forth. Yeah, and from a privacy point of view, that's why we did it, yeah. So while you're doing this, one thing to do is to go look at the Galaxy Project page, galaxyproject.org. And you can go subscribe to the mailing list. You can go see, so actually search here. So this is a Google search engine, but it only looks at the Galaxy stuff. So if you're interested in, let's say, RNA-seq, it's 240 hits, published workflows. So there was a question, but he's gone, about RNA-seq workflows. Step one, you do FASTQ Groomer, TopHat, Map with Bowtie, Map with BWA for Illumina, flagstat, Cufflinks, flagstat. And so forth. 
So you can import, you can save this workflow and use it on your own data, and so forth. So that's an example of things available. Different one, different person. It's good if some of these actually are cited in papers. And so there's all sorts of notes and so forth. So yesterday was what? Like BWA, so there's a user question about BWA back here. Oops. I just deleted the question. Let me start over. Go to usegalaxy, published workflows. Oh, yeah. Go to Shared Data, and you can search BWA. Yeah. So this is here, BWA, demo mapping application, duplicate. So aun1, that's Anton. He's one of the developers, the PIs on the Galaxy project. And so these are all the published workflows, with user ratings and so forth. And you can search for specific ones that have the tool of interest. Somebody yesterday mentioned ChIP-seq. So we actually don't do any ChIP-seq in any of the workshops, but obviously, Galaxy does ChIP-seq. So there are workflows that are available there. RNA-seq, and so forth. Lots of information. So where did I get this? I went to Shared Data, Published Workflows. These are publicly available workflows. Published Histories. Some people have their histories, their visualizations, and pages. Pages is a way of putting together a workflow, the data, the analysis, and the comments into a page, basically. It's like a publishable unit, almost. Although it's not that widely used yet. Right now, this page is on usegalaxy.org. So it's in the Penn State version. From there, you can launch a cloud cluster. And actually, if you have your Amazon account, it'll take you pretty much straight through to the last step, almost. So there's a lot of steps that it skips. But you don't have the pleasure of going through the agony of selecting all the variables and the type of instances, and so forth. So who has a green box at this point? OK, so I'd say it's majority rules here. So you poke it in the eye, and so you see what kind of file is this? FASTQ file. 
So this is a groomed FASTQ file. So we're actually going to use Bowtie, which is another mapper; you used BWA yesterday. So we're going to use Bowtie. This is actually Bowtie 1. It's the old Bowtie. And how did I find it? So if you type map in the search tool on the left-hand side here, it has a few mapping tools on the cloud. The Penn State one, usegalaxy.org, has more tools. But here it doesn't have, for example, BWA, which you guys used yesterday. So Bowtie is a different mapper, which does work. And so the thing you have to change is the hg19 here. So we're going to map against hg19. And the FASTQ file we're going to use is the groomed FASTQ file from chromosome 22. This is the name I gave it. So I renamed that file groomed FASTQ. And I left everything else as default and hit execute. And this one's pretty fast. And then poke it in the eye. And I have some header lines from SAM. So this is a SAM file now, right? And then I have all the things that we talked about yesterday. And then it actually tells you which chromosome each read mapped on. So these are all reads from chromosome 22. But you'll see there's quite a few on chromosome 14 and other places that seem to map there as well. What is that? Why is it that certain reads mapped in multiple places? Mostly repeats, mostly duplicate genes. There's all sorts of reasons. It could be the aligner as well, says the software developer there. Yeah, the aligner. So that's why there is ongoing benchmarking with aligners and SNP callers, and depending on which aligner with which SNP caller, you get different results. And then you have to do validation at the other end. Validation means resequencing, amplifying the DNA, and sequencing with either a different technology or the same technology, so that you can validate that that SNP is indeed a SNP or variant. Because we're not calling it a SNP yet just because it's different from the reference genome. 
It's a variant; we don't even know if it's a somatic variant or what kind of variant it is. It's just different. So this difference, is it there by chance? Is it sequencing error? Is it a population error? Is it sample prep? Is it an aligner problem, and so forth. And there is ongoing benchmarking with mappers and aligners and SNP callers to see which one's the best. And the truth of the matter, unfortunately, is that we don't know. It depends on what kind of DNA you have, it depends on GC content, it depends on how homogeneous the sample is. The worst case is cancer, where you have different cellularity, where each cell is part of an evolving mass, basically. And so there's a lot of challenges there. All of those things make it so that things map at the wrong place sometimes. And so there's ways in IGV. We looked at it, and we can see differences; some of them are sporadic, some of them, everybody's different from the reference, and so forth. And we have position coordinates here, but actually, this is not the best way to look at it. And so the one we picked, basically whatever reference you're going to select or map against is whatever is loaded on your computer. So on this computer, there is hg19, but they also have some plant and fish genomes. They have other animals and so forth. So you need a FASTA file that has most of the genome sequenced, although it could be partially sequenced, and then you're going to map against that. And the way many of the aligners work is that they sort of break it into small words. They know where those words are in the genome. There's an index system, and then they can do a quick lookup. It matches this word, so therefore it's over there. It matches that word, so it's over there. And so it needs this FASTA file to work with, and from that it builds indices, so it can go map across multiple places. 
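The "break it into small words and look them up" idea can be sketched as a toy word index in Python. This is a deliberately simplified hash-table illustration with hypothetical function names; it is not how Bowtie works internally (Bowtie uses a compressed index based on the Burrows-Wheeler transform), and it does exact lookup of only the first word of a read.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Record every position at which each word of length k occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: Dict[str, List[int]], k: int) -> List[int]:
    """Look up the read's first word to get candidate mapping positions."""
    return list(index.get(read[:k], []))


# Toy reference containing the word ACGT twice, like a small repeat.
reference = "ACGTACGTTT"
index = build_kmer_index(reference, k=4)
print(candidate_positions("ACGTT", index, k=4))  # two candidates: a repeat
print(candidate_positions("GTTT", index, k=4))   # one candidate: unique word
```

A word that occurs more than once in the reference yields more than one candidate position, which is the toy version of a read mapping to multiple places.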
And obviously there are some words which are the same at multiple places. So where does the read map to? In some software, it's totally random. It picks one or the other. Sometimes it picks the largest chromosome. Sometimes it picks the smallest chromosome. There's all sorts of, what? Some don't what? And some refuse to report it. And some, appropriately, report everywhere. So you get these pileups of things in certain places. So before we do anything else with this file, since we know that we gave it only chromosome 22 reads, we're going to clean it up. And we're able to do that. So we're going to do a Filter and Sort. So we're going to Filter, Filter, Filter, not this one here, yeah, Filter. And you get this window showing up. So it knows which file to get, which is the SAM, the Bowtie-mapped reads from chromosome 22. And we're going to use the following condition. We're only going to take the lines where column 3 says chromosome 22. So this is very much a standard way to select on a column in Galaxy. So c3, column 3, equals-equals chromosome 22 between single quotes, a string. So it matches that string of text. Sorry? Oh yeah, it said c1, column 1. Yeah, so it doesn't know you're looking at a SAM file, where column 3 is the chromosome. In a BED file, actually, the chromosome would be column 1. So this is a filter for any tool, for any flat file. And so now after we do that, poke it in the eye, and everything's chromosome 22. So like yesterday, we're going to convert this SAM file into BAM. Because BAM is more compressed. A lot of software prefers BAM because of the small footprint, and it's easier to work through, and so forth. And with the other adjoining file, which Galaxy knows about too, the index file, it's able to look things up faster, and so forth. So Galaxy knows about the index file, it will keep track of it, and it will use it when it's necessary. 
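The column-3 filter just described can be sketched in Python as follows. It is an illustration only, not Galaxy's Filter tool: the function name is hypothetical, the reference name `chr22` is an assumption about how chromosome 22 is spelled in the file, and keeping the `@` header lines is a choice made in this sketch.

```python
from typing import Iterable, Iterator


def filter_sam_by_reference(lines: Iterable[str], reference_name: str = "chr22") -> Iterator[str]:
    """Yield SAM header lines plus alignments whose column 3 matches.

    Column numbers in the filter expression are 1-based, so column 3
    (the reference sequence name in SAM) is index 2 after splitting on tabs.
    """
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through so the output is still SAM.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == reference_name:
            yield line


# Tiny made-up SAM content: one header line and two alignments.
sam_lines = [
    "@SQ\tSN:chr22\tLN:100\n",
    "read1\t0\tchr22\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t0\tchr14\t20\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
]
for kept in filter_sam_by_reference(sam_lines):
    print(kept, end="")
```

For a BED file the same idea applies with column 1 instead of column 3, as noted in the lecture.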
So I just type SAM in the tools search. Then I find SAM-to-BAM, click on that. And then I get this: the reference is set to locally cached, and the input is Filter on data 5, which is the name of the file from the last step. And so I get the BAM version of that chromosome. So I renamed it BAM Chromosome 22 Mapped Reads. Now we actually have this new icon showing up here. If you click on that, you get two things showing up. You get Trackster and Circster. Circster is a word that's taken off of which other tool? What does it look like? Circos, yeah, it's actually their version of Circos. I've actually never used it, and we don't have data for it here. So we're not going to talk about Circster, but we're going to use Trackster. So I'm actually going to click on Trackster. So first you have to convert it to BAM. Yeah, you have to do this on the BAM file, sorry. I'm going too fast here. You're one screen behind. So you have your SAM file. And that's cleaned up. It only has chromosome 22. And then I just search SAM, SAM-to-BAM. That's SAM-to-BAM, execute. And now you can poke it if you want, but basically it's a binary file, so it won't be able to show you in the middle panel. So you can use the pencil to rename the file. You can save it with this weird-looking square thing that's called a diskette. You can get information, rerun the conversion, or click on here, which is Trackster. The first choice is Trackster. Oh, OK. So this one? You get this one? This window here? You don't get this? Did you use the groomed version? If you go back, make sure you used the groomed FASTQ file; that may be the problem there. If you didn't groom it, it won't work. So grooming was all the way back here before mapping. You have to run FASTQ Groomer, and then you have the groomed FASTQ. And I think I renamed it groomed FASTQ chromosome 22. I don't think I did that. No. No, I did not suppress the header. I wouldn't do it. I don't know. I haven't done it. I don't know. We shouldn't.
We're actually going to filter the mapped reads after. We filtered it when we selected chromosome 22. So did anybody get Trackster to work? Yes? OK. Huh? No data to display. What did you do? Yeah, I got that. OK. So click on your SAM-to-BAM, the SAM-to-BAM. And then one more screen. Yeah. So we clicked on there, and then you click on it just to get more info. Don't click on the eye, because you go back one. Just click on the words. This one here? Well, I think they don't have that open yet, because we just clicked on it. Yeah. And then right click on Trackster. Yeah. And then click on Trackster. Let me see if I can bring one up. Yeah. Now that's Trackster. And it's a quick one. Yeah. That's OK. I did some good one with one of these ones. Oh. OK. It says that you did this. But you have this thing. That's all with it. Yep. So you just can't. So can I expand that? I don't know. Yeah. And then right click on Save. Save. Yeah. I had one, but it disappeared. I bet it must be it. This is where it would be. Yes. Oh, here it is. I used to know this earlier. Oh, yeah. You're all logged in. Yeah. Oh, look at this. Francis, it seems you have to register within Galaxy to access Trackster. Yeah. So everybody... No, no, but I told everybody... No, everybody should be registered. I did do that at the beginning of the workshop. Everybody should be logged in, after I told them that they all had to. If you're not logged in, yeah, you have to register. If you're not logged in, if you don't have... let me log out. So you log in. If you've never logged in, you have to register. And then you put in your name, password, confirm password, and public name. And then you submit. I'm already registered, so I'm just going to go back here and log in. And now I go to Visualize, Trackster, view, and new visualization. I'm going to name it chromosome 22, create. And then you've got to change this here to chromosome 22. And there... So this is the exome sequencing, if you recall. So it's only...
You only have reads where the exons are, because exome sequencing is a technique by which you capture only certain parts of the genome. So it represents only about 1% of the genome: 50 megabases, a bit more than 1%. And so you get deep coverage where the genes are. So there's lots of... This is a new tool in Galaxy. The documentation is in development, I would say. And loading tracks on Amazon is not as easy as loading them at UPenn. So you've got this sort of dichotomy of challenges there. So if I want to get back... So I would save it. So you can save it here, with this funny little square thing. It'll be a saved image or a saved view of the genome that you can look at. And... yeah, remember to look at chromosome 22. M is the mitochondrion, and there are all the other pieces. So I'm going to get back out. I guess I could go to Analyze Data. Are you sure you want to leave? Yes, leave page. So... What was I going to say? Oh, yeah. So you can also... If you had a current version, you can click on here for the IGV view. There is... Display at UCSC main. I tried that yesterday. It didn't work. So I'm going to try it again. See if it works this time. Oh, we're on the right chromosome. That's good. So... But it doesn't show me my reads. Oh, yeah. This one here is scale. Chromosome 22. This one. There we go. Oh, yeah. BAM. That's my data. Cool. Yeah. So... There we go. So you can control-click here and you can show it differently. So it can do full. There we go. So now we're in... You've got to be careful. So we're in... Zoom in. Zoom out. There we go. That's a bit better. So... Does anybody have a favorite gene on chromosome 22? TBX1. TBX1? TBX1. Sorry, couldn't locate TBX1. That's good. I found something. 22. It is 22, right? Which link? Oh, okay. But I'm on 22 now. So... Yeah. Yeah. You just type chr22 with nothing else and you should be able to get the whole chromosome. Chromosome 22.
So what we're doing here is that it took our BAM file, basically the coordinates of all the reads, and it projected it onto the UCSC Genome Browser. And so you have all the files, the things I talked about earlier today, of where the genes are, the SNPs and so forth. So you have all of that here as well. So if you're looking for your favorite gene, if you scroll down, you have all those various fields that you can select from. They're all available here. Okay. So... Let's be brave here. I'm going to try Ensembl. I've never done this before. Okay. See what happens. You've been redirected to the nearest mirror. US East. Okay. Sent me to chromosome 6, that's not good. That's a bad sign. So I'm going to change the location here. I'm just going to change all of that for 22. Okay. Yeah. This is not as useful. Regional compared. Genomes. So I've still got a few things to cover in my lecture, so I'm going to go back to my lecture now. So that was Trackster. If you remember to pick 22, this is one view, chromosome 22. So we've done a workflow right now. So we've done, you know, so forth. So you can... It's called advanced generation. So you can actually have it, and then if you click on here, you can get an edit button. So first, you're over here, and in your history it's the top right button. Extract workflow. So you generate a workflow that will save everything, as I showed you in the lecture for the SNPs. And then you can select everything and then you can go. And then this is basically all the steps of the workflow. There are all the files. They're all linked together by connectors. And this is the overall view. The bottom right shows you where you are, sort of like Google Maps. So you can move this. If you want to go to the right, you just shift this box over. And you can look at that. These things, you can move them around. You can spread them out for a different layout. And any one you click on, you can edit.
And so, for example, if you want to rerun this against a chromosome 21 file, then, remember when we did the filter on chromosome 22, you can just change that to 21. And then you can run the whole workflow against chromosome 21. And it becomes a one-click operation to do all the steps, by using this one workflow. Last year, we had an RNA-seq part in this workshop, believe it or not. We had a half-day of RNA-seq in these two days. And this year, we separated it out as a separate workshop earlier in the week. And so what I did last year is we had an RNA-seq workflow that my postdoc, Emily Chaudin, had prepared. And so we worked through it in the lab. And so what I did here, just to show you that it is doable, is I included the workflow here. So you can actually look at it and play with it and use it if you want. The workflow is quite a bit bigger than what's seen in this canvas. And this is... It's one layout, and this is the same thing, just a different layout, a bit more spread out. But basically, it's the same type of information. A lot of the tools in here are not on the cloud, though. So this is yet another example where, if you want to run this tool, you have to go to one of the public servers that has these... So a public server that is specialized in RNA-seq would probably have all of these tools. And there's quite a few of them that are like that. There are some that are more geared to evolutionary phylogenetic analysis, so you have all the tools to do the phylogenetic analysis. So again, UPenn has a lot of these tools, but sometimes a workflow may take days to run at UPenn while it would take hours to run it on your own site. And so there's a whole range of things. Also, all of these workflows, you can share or publish. And so you can share. It means that you can e-mail it to your friend. Here's a URL where you can look at it.
Or you can publish it so that the whole world can look at it. There's lots of tutorials. Galaxy 101 is a good one. And then in the end, I want to just tell you about another project which actually uses Galaxy and other tools. And it's basically... GenomeSpace is another free resource from the Broad Institute that integrates a number of these tools. So it integrates ArrayExpress, Cistrome, Cytoscape, Galaxy, GenePattern, Genomica, geWorkbench, Gitools, IGV. IGV is from the Broad. So is GenePattern. So is GenomeSpace. And it has a few databases, and the UCSC Genome Browser. And basically it's a space that allows the output of one program to become the input of another program. And GenomeSpace ensures that all the tools can talk to each other, so you don't have to worry about that. And, unbeknownst to the user, but you should know because I'm telling you, all of this data in GenomeSpace is actually back-ended onto Amazon. Okay? So it actually gets stored in Amazon, and it's free because they have a grant from Amazon. It's free today. I'm not sure how long it's going to be free. But right now it's free. And it's totally transparent to you, because you don't log into Amazon or anything. You just load to GenomeSpace. You have to register in GenomeSpace, so it's secure within GenomeSpace. And in the same way you can share and so forth in GenomeSpace. But it is Amazon. So as Michelle mentioned, it's very important to do the survey. It's really critical to do the survey. We've done this workshop before, but it's the first time we're doing it in this format with these four or five modules. And why we're doing it this way is because we had recommendations from people last year to change the way we do it. So we take the feedback very, very seriously.
We actually have a meeting in the fall where all the faculty from all the workshops get together and we plan next year's delivery and so forth. And we debate all the comments from people from all the workshops. There's lots of Galaxy videos and screencasts, and obviously I invite you to play with your own data, and to register and try out GenomeSpace. It's free for now. So: galaxy.org and usegalaxy.org/cloud; the Twitter feed account, Galaxy Project; the tag, if you're using Galaxy or looking for people that are referring to Galaxy, is usegalaxy. There's a users' mailing list, and there's a developers' one as well. Biostar was mentioned a few times. It's biostars.org; biostar.org is something else, shoppers of some kind. And on Twitter there are Biostar questions. OpenHelix, which I didn't talk about, is actually a commercial venue that does help documentation, support and training for bioinformatics. And UCSC and Galaxy actually pay OpenHelix, so a lot of the stuff they have about Galaxy and about UCSC is free of charge. Some of the other things are not free of charge, and there's no support for other products. Also, the owner, or one of the senior people at OpenHelix, has a great blog with lots of very useful things. UCSC has their own space, their own Twitter, and some more tutorials. SEQanswers is another one, more for NGS.