Okay, so we are now recording. This session is being recorded, and the video will be posted online in the next day or two.
Welcome to the Galaxy webinar on connecting Galaxy with the NIH Sequence Read Archive. I'm Dave Clements, one of the training and outreach people for the Galaxy Project. Today we also have Marius van den Beek and Dan Blankenberg, who are going to join in. In addition, we have a number of people from SRA itself who are going to chime in as we go, correcting us or adding additional information.
We're not going to spend a whole lot of time in the slides, but the slides do have links in them, which makes them useful. The slides are at the URL at the bottom, bit.ly galaxy dash SRA dash slides, and they're Google Slides. So let's see. There we go.
Okay, so our agenda is short and sweet. We're going to introduce SRA, we're going to introduce Galaxy, and then we're going to talk about SRA and Galaxy together, and we're going to do that via a live demo.
Please ask questions as we go. There should be a Q&A icon at the bottom of your Zoom window. Click on that and type your question. Everyone can see your question. Whoever is not speaking will try to collate those and then throw them at the current speaker. So if your question doesn't get asked immediately, please be patient; it will be asked, time allowing. We can also cover questions at the end, but we're all big fans of questions as we go.
Okay, so when you registered, we asked whether there was anything you would specifically like to learn about in this webinar. We got a whole range of answers, some of which we're going to cover today and some of which we're not. We are going to talk about SRA and Galaxy, and we're going to focus on the parts of SRA and Galaxy that have to do with that interaction.
So everything in the "today" column we should be covering. The things in the "not today" column we could cover, but we don't have time. You can do all of these things in SRA or in Galaxy, but we're not going to discuss them today. We will give you some pointers on how to get more information about things like that. And lastly, we had one person, God bless you, whoever you are, who said "the meaning of life." I do like to think you could find that in SRA and Galaxy. Maybe we will, maybe we won't, but we'll try. Okay, so that's what we're going to cover.
So let's talk about SRA, the Sequence Read Archive. The first thing I'm going to try to do is a poll. Can I do a poll here? Oh, it's only giving me one poll, and I don't want that poll. Okay, so my first poll has gone awry already. Well, we're just not going to poll you guys. Sorry. What I was going to do is ask how many people have experience with SRA and how deep that experience is. I'm going to assume that you have a little bit of experience with SRA. I'm going to do the walkthrough of the SRA part, and Marius is going to do the walkthrough of the Galaxy part. I had another poll, or so I thought, to ask about Galaxy and your experience with that. But we're just going to plow ahead under the assumption that you have used SRA but may or may not be an expert.
So what is SRA? It's part of NCBI, which is part of NIH, and it's the primary archive of unassembled reads. It doesn't have final assemblies. It doesn't have reference sequences. What it has is reads, the things that come off the instruments. That distinction is worth keeping in mind. This is where you go when you want access to the source data behind reference sequences, the things that are in GenBank. This is where you find the evidence behind them.
Another point: all of SRA, every bit of it, is now on Amazon Web Services and Google Cloud. So it's all in the cloud. This means you can have fast access to it. It's a very powerful feature, and it's relatively recent.
One note: you will also hear it referred to as the Short Read Archive. That's its former name. It hasn't been the Short Read Archive for a long time, but I'm older than that, so occasionally I'll call it the Short Read Archive, and I apologize for that. It is not just for short reads. It has all kinds of reads. And SRA folks, again, please feel free to chime in and say, "Well, actually..." We'll see. Oops, wrong window.
Okay. So today we're going to spend our time in SRA in two things. One is Entrez and one is the SRA Run Selector. Entrez is a query interface that's used across a lot of NCBI resources, so you may already be familiar with it, even if you don't think you are. The SRA Run Selector is a newer method of interacting with SRA data. We're going to use both. They don't replace each other; they complement each other.
Okay. And I'm going to throw this slide at one of the SRA folks, or maybe I won't.
So I will. My name is Iris Kripchenko, and I wanted to let everybody know that SRA is going through some changes, and we would really love to have community feedback on the changes that we're thinking about making. We have a request for information open, so please go to the link on the slides and give us your feedback on our plans. I think it's very important to get the community's feedback to make sure that the things we're doing match your expectations and, in the long term, will help you perform science faster and better and provide more research opportunities for everybody. Thank you.
If you're part of the Galaxy community, that link is also available in the June newsletter, and we've tweeted it a couple of times on our Twitter feed.
We've actually retweeted it from NCBI, so you can find it on Twitter as well.
So let's talk about Galaxy. Again, a rough introduction before we dive into the actual hands-on part. Galaxy is a data integration and analysis platform, and it's funded and developed for life scientists. It's actually used for all sorts of things, but our goal is to be for life science. In addition to being that platform, we're a worldwide community, and we really can't stress that enough. The community is what makes Galaxy what it is. And it's not just the coders and the developers; it's the users, the people who do training, the people who provide infrastructure. It's a huge, very vibrant community, and without it Galaxy would probably not be around anymore. So let's stress both of those things. They're equally important, the platform and the people.
Okay. Galaxy is available in a whole bunch of different ways. There are over 100 free online web servers out there, something like 130 or 140, last I knew. If you follow that first URL at the bottom, you'll go to the platform directory. It's also available in commercial and academic clouds, so you can easily launch it on Amazon or, if you're in the US, on Jetstream. People have published it in Docker containers and virtual machines, so you can launch it on your laptop very easily. It's also open source software that can be installed anywhere. And that is the most common use case for Galaxy: it's installed all over the world at research institutions and universities, usually behind firewalls. That is our most popular deployment.
We also have a lot of training materials, and we wrote a tutorial to highlight how to use SRA, which we'll use later today. There is a huge library. Again, this is a community effort; it is entirely run by the community, publishing tutorials on how to use Galaxy for different domains. There are all sorts of topics there: epigenetics, genome annotation.
We also have tutorials on Galaxy basics, like data manipulation and the user interface. In addition, we have tutorials on how to write tutorials, which was very useful for the three of us in the last few days. And we also have tutorials on how to run a Galaxy server. So it's a very robust library of training materials, and something like 160 people have contributed to it so far.
So today we're going to do a live demo. We're going to use a COVID-19 example, but that domain doesn't actually matter. What we're demonstrating today is how to use SRA, and it doesn't have to be viral data that you're gathering from SRA. It can be anything. It can be RNA-seq experiments. It can be ChIP-seq experiments. You name it. If it's in SRA, we can get it into Galaxy with what we're going to show you today.
The rough plan is we're going to go from Galaxy to SRA and back to Galaxy to get metadata. Then we're going to run a tool that goes back to SRA to grab the actual data itself. And then we're going to run a short analysis in Galaxy that uses the SRA data.
Now, the second link at the bottom there may or may not work yet. I don't know. It works? Yes. Okay, so scratch that. It works. It's brilliant, because we're on the ball. The second link will take you to that tutorial. I'm not going to go step by step through that tutorial, but what Marius and I are doing today is basically what's in it.
Okay, a couple of caveats before we do this. SRA is this wonderful resource, a truckload of data, and it's growing rapidly, as you may know. But submitters often don't provide complete or correct metadata, and we may see that in some of our explorations today. There's also a discrepancy between SRR and ERR entries. Does anybody want to expand on what that is?
Yes. So actually, I don't know, maybe Yuri can correct me. But there are several places where primary data is deposited.
There is the Sequence Read Archive here at NCBI. There is also a read archive under the European umbrella, the European Nucleotide Archive. And sometimes, especially for SARS-CoV-2 data, because it's released essentially on a daily basis, they're not always in sync. So for example, if you're trying to pull ERR accessions, the European accessions, from SRA here, sometimes that does not work. So there is some discrepancy there. And also, in some cases, when you are trying to download really large sets of reads, thousands of runs, some of them may fail, statistically, for a variety of reasons, most often network connectivity.
So the tool we're going to use today is actually built to deal with that. I'm pretty sure we won't see that today, because we're using a small number of very small data sets. But the tool is built to be rerun and just get the things that failed.
Okay. So let's go back here and get out of this screen. And let's go to the Galaxy training site. I go there. What happens? Okay. Let me know if my screen share is no longer behaving. Is it still behaving? Okay, good. I'm going to zoom in just a tad. Let's do one more.
Okay, so this is the tutorial. It uses SARS-CoV-2, or COVID, as an example, but again, that doesn't actually matter for what we're talking about today. I'm going to scroll down. It's a typical tutorial: some leading questions, some objectives, the requirements, all sorts of things, and then the aim and the agenda. You'll see as we go through, there's lots of text, there are comments, there are tips, and there are also these hands-on boxes. So if you're already familiar with SRA and Galaxy, and you just want to learn how to do this, you can focus on just the hands-on parts of these tutorials.
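The "rerun and just get the things that failed" behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of the Galaxy download tool: the function `download_with_retries`, the stand-in `flaky_fetch`, and the accession strings are all hypothetical.

```python
from typing import Callable, Iterable


def download_with_retries(
    accessions: Iterable[str],
    fetch: Callable[[str], bool],
    max_rounds: int = 3,
) -> tuple[list[str], list[str]]:
    """Try every accession once, then rerun only the ones that failed."""
    pending = list(accessions)
    succeeded: list[str] = []
    for _ in range(max_rounds):
        still_failing = []
        for accession in pending:
            if fetch(accession):
                succeeded.append(accession)
            else:
                still_failing.append(accession)
        pending = still_failing
        if not pending:
            break
    return succeeded, pending


# Hypothetical stand-in for a real download: one accession fails on its
# first attempt (for example, a dropped network connection) and then works.
attempts: dict[str, int] = {}


def flaky_fetch(accession: str) -> bool:
    attempts[accession] = attempts.get(accession, 0) + 1
    return not (accession == "SRR0000002" and attempts[accession] == 1)


done, failed = download_with_retries(
    ["SRR0000001", "SRR0000002", "SRR0000003"], flaky_fetch
)
```

Accessions that succeeded are never fetched again, so a rerun over thousands of runs only pays for the handful that failed.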
If you're learning either or both of those, it might pay to actually read the tutorial, in which case it'll take you longer, but hopefully it contains some useful insight.
So what we're going to do, again, is go to usegalaxy.org. From there, we're going to go to SRA, explore SRA for a while, and then send data back to Galaxy. So let's do that. We're going to start out in Entrez.
Okay, so this is going to land us on usegalaxy.org, which is, I'm going to zoom in just a tad, the Galaxy server run by the project in the US. Again, anybody can set up their own Galaxy server, and there are over 100 public ones. This is the one that the Galaxy team in the US runs. It might be the largest one in the world, I think.
It's a three-panel interface. Over here, you have your tools. In the middle, once we get going, you'll see data sets and tools. We run our tools from the middle, and we also preview and view our data sets there. The third panel is the history. This is what we've done. So far it says, "Hey, Dave, your history is empty," and it says you can do stuff. So cool, I will do stuff.
In fact, what I'm going to do is get data from an external source. So the very first thing I do is come over here and say, okay, I want to get data from SRA. So I'm going to click on Get Data, and there's Download and Extract Reads. This stuff looks great, because this is what I want to do, and eventually I'll click on that. But right now what I want to do is go to the SRA server, because I don't yet know the details of what I want. I actually want to go to SRA and figure out exactly what I want. So I'm going to click on SRA server. And now it transfers me from Galaxy into SRA. I'm going to zoom in. Now, for right now, this feature is available on usegalaxy.org.
Support for this feature is in the 20.05 release, which has not yet come out, but its release is imminent. Once it's in the 20.05 release, we'll see it start to filter out to other Galaxy servers as well. But right now it's only on usegalaxy.org. What's happening behind the scenes is that when I said, "Hey, go to the SRA server," it went from Galaxy to SRA and told SRA which Galaxy server I came from. SRA is going to remember that, and in the end, when I ship stuff back, it knows which server to send it to. Now, again, as of right now, there's only one server, but eventually you'll be able to call this from any Galaxy server that supports SRA.
Okay. So we're in SRA. This is the Entrez interface, which may look familiar if you've used NCBI or SRA before. You can type anything here. I'm going to borrow something from Yuri, because Yuri likes dolphins, so we're going to search for dolphin. It comes back and gives us 720 results. We go down, and there's a lot of them. And that's cool. Okay, great, it's kind of working. And then I get kind of curious and think, well, I used to work on a kidney project, so I'd like to search for dolphin kidney. And note that it looks different now. There's only one match for dolphin kidney, and it's this one. So it didn't display it as a list. It just took me directly to it. And it might be pretty old, because it's 454. Well, that's cool.
So I've got dolphins, and I've got dolphin kidneys. If I type just kidney and go back, there we go, now I've got 124,000 entries, so a lot more than I did for dolphin. Dolphin, if I could spell. I have a mechanical keyboard, which is why you can hear me type. If I go back to dolphin, I get 720. So I typed dolphin, I typed kidney. I could also go into the advanced query form right here. This is a query builder.
If you're familiar with Entrez, it has a very sophisticated and powerful syntax for specifying exactly what you want, and you can restrict your search to particular fields. So author: I only want to see entries from this author. Or maybe from this BioSample or BioProject. Maybe I'm only interested in Homo sapiens. I only want to see things that are recent, or that are old. Who knows. There's a bunch of things you can put in here, and you can build an incredibly complex query. It will then put that into Entrez syntax, which is difficult to remember, but you can use this form to create it. So we could use this advanced form today to create our query.
Now, we're not going to. What we're going to do instead is search for SARS, if I can remember it, C-o-V-2. Sorry, SARS-CoV-2. That's what we actually want to run today. And we now have about 25,000 entries. We're on page one of 1,200 pages. So that's kind of a lot of data. We could narrow things down further here. We could say, okay, I want to see stuff from Egypt or from Thailand or from wherever. I could do that here. I'm not going to, because what I want to do is switch to the Run Selector interface.
The Run Selector complements Entrez, as I mentioned earlier, but it's a lot friendlier. So we're going to switch to that. I'm going to send all 25,000 of these, which is way too many to use in a tutorial, and probably way too many to use in any meaningful experiment that we're going to run, but maybe. So I'm going to click on Send results to Run selector right there. And this is going to take all 25,000 to a different interface. Okay, here we are. There are our runs.
Oh, something to note. When we were back in Entrez, we had a number of entries, and it wasn't this number. It was a different number. The reason that number is different is that in Entrez we're looking at SRA experiments.
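As a rough illustration of the fielded Entrez syntax described above, the sketch below builds a query string and the corresponding NCBI E-utilities `esearch` URL without sending any request. The helper names `field` and `build_term` are hypothetical, and the `[Organism]` field tag is an assumption about what the SRA advanced query builder would emit; check the builder for the exact tags available.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint; "sra" selects the SRA database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(value: str, tag: str) -> str:
    """Restrict a value to one Entrez field, e.g. "Homo sapiens"[Organism]."""
    return f'"{value}"[{tag}]'


def build_term(*clauses: str) -> str:
    """Combine clauses so that every clause must match."""
    return " AND ".join(clauses)


# Free-text word plus a fielded restriction (the field tag is an assumption).
term = build_term("kidney", field("Homo sapiens", "Organism"))

# Build the request URL only; nothing is sent over the network here.
url = ESEARCH_URL + "?" + urlencode({"db": "sra", "term": term, "retmax": 20})
print(term)
print(url)
```

Building the string in code is mostly useful for repeatable searches; for one-off exploration, the advanced query form does the same job interactively.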
And an experiment is a package that wraps a bunch of things, including runs. We're interested in runs, because that's where the sequence data is. So this number is different. It's usually going to be larger, because experiments can have one or more runs. And here it's a little bit larger. So don't expect this number to be the same, because we've now gone from experiments to runs. It's the same data. We're just seeing it in more detail. And it says, hey, we've got a lot of stuff here. We've got 1.8 terabytes, 3.9 terabases.
Something to notice: we have this tantalizing Galaxy button here, but it's not currently active. So our goal is to make that active and send.
Now, in Entrez, we could type in searches, which can get very sophisticated and narrow in on things. The Run Selector provides a faceted search interface, which is over here in the filters list. These items correspond to columns. If we scan this, we have a lot of metadata about all of our 25,000 runs. It goes over and over and over. That's a huge amount of stuff. A lot of these columns are reflected here, and they show up in the filters list if they are suitable for faceted search. Columns that are suitable for faceted search are of two types. One type has a limited number of discrete values. So let's see, assay type. Let's click on that. I just clicked on assay type, and it shows me down here that I've got one, two, three, four, five, six, seven, eight different values for assay type. Now, if assay type had 3,000 values, it wouldn't show up in this list, because the Run Selector can't build a meaningful interface from 3,000 values. There's a default limit, which you can control with this cog. You can say, well, I want more than 10, I want 15 or 20. I don't think you can say 3,000.
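The two ideas above, runs outnumbering experiments and facets being offered only for columns with few distinct values, can be sketched with a toy metadata table. This is a conceptual model only, not the Run Selector's implementation; the rows, accessions, and the `facets` helper are hypothetical.

```python
from collections import Counter

# Toy run-level metadata; two runs belong to the same experiment.
runs = [
    {"Run": "SRR0000001", "Experiment": "SRX0000001", "Assay Type": "AMPLICON"},
    {"Run": "SRR0000002", "Experiment": "SRX0000001", "Assay Type": "AMPLICON"},
    {"Run": "SRR0000003", "Experiment": "SRX0000002", "Assay Type": "RNA-Seq"},
    {"Run": "SRR0000004", "Experiment": "SRX0000003", "Assay Type": "WGS"},
]

# An experiment can wrap one or more runs, so the run count is at least
# as large as the experiment count.
n_experiments = len({row["Experiment"] for row in runs})
n_runs = len(runs)


def facets(rows, max_values=10):
    """Offer a column as a facet only if it has few distinct values."""
    result = {}
    for column in rows[0]:
        counts = Counter(row[column] for row in rows)
        if len(counts) <= max_values:
            result[column] = counts
    return result


# With a limit of 3, the unique-per-row "Run" column is not offered.
facet_table = facets(runs, max_values=3)
print(n_experiments, n_runs, sorted(facet_table))
```

The `max_values` argument plays the role of the limit behind the cog: raising it exposes more columns as facets, at the cost of longer value lists.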
But this allows us to say, I only want to see things that are amplicon. You can click on that, and now it's going to narrow this list. We went from 24,960 down to 21,000. Maybe I want to throw in RNA-Seq as well, and now that number goes up again. And then I decide, actually, I want to drop the amplicon. And now I'm down to 1,800 items. I can also do, let's see, I don't know what's in cell type. Let's take a look. So I clicked on cell type, and now we come down here, and I've got tracheal, I've got empty. And maybe I'm particularly interested in lung adenocarcinoma. Hopefully I pronounced that correctly. So I click on that. And now I've got the intersection of RNA-Seq and lung adenocarcinoma, and I've got 56 items which are both of those.
Now, before I clicked that, you had a lot of numbers here. These are subsets within this set. So if I unclick that, these numbers, do they change? Oh, they don't change. Okay, I take that all back.
So you can play with this, and you can make your queries as sophisticated as you want. Now, this is built as a query interface, but it's also a great interface for understanding the data. We've got 25,000 rows here. You don't want to spend hours going through the assay type column or paging through this to see what the different options are. You just want to click on assay type and say, what's in here? Assay type is an easily understood thing, even for me, a non-biologist, but there are also things like isolation source, host-associated. I don't know what that means. But I come here and I can maybe figure it out. Maybe I have enough background to figure out what isolation source, host-associated means. So there are going to be columns here where it's not exactly clear what's going on, and you can use the Run Selector to explore your data.
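The behavior seen in the demo, where adding a second assay type makes the count go up and adding a cell type makes it go down, is what you get when values inside one facet are combined with OR and different facets are combined with AND. A minimal sketch of that logic, using hypothetical rows and a hypothetical `apply_filters` helper rather than anything taken from the Run Selector itself:

```python
# Toy run-level metadata (hypothetical values).
rows = [
    {"Run": "SRR0000001", "Assay Type": "AMPLICON", "cell_type": ""},
    {"Run": "SRR0000002", "Assay Type": "RNA-Seq", "cell_type": "lung adenocarcinoma"},
    {"Run": "SRR0000003", "Assay Type": "RNA-Seq", "cell_type": "tracheal"},
    {"Run": "SRR0000004", "Assay Type": "WGS", "cell_type": ""},
]


def apply_filters(rows, selections):
    """Keep rows matching ANY selected value in EVERY selected facet."""
    return [
        row
        for row in rows
        if all(row.get(column) in allowed for column, allowed in selections.items())
    ]


# Two values in one facet widen the result (OR within a facet).
widened = apply_filters(rows, {"Assay Type": {"AMPLICON", "RNA-Seq"}})

# A second facet narrows it to the intersection (AND across facets).
narrowed = apply_filters(
    rows,
    {"Assay Type": {"RNA-Seq"}, "cell_type": {"lung adenocarcinoma"}},
)
print(len(widened), len(narrowed))
```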
So it's a much friendlier interface. Now today, since we're running a workshop, we're not going to do any of that. What we're going to do is get a very small sample. Let's see, I skipped ahead right there. We're going to take a very small sample, and we're going to just cut and paste these accessions into the found items search box. In reality, this is not what you're going to do with the Run Selector. The Run Selector exists so you can do sophisticated queries. But we're doing a workshop today, so that's just not going to happen. We need these four items. Now, these are preselected. Marius is going to explain them. Oops, sorry, wrong one. These are preselected to get us some geographic variation.
So let's go back here. I could paste the same string into Entrez, and I would get the same results. But from Entrez, I can't ship things to Galaxy. I have to go to the Run Selector first. That's the place to go. I'll clear that search. Yes, okay. So now we've got these four. And these four are hand-picked, again, for the tutorial. So again, you wouldn't do this in reality.
Note that Galaxy is still not highlighted. So how are we going to get Galaxy highlighted? In this case, I want all of these, so I'm going to click on that. If I only wanted a few of these, I would click on them individually, boom, boom, but I want all of them. Okay, no questions yet. Good. Well, that's not good, but no questions. So note that until we do that, the button is not highlighted. As soon as I select any of these, I can ship them to Galaxy. So that's what I'm going to do. SRA folks, is there anything that should be highlighted before we bounce out of SRA, out of the Run Selector?
No, I think you're good, but you might want to select those four individually.
Nope, sorry. Okay. So why is that, Yuri?
Oh, it just looks like you have the entire data set selected and not just the four.
When you look under the Selected tab, it looks like it says 24,000 are selected.
Wow. Okay. One, two, three, four. Thank you. Ah, okay. See, I didn't know that, Yuri. I could have had a disaster here, where we spent the rest of our lives waiting for 24,000 items to go. Thank you, Yuri.
It's just metadata, I think.
Yes. It wouldn't have been catastrophic. That's true. So Marius' point is that we wouldn't actually send 1.8 terabytes. We would send the metadata for 1.8 terabytes. Okay. Thank you, Yuri, for saving our day.
So with that, we're going to head back to Galaxy. Let's click on that. We head back to Galaxy, and a couple of things happen. We get that big green box, and it goes away. If you're not familiar with Galaxy, that big green box is necessary, but it's not sufficient. What it means is that we had a handshake between SRA and Galaxy. If you get data from UCSC or anywhere else, that's what that big green box means. It means we had a handshake and the handshake went okay. It does not mean we have your data yet. As we can see here, we're in Galaxy. We have an unnamed history, and it no longer says, "Dave, you've got nothing." It says, oh, there's this thing called SRA. That little clock icon says your request is queued. It hasn't started yet. Right now it's transferring the data. So we're actively talking back and forth between Galaxy and SRA, and the data is coming over. And now we actually have the data in Galaxy.
Should I take over here?
Yes, you should.
You need to unshare your screen.
Yeah, details, man. Okay. I'm going to stop sharing, if I can figure it out. Okay. I have stopped sharing. Marius. And people, please ask questions. We don't have any yet, which implies that we're perfect.
I've done the same thing that Dave just did. I've selected only three data sets, and that's what we have in the current version of the tutorial, which we have been changing.
So this is the metadata that we got from SRA. You see here a small preview that you can scroll. But of course, if you click on the eye icon, the data set will load. You see here that we have this first line, the header. And then the most important thing that we're interested in is the run accession. So this is the first column. And we have all the metadata that we already saw in the Run Selector. This is now within Galaxy, and we can see that these data sets come from three different countries. That is why I picked them.
So this is not yet the entire data set as it is in SRA. This is just the metadata. The next step is to download the data, and to do that, we will use the download tool. Let me make sure I'm following the tutorial. So, yeah, now, with this metadata file, we can download the actual data sets. Again, you can follow the tutorial, which spells this out a little more slowly. You want to use this tool. This left part, as Dave said, is the tool panel. Here are all your tools, and they are grouped into sections, but you can also search. The tool is called Faster Download.
There are multiple options for how we can download the data sets. If you already have the accession and you just need to download a single accession, you could type it in, like this one here, and it would download a single data set. That is also useful if you know the accessions because, for instance, you're redoing the analysis of a paper. But here, we want to focus on the data sets which we selected. So for that, you click here on the select input type, and we select list of SRA accessions. Since this is the only data set we have in our history, it is automatically selected. Otherwise, you would be able to click here and choose the data set that you would like. So we can run that.
So what happened here? This is the history. In the history, data sets appear as they are being generated.
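To make the shape of that metadata file concrete, the sketch below pulls the run accessions out of a small table with a header line and the accession in the first column, as described above. The sample text, its column names, and the comma delimiter are assumptions for illustration; inspect the actual file in your history before relying on any of them.

```python
import csv
import io

# Illustrative stand-in for the metadata file: one header line, then one
# row per run, with the run accession in the first column.
sample_metadata = (
    "Run,Assay Type,Country\n"
    "SRR0000001,RNA-Seq,Country A\n"
    "SRR0000002,AMPLICON,Country B\n"
    "ERR0000001,AMPLICON,Country C\n"
)


def run_accessions(text, delimiter=","):
    """Return the first-column values of every non-empty row after the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader)  # skip the header line
    return [row[0] for row in reader if row]


accessions = run_accessions(sample_metadata)
print(accessions)
```

A plain list of accessions like this is the kind of input the download step works from: one accession per run to fetch.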
Gray means that the data set is new and Galaxy has sent the job off to the computing facilities. That's happening now. As soon as they are running, they will be orange, and then they will turn green. All right, so we've set this off.
Okay, so we're going to do variant analysis, and something can only be a variant if we have the corresponding reference. So we need to download the reference data. Typically, you would fetch the reference data from the NCBI resources, where they host the reference genome. But for this tutorial, we've also uploaded the reference sequence to Zenodo, which is an archive for scientific data. You can get the URL from the tutorial.
We received this first data set through the SRA server itself. But if you want to upload any other data set, this is the main way to upload data into Galaxy. This opens this modal here, and there are multiple ways to add data sets. You can upload from your computer. You can upload files via FTP. Or, if it's simple data or URLs, you can just paste them. In our case, we can paste the URL here. We can give it a name. We don't need to set the type to FASTA, because Galaxy can detect that this is a FASTA file, but I'm just going to set it here. And then we can start. So, as with the SRA data sets that we're currently downloading here, we're now also downloading the reference genome.
So the download is finished. There is a log file. We found three times two data sets, which makes sense, because the data sets that we selected are paired-end data sets. They were generated with Illumina sequencing. So here we are able to see that these are paired data sets. And if we look in the output of the SRA download tool, we see that they are in here. I have to say one more thing. This is a normal data set. This is a collection. Collections are groups of data sets bundled together. We can see here that we have these three different accessions, and we can go down within this collection.
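The "three times two data sets" above can be pictured as a nested structure: a list of accessions, each holding a forward and a reverse file. The sketch below models that with plain dictionaries. It is not Galaxy's internal representation, and the `ACCESSION_1.fastq.gz` / `ACCESSION_2.fastq.gz` naming and the accession strings are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <accession>_1 is forward, <accession>_2 is reverse.
PAIRED_NAME = re.compile(r"^(?P<acc>[A-Z]+\d+)_(?P<mate>[12])\.fastq\.gz$")


def build_paired_collection(file_names):
    """Group file names into {accession: {"forward": ..., "reverse": ...}}."""
    collection = defaultdict(dict)
    for name in file_names:
        match = PAIRED_NAME.match(name)
        if match is None:
            continue  # not a paired-end file under the assumed naming scheme
        role = "forward" if match["mate"] == "1" else "reverse"
        collection[match["acc"]][role] = name
    return dict(collection)


files = [
    f"{accession}_{mate}.fastq.gz"
    for accession in ("SRR0000001", "SRR0000002", "ERR0000001")
    for mate in (1, 2)
]
collection = build_paired_collection(files)

# Three accessions, two files each: six data sets in total.
print(len(collection), sum(len(pair) for pair in collection.values()))
```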
And we can go deeper and see that we have a forward and a reverse. If you're familiar with paired-end sequencing, this should make a lot of sense. We can also take a quick look. This is a compressed file, but Galaxy knows how to display these compressed files. And if you've ever seen a FASTQ file, this should be familiar. You have your header lines, your sequence, and the encoded quality scores.
Okay. So now we have the data sets and we have the reference genome, so we can continue the analysis. One thing that is important to do in this type of analysis in general is to trim off sequencing adapters. These don't contain biological information, so we need to get rid of them. For this, we can use a tool. We have many different tools, and again, there are tutorials on this. In this case, we will use a tool called fastp. We have paired-end data, so we need to select paired. And we have a collection, so we need to say paired collection. There's only one paired collection in my history, so it has been preselected. Otherwise, I would pick the data set here.
So, Marius.
Yeah.
We have a question from Pablo Cataldo. Should we download and process data from single and paired ends separately in Galaxy?
Yeah. So it is not necessary. If this were a mix of single-end and paired-end data sets, you would get a paired-end data set here and a single-end data set there. In practical terms, we often separate them so that we can run workflows, which are a series of analyses. We first pick the paired-end data sets and then the single-end data sets, because they may need different parameters. But in principle, you can download single-end and paired-end together. And it doesn't have to be Illumina. It can be anything that SRA has. So also Nanopore, I am told. All data sets. Okay. So, good question. Thank you.
So fastp generates this paired-end output, and it generates an HTML report as well.
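For readers who have not seen a FASTQ file, the sketch below parses the four-line record layout mentioned above (header, sequence, separator, quality string) and decodes the quality characters. It assumes the common Phred+33 encoding and a simple file with exactly four lines per record; the read shown is made up.

```python
def parse_fastq(text, offset=33):
    """Yield (read_name, sequence, quality_scores) for each 4-line record."""
    lines = text.strip().splitlines()
    for start in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[start:start + 4]
        # Each quality character encodes one score: ord(char) - offset.
        scores = [ord(char) - offset for char in quality]
        yield header[1:], sequence, scores


# Made-up record: six high-quality bases followed by one low-quality base.
example = "@read1\nGATTACA\n+\nIIIIII#\n"
records = list(parse_fastq(example))
print(records)
```

In a paired-end collection, the forward and reverse files each contain records like this, with matching read names at matching positions.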
Now, I didn't follow the tutorial. That's not good. In fact, what we should have done is run fastp on the collection. And then we would like to output a JSON report, because after we're done with the analysis, we will use it. So we get the HTML report, which is beautiful, and we get a JSON report, which is machine-readable and which we'll use later on to generate a bigger report. We need to select the correct input data set as well. We had selected the fastp output. Not a problem. Galaxy is very patient. It's very good. And the approach here is that once you know what you are doing, you can generate a workflow so you don't need to think about this.
All right. So the first run here has completed, but we're missing the JSON report. We can already take a look at this, though. We see that there's a quick overview of what the trimming has done. There are some statistics. Most of the bases are really good. This is the quality, and the more high-quality bases we have, the better. And again, there are more extensive tutorials on how to interpret this kind of data.
All right. So after the trimming, we can align the sequencing reads back to the genome to see where they align. This is the basis for the actual variant calling. For the alignment, we'll use a tool called BWA-MEM. Here we need to select a reference genome. For typical model organisms, we provide indexes of genomes that you can use. But in our case, we're using SARS-CoV-2. And maybe you're working with a non-reference genome, something that is not a model organism. So you can also choose to build a custom index, and we're going to do that. Here you say that we are going to select the reference genome from the history. This is the FASTA file that we have. Well, we have one FASTA file in the history, so it is preselected. We have a paired collection, and we want to use the fastp output. Here we have multiple data sets that fit, but we want to use the last fastp run.
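The statement above that "the more high-quality bases we have, the better" can be made concrete. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q30 means roughly one wrong base in a thousand, and reports of this kind typically summarize the share of bases at or above such a threshold. The sketch below computes that share from a quality string; it assumes Phred+33 encoding, and it is an illustration, not fastp's code.

```python
def phred_to_error_probability(score):
    """Convert a Phred quality score to the probability the base is wrong."""
    return 10 ** (-score / 10)


def fraction_at_least(quality_string, threshold=30, offset=33):
    """Share of bases whose decoded quality is at or above the threshold."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(score >= threshold for score in scores) / len(scores)


# Made-up quality string: eight Q40 bases ("I") and two Q2 bases ("#").
q30_fraction = fraction_at_least("IIIIIIII##")
print(q30_fraction, phred_to_error_probability(30))
```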
And that is all we need to do here. There are many more options. So again, Galaxy typically has all the options that are available on the command line. But we also try to keep it simple, so we have some presets. Actually, BWA-MEM also has the same presets. So that's great. We can run the tool now. And another thing to point out, I skipped ahead here, is that we ran a download of three accessions. And we're doing this for each of the three data sets because this is a collection. The tool execution runs for each individual item in the collection. So if you ever think, well, this is too much clicking in Galaxy, put things in a collection and everything will be done for each element in the collection. And it also automatically keeps the name. So we can see, for instance, here, this is still the accession we had previously. So there's no way that we can mix these up. Another thing that is great about Galaxy is you don't need to wait until the job has run and finished and is green to continue. So if I'm fast enough, I can just take this. So the next step after alignment is removing duplicate reads. So these are PCR duplicates or optical duplicates. For this we're going to use the MarkDuplicates tool. So while that runs, there have been several questions in the Q&A that either aren't directly relevant to what Marius was talking about at the time, or are about something we talked about earlier. Those have been answered by text, so be sure to check that out. Sorry, Marius. Okay, thank you. And again, keep the questions coming. All questions are welcome. All right. So one thing I also skipped ahead on, and again, there are tutorials on this, is that you have multiple ways to use data sets. You can have a single data set, like the initial data set. This data set is just one single entity. This is how you would run on a single entity. You can run on multiple of these single entities, or you can run on a collection.
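The idea of running a tool once per element of a collection, with the element names carried through, can be sketched as below. This is a conceptual illustration only, not Galaxy's implementation or API: `map_over_collection`, `fake_trim`, and the sample names are hypothetical.

```python
from typing import Callable, Dict, Tuple, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def map_over_collection(tool: Callable[[In], Out],
                        collection: Dict[str, In]) -> Dict[str, Out]:
    """Apply `tool` to every element and keep each element's name."""
    return {name: tool(element) for name, element in collection.items()}


def fake_trim(pair: Tuple[str, str]) -> Tuple[str, str]:
    """Stand-in for a real tool: just relabels the forward and reverse files."""
    forward, reverse = pair
    return (forward + ".trimmed", reverse + ".trimmed")


# Hypothetical paired collection: element name -> (forward, reverse)
collection = {
    "sample_A": ("sample_A_1.fastq", "sample_A_2.fastq"),
    "sample_B": ("sample_B_1.fastq", "sample_B_2.fastq"),
    "sample_C": ("sample_C_1.fastq", "sample_C_2.fastq"),
}

trimmed = map_over_collection(fake_trim, collection)
for name, pair in trimmed.items():
    print(name, pair)
```

Because the output is keyed by the same names as the input, each result can still be traced back to the accession it came from, which is the property highlighted in the demo.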
So whenever I'm running these things in parallel, I'm using this little button here. And you see it's already preselected here. And we don't need to wait for the data set to finish. We can already hit execute while the previous tool is still running, but I am too slow to actually get there. So Galaxy is faster than I can explain, which is a good thing. So the only setting here that we need to change from the default is that we want to remove the duplicates right away. Typically we would keep the duplicates, but have the variant caller take care of not overestimating the number of reads. But here we will say yes, and then we can run. All right, after this step, we can generate some basic statistics about the alignment. And for this, we're going to use samtools stats. And note that I am usually not using the sections as they are here, because Galaxy has many, many, many tools. You can find them by topic, by section. But if you're following the tutorial, it's much easier to just search for them. Okay, so this time it seems like I still have some time to actually run the tool. So we will select the deduplicated output. Set coverage distribution to no. Output is a single summary file. We don't want to filter anything. We don't need a reference sequence. All right, good. Okay, so the next step is realigning the reads. So often when you have insertions or deletions, there are sometimes equivalent options for the aligner to place the read on the reference genome. And to have a common set of coordinates, there is a realignment that one can do. And this is especially important when calling insertions and deletions. So for this, we will use LoFreq's realign reads tool. And because it's going to do this realignment, we need the reference genome. And as I said, if we didn't prepare a reference genome already, you can always select history as the source. This is our reference genome. That is all there is to do.
And then we need to do an additional step before we can go on to call our variants, which is to insert indel qualities. So we do that with the insert indel qualities tool from LoFreq. So again, we have a simple data set collection. And maybe I can just show that when you only want to take a single data set out of a collection, you can actually go into the collection here and then choose individual data sets. We want to process all of our data sets uniformly, so we're not going to do this. These are the options that are prepared for this addition of the qualities. And with that, we can finally do the actual variant calling, with the LoFreq call variants tool. So you see that LoFreq has multiple discrete steps in the preparation of the input data sets. You can find more information on the COVID-19 portal, where we describe in detail why we've chosen to use LoFreq for calling variants on the SARS-CoV-2 sequencing data. As I mentioned, you always need to select your correct input data set here. Galaxy will preselect the newest appropriate data set. So in our case, this file is correct. Now we need to set the correct options. Again, we need to provide the reference genome from the history. We're going to call the variants across the entire reference, but there is an option to specify just a subset of the regions. We're going to call SNVs and insertions and deletions. We're going to set some additional options. We're going to say that we want a minimum coverage of 50, meaning sites that are covered by fewer than 50 reads will not be called. And for the base-calling quality, we're saying that the minimum quality should be 30 for both reference and alternate bases. So this setting is important to make sure that you're not calling variants from ambiguous alignments. And we're going to say that we want to set the minimum mapping quality. So again, these are settings that need some additional explaining, but that's beyond the scope of the tutorial for today. All right, so that is the actual variant calling.
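To make the coverage and base-quality thresholds a little more concrete, here is a toy sketch of how such cutoffs act on the reads covering a single site. It is not LoFreq's algorithm, which uses a statistical model; the function name, the pileup representation, and the numbers are all assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def alt_allele_fractions(pileup: List[Tuple[str, int]],
                         ref_base: str,
                         min_coverage: int,
                         min_base_quality: int) -> Optional[Dict[str, float]]:
    """Return the fraction of each non-reference base at one site.

    `pileup` holds one (base, base_quality) pair per read covering the site.
    Returns None when the site is too shallow to be considered at all.
    """
    if len(pileup) < min_coverage:
        return None  # not enough reads: the site is skipped
    # Ignore low-quality base calls before counting
    kept = [base for base, quality in pileup if quality >= min_base_quality]
    if not kept:
        return None
    counts = Counter(kept)
    return {base: n / len(kept) for base, n in counts.items() if base != ref_base}


# Made-up site: 40 good reference reads, 10 good and 5 poor alternate reads
site = [("C", 35)] * 40 + [("T", 36)] * 10 + [("T", 12)] * 5
fractions = alt_allele_fractions(site, ref_base="C",
                                 min_coverage=50, min_base_quality=30)
shallow = alt_allele_fractions(site[:20], ref_base="C",
                               min_coverage=50, min_base_quality=30)
print(fractions, shallow)
```

In this toy example the five low-quality T calls are dropped, so the alternate fraction is computed from 10 of 50 retained bases, and the 20-read site is not evaluated at all.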
The next step after the variant calling: so the variant caller will output a VCF file, which is a standard format for describing variants, and you can visualize them in a genome browser. So Galaxy has integrations for IGV, for Trackster, Galaxy's built-in genome browser, and for JBrowse, which is also a very easy tool to handle. But our time is not sufficient for that today. Okay, so the next step is that we will run SnpEff, which is a tool that looks at the variant and predicts what effect the variant actually has. Okay, and there are two SnpEff versions. There is a special version just for SARS-CoV-2, which takes care of the multiple overlapping reading frames that are in the SARS-CoV-2 genome. So we're going to use this one. As input, we need to select the VCF file we just generated. It is in VCF format. The genome is already correct. We want to create a VCF again. Sorry, let me check. Yes, VCF is correct. We will also create a CSV, which we can later use to generate a summary report. We will not add any upstream or downstream intervals. This is an option that is more appropriate when you are looking at genomic variants from non-viral sequences. Use these two options. And then the other options, we don't care about these things. Okay, so SnpEff describes the effect that the given mutation has. And we can open a VCF file now, just to give you an idea of how this looks. You may have seen this before. So you have your chromosome, or your contig. You have a position. You have some optional things. So you can give variants an ID, but we're not doing this. And then you have the reference base and you have the alternate base. You have a quality column. You can apply filters, and in case you did apply some filters, you'll have either pass or fail. This is an optional column, so you may also have just a dot, which denotes an empty optional column. And then you have various information. So DP stands for depth. AF is the allele frequency.
So that is the fraction of the total reads that support the alternate base. And then you have the type of variant and some additional information. But this doesn't really tell me yet what effect it has. Is it an amino acid change? Is it a synonymous variant? Does it introduce a stop? So for this, we've used SnpEff. So you can see right here, it does add this information within this INFO column here. But again, that is not super easy to read. So now we can use SnpSift, which generates a table with just the information that we need. So we can select the fields here. There are some additional fields that are interesting that we want to extract. These are given in the tutorial. So this is the last step that generates data. And now we're going to summarize what we've generated so far. For this, we'll use MultiQC. It's a great tool for generating these kinds of reports. So we have results from fastp, which was, well. Okay, so even copy-pasting is not always fruitful. But another thing: when things go wrong, we want to understand what is going on. So this is very nice. I can actually show you that if I don't know what is going on here, I could hit this button here to submit a bug report. And somebody from our team will respond and let you know what went wrong. So you can say, hello, I tried to do this, but I don't know what happened. So we receive this report. We look at it, and we try to help you where we can. It also already says here that there was an exception. So you see here that HROM was not found, so that is clearly my problem. So we actually don't need the help. And in fact, we can use the rerun button. This is a very cool feature in Galaxy. So we store all parameters in the database for reproducibility reasons. So if something goes wrong, or you want to change one parameter and keep the other parameters the same, you can hit this rerun button. And so all I need to do here is change HROM to CHROM. And then we can run it.
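Returning to the VCF columns walked through a moment ago, a single data line can be pulled apart with a few lines of Python. This is a minimal sketch, not how SnpEff or SnpSift work internally: the example line is made up (only the reference name NC_045512.2, the SARS-CoV-2 reference accession, is real), and `parse_vcf_line` is a hypothetical helper that handles just the eight fixed columns.

```python
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    info_dict: Dict[str, Any] = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue  # a dot marks an empty column
        key, sep, value = entry.partition("=")
        # Entries without '=' are flags, stored here as True
        info_dict[key] = value if sep else True
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": None if var_id == "." else var_id,
        "REF": ref,
        "ALT": alt,
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt,
        "INFO": info_dict,
    }


# Made-up example line; the values are illustrative only
example_line = "NC_045512.2\t241\t.\tC\tT\t1000\tPASS\tDP=200;AF=0.85"
variant = parse_vcf_line(example_line)
depth = int(variant["INFO"]["DP"])
allele_frequency = float(variant["INFO"]["AF"])
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"],
      depth, allele_frequency)
```

Extracting named fields like this into a plain table is the kind of tidying that makes the result easier to read and to load into a spreadsheet.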
We can go back to the other tool now. So back to MultiQC. So we have a fastp report. And there's only one data set that is compatible, so it's very easy to select this. We also have a samtools report from samtools stats, which is this data set here, and we have a SnpEff report, which is the data set here. Okay, so now MultiQC is running. We can have a look at the tables. So this is considerably more friendly than the VCF. This is your regular table that you could download and put in a spreadsheet editor, or you can view it here. So you see here which data set this is. I'm only opening the first one, but you can see that there is a varying number of lines in each data set. And this corresponds to the number of variants that we found. You see here the impact. So for instance here, some low-impact mutations, variants where the mutation is silent, and we have a synonymous coding mutation. Or we have some things that are more severe. But typically the most severe things have lower allele frequencies, so in the sample they were quite rare. Okay, so let's see. So the MultiQC report is done. It gives us a global overview of what we've done. And we can just go here and collapse the panels to get more space. So here we have some general statistics. You can see various things here. So for instance, one data set has a high level of duplication. So perhaps it was sequenced very deep. We can also see that not 100 percent of the reads mapped. So this might indicate that maybe there are some human reads within this mixture. There are few proper pairs, which again might highlight that human reads have been artificially removed. And we can see where mutations fall. We can see the impact that mutations have. We can see whether we have frameshift or nonsynonymous mutations. So let's keep in mind this is an unfiltered list. And typically we do some additional filtering. Okay. We can see how many reads we have.
So this is the largest data set, this is the smallest data set. Okay, are there any additional questions to answer? And I think we can go back to the slides. Thank you, Marius. There were a lot of questions while you were talking. Most of them have been answered in the Q&A. We currently have one open question from Pablo: in the trimming step, does Galaxy automatically identify all sorts of adapter sequences that may be present, or should we specify them? If you know them, it's always better to specify them explicitly. If you don't know them, it does recognize the most common sequencing adapters. But if their prevalence is relatively low, it may not find them. So that is... I just want to add here that Galaxy doesn't identify anything. What does the identifying is the tools that it has. So in this particular case, we're using fastp, which automatically recognizes some types of Illumina adapters and trims them. There's also Trimmomatic, which is in Galaxy, which we didn't use in this example. There is Trim Galore. And particularly in the case of SARS-CoV-2, if you're using amplicon data, these data sets have specific primers that are used for amplifying and enriching, in particular, SARS-CoV-2 sequences. And so you have to know them. And there are tools for dealing with them specifically. There is the iVar tool set for cleaving amplicon primers off of these reads. It's also in Galaxy. We're not showing it as a part of this example, but it's there. But in order to use it, you need to know the adapter sequences that you're using. Thanks, Anton. Another question, from Tusharika: is it possible to form plots in Galaxy which show us which are the common genes present in each sample? Anybody want to take that? In relation to SARS-CoV-2 or in general? I'm not sure. I doubt any of us understands exactly what you're doing in the audience here. So in general? In general. Okay, so it is a bit difficult to answer that from a general perspective.
So if you want to see, for instance, which genes have a mutation, you could generate a tabular file, and then there are tools in Galaxy to plot tabular data. There are visualizations where you can display one axis versus the other. These are relatively simple tools. If you do know how to use RStudio or Jupyter, you can load the data that you generated and then use RStudio or Jupyter within Galaxy to get at that question. And we also have a wrapper of ggplot, which also lets you display different columns in different colors. So we don't necessarily have something specifically for genes, but as soon as you put that into a table you can generate plots. Thank you, Marius. So we have one more outstanding question, but in the interest of time I'm going to move on. Is he too sure he got it? I'm going to show some resources on the last slide that you can follow up with, or if any of the panelists want to answer that by text, feel free. So thank you, Marius. That was wonderful. Let's go back and I'll see if I can share my screen. Oh, I feel like I'm in the 20th century, 21st century, sorry. Okay. Okay. So these are the resources I was referring to. We have a bunch from SRA and a bunch from Galaxy. Again, these slides are available online and we'll also post links to these slides. Well, I don't know about SRA, but we certainly will from Galaxy. So if you have questions about SRA, that's the email at SRA. There are lots and lots of resources at SRA and NCBI in general. Really a ton of stuff. There are some links to this in the tutorial itself. There is also guidance about how to submit data, which is a very common question. For Galaxy, there are a number of resources. Galaxyproject.org is the community hub. It's the website about all things Galaxy, and in theory it links out to all parts of the Galaxy ecosystem. There are a couple of places to ask questions. Galaxy Help, help.galaxyproject.org, is a Discourse-based online Q&A forum. It's quite popular, quite friendly.
The community is also quite friendly. If you prefer a chat interface, we use Gitter for that, and that's what the Gitter link takes you to. You can ask questions there. That works well for a lot of give and take and for questions that you can state in a short way. If it takes a long time to describe your question, then Help might be more useful. I mentioned earlier there are over 100 Galaxy instances that are publicly accessible. Here are three of them, which are the big ones: usegalaxy.org, run by the US team, usegalaxy.eu, run by the European team, and usegalaxy.org.au, run by the Australian team. There are also more of these in the works. There's a Use Galaxy Belgium, for instance, that's out there. France, Spain, Southeast Asia and Taiwan are all in various steps of coming along. The last thing I want to highlight is our annual conference, which is at that URL. We are co-locating with BOSC this year, the Bioinformatics Open Source Conference. It's really, well, it's online, obviously. It's at the end of July, and it's really, really affordable, and no matter where you are in the world you can attend, because we are mirroring our content in both hemispheres. So we're hosting out of Toronto and we're hosting out of Eastern Australia as well. So 12 hours apart, everything twice each day. So last slide, thank you. We'd really like to thank our NCBI collaborators: Yuri, who's on the call, as are Lydia and Rivender. I don't know if Kurt and Sergei, sorry if I'm not pronouncing that correctly, are on the call or not, but they were both instrumental in getting this to work. We'd like to thank NIAID, NHGRI and NSF, because they provide us with funding. And last but certainly not least, the Galaxy community. And again, without them we would not be who we are. And Dave, if I might just finish. So again, we're very grateful to SRA for initiating this. And SRA provides, I just lost my Zoom interface here. Yes, so SRA provides open data.
And what we're adding here is open tools and open infrastructure to analyze this data. Because if any of you on this webinar decide to download 10,000 SRA data sets, you will be able to do that. And you will be able to map them and to analyze them. And so this will be happening somewhere, and this is one of the reasons why we have NSF there, because NSF funds national supercomputing infrastructure, and Galaxy runs out of the Texas Advanced Computing Center, which allows any one of you to analyze your large data. So I want to thank them. And again, I want to thank SRA for initiating this. This is in its early stages, and we're probably going to have some problems. And so for you users, it's important to let us know what these problems are so we can fix them. Thank you. Thank you, Anton. And thank you all, last but certainly not least again, for sticking with us for the whole 75 minutes of this call. And we look forward to hearing from you on those contact and support channels. Okay, and that's it. I'm going to stop the recording.