Welcome to the walkthrough part of this tutorial. We're going to start with a quick schematic overview to get a general sense of what we'll be going through, and we'll refer back to it a couple of times during the tutorial to keep track of where we are in the process. The basics of what we're going to do today: first we'll access the metadata from SRA, transform it into a searchable format, and run a search on it; that's the set of steps over here on the left. The result will be a list of runs we're interested in, and we'll use that in the second step to download the data objects from the cloud into Galaxy. We've got two examples: we're going to download the aligned read objects and convert those to FASTA in Galaxy, and we're also going to download the VCF files and show that you can run your tools of choice in Galaxy; we'll use SnpEff as an example. I'll also talk briefly about how, if you wanted to start in the cloud in BigQuery or Athena, you could run your queries there and get a run list out. That would replace the whole download/transform/query set of steps on the left, and you could take that list of runs and use it to directly import these data into Galaxy as well.

But we will start here in Galaxy. I'm at usegalaxy.org; you can use whichever public instance of Galaxy you'd like, they should all work fine for this. I've got a nice fresh history over here on the right side. If you need a fresh history, you can click the plus button up here to create one. I'm going to name mine SARS, but you can name it whatever you'd like; the name isn't actually important for the tutorial.

The first thing we're going to do is download an index file. I mentioned that these metadata files are available in the cloud; they're split up into smaller pieces to make them more efficiently queryable there, and they're updated every day. The names contain the date on which they were generated, as well as a number representing which piece they are. So we download an index file first, which tells us what the most current, up-to-date set of these files is, so that we can download them.

Go to the upload data button up here and click it, then go to the tabs up top and click on "Rule-based". Make sure that "Upload data as: Datasets" is selected and "Load tabular data from: Pasted Table". Then go to the written tutorial, where the text you need to paste in is provided. Click the copy button, click in this box, and press Ctrl-V; we've now pasted in a set of URLs, where each URL is the path to one of these index files for one of the metadata tables. Now click "Build". This brings up the rule builder for uploading datasets. From here, click on the Column menu down here and select the top option, "Basename of Path of URL"; "From column" is set to A, which is correct, so click Apply. Then go to the Rules button and click "Add / Modify Column Definitions". This is where we tell Galaxy what these columns represent. You can see that the first thing we did was split the individual file names out into a separate column, which gives Galaxy a unique identifier for each of these.
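As an aside: these index files, and the metadata pieces they list, are just public files, so you could reproduce the whole fetch-the-index, build-URLs, download pattern we're about to walk through in Galaxy from any environment. Here's a minimal Python sketch of the idea; the URLs below are placeholders, not the real NCBI paths, so copy the actual ones from the written tutorial.

```python
# A rough preview of the pattern we'll build in Galaxy over the next few
# steps: fetch an index file, turn each listed piece into a full URL,
# and download it. The URLs below are placeholders, NOT the real NCBI
# paths -- copy the actual ones from the written tutorial.
import urllib.request

INDEX_URL = "https://example.com/sra-metadata/contigs/index.txt"  # placeholder
PREFIX = "https://example.com/sra-metadata/contigs/"              # placeholder

with urllib.request.urlopen(INDEX_URL) as resp:
    # One metadata file name per line, e.g. "contigs.20210430.00042.json"
    filenames = resp.read().decode().split()

for name in filenames:
    urllib.request.urlretrieve(PREFIX + name, name)  # download each piece
```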
Back in the rule builder, we tell Galaxy what these columns represent. Click the "Add definition" button and go down to "URL"; column A is automatically selected here, and that's correct, this is our URL, so leave it as is. Then click "Add definition" again, and this time choose "Name". Note there's also something called "Name Tag"; that's not the one you want, you want "Name". Select that, and it's column B, the file name that was split out over there. Those are the only two definitions we need, so click Apply. You can see our two rules are set up here; if I mouse over one, it highlights the column it refers to. Now click Upload. It's okay to just close this box.

Over here on the right in our history, we have a set of files queued up for download. This usually won't take very long; they're just small text files, each of which is really just a list of all the metadata files for, say, the contigs metadata table or the annotated variation metadata table.

One thing I will note is that these files are updated daily, usually around 5:30 pm Eastern time. So if you're doing this around that time, you could hit a kind of race condition: you download the index file and start the import process, but if the files on the server get updated between now and when you actually download them, their names will all have changed and they won't match the index file you downloaded a few minutes ago. So if you are doing this around 5:30 Eastern, I'd say wait: give it a few minutes, go make yourself some tea, do something else, and come back a little later just to make sure you don't encounter that. In the tutorial we have a link to a time zone converter if that's helpful.

You can see these have all downloaded; they turn green over here on the right. If we click on one of them, on the view data icon, you can see it's just a list of file names, each containing the date on which it was generated and then an integer piece number.

Now that we have the list of file names we're interested in, we're going to actually download those files. To do that, once again go to the upload data button and to "Rule-based". This time, up here, we're going to upload data as Collections, and load tabular data from a History Dataset. Click that, and the one we're going to choose is the contigs file list; we're going to import the contigs table. Select it, and it shows us the contents: the file names, like we expect. Click "Build", and once again we're back at the rule builder for uploading collections. First, click the Column button and click "Fixed Value". The value you want to put here is the first part of the URL for downloading these files; it's in the tutorial, so I recommend you just copy and paste it, which is what I've done here. Click Apply, and you can see it's showing up in column B now: the prefix of the URL we want to download. Now go to the Column menu again and click "Concatenate columns". The first column to pick is column B; that's the beginning of the URL.
The second one is column A, the actual file name. As you can see, we now have a third column, which is columns B and A concatenated together, and this is a URL. I can copy and paste one of these, and if I paste it up here into a browser, you can see it downloads the file right there. These are just URLs you could download in a browser or from anywhere; if you want these files, they're freely available. But we're going to put them into Galaxy.

To continue that process, go to the Rules button, click it, and go to "Add / Modify Column Definitions". Click "Add definition" and select "URL"; this is going to be column C, the column we just created, which is our URL. Then click "Add definition" one more time and select the first option, "List Identifier(s)", and select column A. This is the file name column, which gives a unique identifier to each of these. Then we need to enter a name for our new collection; I'm going to call it contigs.json. Click Upload, and again I'm going to close this.

You can see we have a new history item over here on the right, and this is going to download each of the files in this full set. Then we'll have them in our history and be able to work with them. Going back quickly to our overview: the first thing we did was download the index file, and the step we're currently doing is downloading the JSON metadata files themselves. Since these are split up for efficient querying in the cloud, we get a group of these files; you saw there were about 45 there, I think. The next step is to combine them into a single JSON file, so that we can work on it in Galaxy all in one piece and look at all the metadata together.

If we come back here, we can see that my download jobs have finished. It can take a few minutes for these jobs to finish; under certain circumstances it might take longer, depending on the load on Galaxy and other factors, but it's usually not too long. We can see it's just a collection of JSON files, and if you click on one of them, on the little view data icon, we'll see the contents, which are just rows of JSON data.

So now we're going to concatenate these into one single file. You can click the tool links in the tutorial if you want to go straight to the tool we're talking about, but I'm going to type "concat" up here in the little tool search bar. This is one to be careful about, because there are several concatenate tools that do various things; we're going to use "Concatenate datasets tail-to-head (cat)". Click on that, go to "Dataset collections" here, click that little button, and our contigs.json is there, so select it. Then just click Execute. This takes the files in that collection and concatenates them into one big JSON file, which we'll use for further processing. I'm going to go ahead and rename this dataset: click the edit attributes button right here, which brings up the ability to rename it; I'll call it contigs single JSON and click Save. So here we've got our concatenate job running, and once that completes, we'll proceed to the next step. That tool is called jq; I'm going to type "JSON" up here in the tool search bar and find "jq process JSON".
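Before we run it, it's worth knowing what this step amounts to: the jq filter we're about to paste simply flattens each JSON metadata record into one tab-separated row. Here's a rough Python illustration of the same idea; the field names are made up for the example, since the real contigs table has its own schema, and the actual filter to use is in the written tutorial.

```python
# A rough illustration of what the jq step will do: flatten each JSON
# metadata record into one tab-separated row. The field names here are
# made up for the example; the real contigs table has its own schema,
# and the actual filter to use is in the written tutorial.
import json

lines = [
    '{"acc": "SRR0000001", "name": "contig_1", "length": 29903, "coverage": 512.3}',
    '{"acc": "SRR0000002", "name": "contig_1", "length": 21000, "coverage": 87.0}',
]

fields = ["acc", "name", "length", "coverage"]
for line in lines:
    rec = json.loads(line)
    print("\t".join(str(rec.get(f, "")) for f in fields))
```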
Clicking on that brings up the jq tool, and you can see our concatenate job over here on the right has finished and turned green. For the JSON input we want the contigs single JSON we just created, and then under "jq filter" I recommend you copy and paste the filter listed in the tutorial, which I'm pasting here. All this filter is really doing is taking everything in the JSON and pulling it out so it can be converted to a TSV format. So set "Convert output to tabular" to Yes, then click Execute. This converts the JSON into a tab-separated format that we can then query within Galaxy. I'm going to rename this one too: click the edit attributes button, and I'll call it contigs.tsv and click Save. The warning here is okay; the name will still be updated over here once the job completes. This step can take a little while, since we're converting a large amount of JSON, so if you want to go make some tea or coffee for a couple of minutes and come back, this is a good time for a little break.

Checking the overview: the step we just finished was combining everything into a single JSON file; right now we're converting that JSON to TSV; and finally we'll transform the TSV into a SQLite database and run a query on it to get our run list, the list of runs we're interested in that we can actually use to get the data.

Going back to Galaxy, my conversion has now completed, and we're on to the next step. We're going to use a tool called Query Tabular; it says "Query Tabular using sqlite sql". Click on that, and it brings up this screen. The first thing to do is click "Insert Database Table". Under "Tabular Dataset for Table", our contigs.tsv is what we want to use, and it's selected correctly. Then go down to "Table Options" and click it, which opens a new set of options. There are a few things to fill out here. We need to specify a name for the table; I'm going to call it sars_contigs. You can call it whatever you want, but it's important that what you name the table here matches what you call it later in your query; if you just copy and paste the query we've provided, it will work. We also need to specify column names, which tell it what each column represents; here again I recommend you copy and paste these directly from the tutorial. Just click here, press Ctrl-V, and we've got our column names all listed.

Scrolling down a little further, there's the "Save the sqlite database in your history" option, which defaults to No. If you think you're going to run more queries on this database later today, I would select Yes, because that will let you skip this conversion step; there's a link in Galaxy that will let you run those queries directly. If you think you'll want to run queries again in, say, a week, you'd probably want to go through the import process again to get the newest set of metadata, but saving the database in your history lets you query it more efficiently in the future without going through all these previous steps again. I'm going to leave it as No for now. The last field is the SQL query to generate tabular output. We've got a query for you in the tutorial, so please just click the copy button there and paste it here.
Basically, what this query does is select distinct runs (so we're excluding duplicates) from this contigs metadata table, looking for contigs with a length greater than 20,000 and a coverage greater than 100x. Then there are a couple of other filter and sort options to make it a little more reproducible for people doing this tutorial in the future, and we're doing a LIMIT 10, which just gives us 10 results because we're only demoing this. If you were doing this for real with your own search, you would probably exclude the limit and get the full set of runs that match your search criteria; but we're just doing 10 of them. For "Include query result column headers", click No, because all we're getting out of this stage is a list of runs. Once all of those are set, click Execute. You can see it appears over here on the right, and now I'm going to click the edit attributes button and rename this to "run list", then click Save. This is another step that can take a little while to run.

One thing I'll note while it runs: we're going to take this run list and show you how to import the VCF files and the aligned read objects. But suppose what you really wanted from a run list like this was just to dump FASTQ with the originally submitted quality scores, and that's all you care about. You can take the run list and use it with the SRA toolkit in Galaxy to do that. I'm not going to do it, but I can show you how: if you type "SRA" up here in the tool search bar, it brings up some of the SRA tools in Galaxy, and you could use "Faster Download and Extract Reads in FASTQ". If you click it, you'll see that for the input type you would select "List of SRA accessions, one per line"; that's exactly what we're producing with this run list, so you could just feed the run list to it, click Execute, and it would create that FASTQ with original quality scores.

One other thing I'll mention: I said before that you can query this metadata directly in the cloud. In Galaxy we converted JSON to TSV, and we're currently doing the transform to SQLite and running the query; the result of this whole set of steps is the run list, and the run list is what lets us do the further download and import steps. You don't have to do this part in Galaxy. We're showing you how for ease, but you could do it in the cloud, in BigQuery or Athena, where you wouldn't have to transform the metadata at all; it's already there and ready to query. You can just query it there, get a list of runs out, and use that list to start directly at the next step, the import. We've got a couple of minutes while this process completes, so I'm going to show you a couple of examples of how to do that.

Here I'm showing the web interface of Google's BigQuery platform. If you're interested in doing either of the things I'm about to show you, searching in the cloud in Athena or in BigQuery, we have step-by-step guides to get you started in our NCBI documentation, and there are links to those in the references section of the tutorial. Once you have a login and are up and running with the data, you can just type in your query; we have example queries on the website as well if you're new to SQL.
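To make the Query Tabular step we just ran a bit more concrete, here is the shape of that query as a small, self-contained Python/SQLite sketch. The table and column names are stand-ins for illustration; use the exact query from the written tutorial in Galaxy.

```python
# The shape of the Query Tabular step: load the TSV into SQLite, then
# select distinct runs whose contigs pass the length/coverage filters.
# Table and column names here are stand-ins for illustration; use the
# exact query from the written tutorial.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sars_contigs (run TEXT, length INTEGER, coverage REAL)")
conn.executemany(
    "INSERT INTO sars_contigs VALUES (?, ?, ?)",
    [("SRR0000001", 29903, 512.3), ("SRR0000002", 18000, 250.0),
     ("SRR0000003", 29800, 130.5), ("SRR0000001", 25000, 110.0)],
)

query = """
    SELECT DISTINCT run
    FROM sars_contigs
    WHERE length > 20000 AND coverage > 100
    ORDER BY run
    LIMIT 10
"""
for (run,) in conn.execute(query):
    print(run)  # one accession per line -- this is the run list
```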
There are a number of free SQL tutorials online that are pretty good, and SQL is actually not that hard to learn if you don't have much experience. I wouldn't let that be the thing that prevents you from trying this: you can get a basic understanding quite quickly, and it won't stop you from doing pretty powerful queries.

Here we're just going to run a really quick query looking for all the data with a collection date after the start of this year. The query is typed in here in the window; I just click Run, and it starts. These queries perform quite quickly, as you can see; they're optimized for running in the cloud. In the preview results shown here, you can see all the data fields that come back for each row: the geographic location, the sample collection date, and so on. I mentioned that there are some more complex data fields, and there's a sample of that here: these key-value pairs. They're hard to query elsewhere, since they don't have a clean analog in a standard SQL database, but they're not too hard to query in the cloud; there's specific syntax for doing it, and we have examples on our website for unnesting these and searching all the sample information in here. You can see a lot of other information too, like the sequencing platform, and some library construction information: is it amplicon, etc. So you can get all of these data out of BigQuery.

I'm also going to show you quickly what Amazon's Athena site looks like; it's pretty similar. You type your query into the little query box here and click "Run query" to let it go. Here we're looking at the geographic location information and getting a count of all the samples in the SARS-CoV-2 set by where they were collected. You can see the results down below: 300,000 from the United Kingdom, 75,000 from the USA, and on down. So if you're specifically interested in one of these locations, you can drill down further and query for just that specific set of data, or you can combine this with the time of collection and say, okay, I want to compare time A to time B, or see how things have changed over a time period. You can get a lot of information just by querying the metadata, even before looking at the data itself.

I mentioned that you could do your queries here instead of in Galaxy, export the resulting list of runs, and use it directly for your import. So here's an example: if I run this query, we're just looking for samples whose geographic location is South Africa, and you can see we get a run list here. If you click this little "download the results in CSV format" button, it downloads the list to your computer, and you can insert it into the next step of our analysis in Galaxy and start from there. Back on the BigQuery page, when I have a result down here that I want to save (this one isn't a list of runs, but if it were), I could click the "Save results" button. That brings up the option to save it locally, as a CSV file or in other formats as well; click save, and it downloads to your computer, and you can use it in Galaxy to import at the next step.
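If you'd rather script this than use the web console, here's a hedged sketch of the same kind of query using Google's google-cloud-bigquery client from Python. The dataset, table, and field names below are my assumptions about the public SRA metadata table; check NCBI's cloud documentation for the exact schema before relying on them.

```python
# A sketch of querying SRA metadata programmatically with BigQuery
# (pip install google-cloud-bigquery; needs a configured GCP project).
# Dataset, table, and field names are assumptions based on NCBI's public
# SRA metadata tables -- verify them against NCBI's cloud docs.
from google.cloud import bigquery

client = bigquery.Client()

# Count SARS-CoV-2 samples by collection location, like the Athena demo.
sql = """
    SELECT geo_loc_name_country_calc AS country, COUNT(*) AS n
    FROM `nih-sra-datastore.sra.metadata`
    WHERE organism = 'Severe acute respiratory syndrome coronavirus 2'
    GROUP BY country
    ORDER BY n DESC
"""
for row in client.query(sql).result():
    print(row.country, row.n)

# The key-value sample attributes mentioned above live in a repeated
# field; in BigQuery you reach them with UNNEST, roughly like:
#   SELECT m.acc
#   FROM `nih-sra-datastore.sra.metadata` AS m, UNNEST(m.attributes) AS a
#   WHERE a.k = 'geo_loc_name_sam' AND a.v = 'South Africa'
```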
And now, back in our Galaxy interface, we can see that our run list is complete; the query has finished running. If I click on view data quickly, the result is just 10 run accessions, because we specifically limited it to 10 for purposes of this example. Looking quickly at our schematic again: we've completed this step and have our run list, and now we're going to download some aligned read objects, get the contig information out of them, and convert it to FASTA.

The first step to doing that, back in the Galaxy interface: once again go to the upload data button, click it, and go to the Rule-based tab (this should be getting familiar by now). For "upload data as" select Collections, and load tabular data from a History Dataset; here we pick the run list we just generated, and the preview shows our run accessions, so we click Build. Here in the rule builder, we're going to make a couple of changes. Go to the "+ Column" button and click the "Using a Regular Expression" option. It gives us three choices up here; select "Create column from expression replacement", which opens two boxes below. I recommend you fill these in by copying and pasting from the tutorial: we put the regular expression here, and then the replacement expression, again copied from the tutorial, here. Once those are done, click Apply. You can see it creates a second column, which is a constructed URL that will give us the SRA aligned read object for each of the runs we're interested in. Again, this is a public URL: you could plug it into a browser or wget it from wherever you are, and you should be able to download the file freely.

But to import these into Galaxy, go to the Rules button and click "Add / Modify Column Definitions". Click "Add definition" and select "URL"; we pick column B, because that's the column with our URLs in it. Then click "Add definition" again and select "List Identifier(s)", and use column A for that; this is, again, the unique identifier we're assigning to each of these objects. Click Apply, and if we mouse over the rules, we can see they apply to the correct columns; that's good. Finally, we need to give this collection a name; I'm going to call it sars path. Click Upload, and we can close the dialog box. This is now grabbing those files from the cloud and downloading them into our Galaxy instance.

One thing I do want to note is that there can be some lag in the availability of the aligned read objects and the VCF files, specifically for newly submitted data: the metadata may be present in the online tables while the actual aligned read object or VCF hasn't yet been generated, since there can sometimes be some delay in that processing. So if you try to download one of these files and get an error, then generally, if you wait a little while, say 24 hours, and try again, it will succeed. If for some reason it doesn't and you're having trouble getting a file you want, you can always write to our help desk and we'll be happy to assist with that.
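As an aside, here is roughly what that "create column from expression replacement" step is doing, sketched in Python. The capture pattern and URL template below are placeholders, not the real paths; copy the actual expressions from the written tutorial.

```python
# Roughly what the "create column from expression replacement" rule does:
# a regex capture plus a replacement template turns each run accession
# into a download URL. The pattern and URL template below are
# placeholders, NOT the real paths -- copy the actual expressions from
# the written tutorial.
import re

runs = ["SRR0000001", "SRR0000002", "SRR0000003"]
for acc in runs:
    url = re.sub(r"^(SRR\d+)$", r"https://example.com/aligned-reads/\1", acc)
    print(url)
```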
The SRA help desk email address is in the references section of the tutorial if you need it. We can see that this is still downloading over here; it doesn't usually take all that long. These are pretty small files for the runs we're interested in, since they don't contain quality scores and since they compress well against the reference they're aligned to.

The next thing we're going to do here (not quite the final thing, actually) is use the SRA toolkit on these. I previously searched for the SRA tools and the results are still over here, so go down to "Download and Extract Reads in FASTA/Q" and click it, which brings up this tool. For the input type, we want "SRA archive in current history", because these aligned read objects are just SRA archive objects; the format is exactly the same as other SRA data. Then go to "Advanced Options" and scroll all the way to the bottom. There's a field called "Table name within cSRA object"; cSRA is a somewhat older name for SRA archive objects. Here we type REFERENCE, in all caps. This tells it that we specifically want the reference table, which in this case holds the contigs. Then we click Execute.

And I see I forgot to tell it which SRA archive in our current history to use, so click the dataset collection button here. It automatically brings up sars path, the last thing we made, which is the set of aligned read objects we imported; that's the one we want to run on. Our REFERENCE table name is still entered, so now we click Execute, and this is added to our history over here. So this job is dumping those contigs for us.

One option you have here as well: if what you really wanted, instead of these contigs, was to get the reads out in FASTQ format, and you're okay with placeholder quality scores rather than the submitted ones, you can just leave that "Table name within cSRA object" field (where we typed REFERENCE) blank, and you'll get the reads out in FASTQ format. So that's another option you have here that we're not actively showing.

These dumps generally happen pretty quickly, and if we click the view data button for one of them, you can see these are the contigs, and they're in FASTQ format, because the tool we're using is a FASTQ dumper. So we're going to do one last step to turn these into true FASTA format. To do that, go back to my history here and search "FASTQ to FASTA" up here. This gives a lot of results, so scroll down a little, or you can of course click the tool link directly in the tutorial. There are a couple of FASTQ-to-FASTA converters; the one we're going to use is "FASTQ to FASTA converter". I click that, then I click dataset collection. I didn't rename the contigs we just dumped (you can rename it as you like, but I'll just leave it for now), so I select that collection and click Execute. This is now going to convert it into true FASTA, so you'll just have the FASTA sequences of those contigs. This is also generally a pretty quick step. If I click on this, we'll see if any of them are done yet; I click view data... not quite yet.
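While we wait: the conversion itself is simple enough to see in full. A minimal sketch, assuming plain four-line FASTQ records, just keeps each record's header and sequence and drops the separator and quality lines:

```python
# What the FASTQ-to-FASTA step amounts to: keep each record's header and
# sequence, drop the '+' separator and the (placeholder) quality line.
# Assumes plain four-line FASTQ records with no line wrapping.
def fastq_to_fasta(fastq_path, fasta_path):
    with open(fastq_path) as fq, open(fasta_path, "w") as fa:
        while True:
            header = fq.readline()
            if not header:
                break  # end of file
            seq = fq.readline()
            fq.readline()  # '+' separator line
            fq.readline()  # quality line (placeholder scores here anyway)
            fa.write(">" + header[1:].rstrip() + "\n")
            fa.write(seq)

fastq_to_fasta("contigs.fastq", "contigs.fasta")
```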
Once you're at this step, you can use the FASTA in your workflow as you would any other FASTA or contigs you had generated yourself. One thing I will note is that we don't generally claim these are the best possible assemblies that could be made. That's not to say we think they're bad; it's just that with an automated process like ours, no single set of parameters will be completely optimal for all samples. What we're really hoping to let you do is see what's in there, to get an idea of the data before you go through the process of assembling it all yourself, which can be somewhat costly both in money and time. This is designed to give you a basic handle on what's there, and you may find that once you've gotten a sense of it, you want to go back and do your own assembly with a specific set of parameters; that would be a perfectly reasonable way to use this.

Let's see: some of these are done now, and we've got FASTA format for these contigs. The contig names correspond to the contig names in the metadata files from earlier, which contain the coverage, length, and other parameters of the contigs, so you can cross-reference against those tables if you need to. Looking again at our overview schematic: we downloaded the aligned read objects and then used the SRA toolkit within Galaxy to extract the FASTA contigs from them.

What we're now going to show is how to get the VCF files for these same runs. Of course, if you wanted to, you could run a new query (perhaps against a different metadata table) to pick out the VCF files you're interested in, but in the interest of time we're just going to use our existing run list. So, back in the Galaxy interface, once again go to the upload data button and the Rule-based tab, which still has "upload data as Collections" and "load tabular data from History Dataset" set. We're going to use the run list again, the same list of runs, to import our VCFs, so click Build to get to the rule builder for uploading collections.

Here, click on the Column button, and click the "Using a Regular Expression" option as we did previously, then "Create column from expression replacement". Here again I recommend you copy and paste the regular expression and the expression replacement from the written tutorial, which is what I'm doing, and click Apply. That makes us a new column with the full URL for these VCF files, which, again, you could download by whatever method you like; but to import them into Galaxy, click on the Rules button and go to "Add / Modify Column Definitions". Add a definition, click "URL", and set it to column B, the column we just constructed for our URLs. Then add a definition again, select "List Identifier(s)", and select column A, the unique identifier for each of these. Click Apply, and we have our rules. As we've done a few times now, we need to give this collection a name; we're going to call it VCFs.
Click Upload, and we'll close this box. That's going to go out and download those VCFs. At this stage, once you've downloaded the VCFs and imported them into Galaxy, you can do whatever you want with them; you could insert them into your own standard pipeline or processes. Here, just by way of example, we're going to run SnpEff on them, as a demonstration of something you can do, but really we expect you to run your own pipeline or use them as you normally would.

If I search "SnpEff" up here, we get a couple of options; there's more than one SnpEff in Galaxy, and you want the one that specifically annotates variants for SARS-CoV-2, so I click on that. Under "Sequence changes" we click the dataset collection option and select VCFs, the collection we just made; this is our set of 10 VCFs, which is exactly what we want to use. I'm going to leave the rest of the options as they are (you could certainly adjust them to suit your own interests or usage), including producing summary stats, which will create a second item in our history holding those stats. Then I click Execute. That adds these two items over here: one will be the SnpEff-annotated version of these VCFs, and one will be the summary stats. If we click on the VCF item we imported and click view data, you can see these are just VCFs, and you can work on them as you would normally expect to.

Going back one more time to our workflow overview: we downloaded the VCFs and are now running your tool of choice, in this case SnpEff, on them. From there, you can take them and use them as you like in Galaxy. If I go back out here, this is still busy running, getting close to done, but you can see we've got our output VCFs here as well.

And so this concludes this part of the tutorial. If you have any questions or comments you'd like to write to us about, the SRA help desk email address is in the references part of the written tutorial; feel free to send questions, comments, or updates you'd like to see, that kind of thing. If you have general questions about NCBI resources, you can write to info@ncbi.nlm.nih.gov, and that address is also down in the references, so you've got both of those. Let us know what you think. Thanks.