Hello and welcome to the first intro video of this Galaxy training session. My name is Jonathan Trow and I'm a curator at the Sequence Read Archive, which is part of NCBI, which is itself part of the National Library of Medicine here in the United States. And today I'll be telling you about some new data formats that we have available in the cloud from SRA, as well as some accompanying metadata that can help you search and filter that data to find what you're actually interested in, specifically for SARS-CoV-2 runs in SRA.

The learning objectives for this intro video are to gain an understanding of which data and metadata types are available for SARS-CoV-2 runs in the cloud; to get a basic knowledge of what the SRA aligned read format is (it's our new format), what you can do with it, and what kinds of data formats you can get out of it; and then to give you a working knowledge of the SRA metadata that is available in the cloud, the ways you can access it, and some popular use cases, you know, why you would actually want to access it.

So one reasonable place to start here is: why have SRA data in the cloud at all? One answer to that is that the public part of the SRA archive is now well over 18 petabases of sequence data that's available for free download and analysis by anyone, which is great. But it's a lot of data that can be hard to search, which is not so great. I want to say things like: I wish it were easier to find SRA data based on the organismal contents of the reads. Or, I just wish it were easier to search SRA based on the submitted project and sample data; doing that in Entrez can be tough sometimes. Or, I wish I could get a sense of what could be assembled out of an SRA data set before downloading it and putting that effort in, or that there were some kind of shortcut toward assembly and variant calling. And so the data we've made available is a direct response to these wishes and requests.
So that brings me to what data is in scope for this training. This is public SRA data that contains SARS-CoV-2 sequence. We do have searchable metadata for all of SRA available, and we have documentation about that, but I'm not talking about that today. I'm specifically talking about SARS-CoV-2 data. But even more specifically, this is data generated on the Illumina platform only. We do plan to do this analysis for long read data in the future, so stay tuned for that.

This raises the question: how do we determine which runs contain SARS-CoV-2 data? The short answer is that we run the SRA Taxonomy Analysis Tool (STAT) on it. I'm not going to go into an in-depth explanation of how that works, except to say that it uses a k-mer based approach to organismal content analysis. And we're running it on all of our incoming data to get an idea of what kinds of sequences, and what organisms and clades, that data is from. If you want to know more specifics about how this works, we've got a link to a preprint that gives a more in-depth explanation. We have some more explanation on our website as well, and the link to the preprint is there too. And if you want to see that, we have a link to it in the references portion of the written tutorial.

So I've mentioned now the SRA aligned read format. What is that? I told you that all incoming SRA runs are scanned for coronavirus content (really, scanned for all types of content). But in particular, if we find any runs that have at least 100 hits to coronavirus, are from the Illumina platform, and have an average read length of at least 75 base pairs, then that run is selected for further analysis. We then make contigs out of those reads using guided assembly against the SARS-CoV-2 RefSeq record, and this creates FASTA contigs which you can access.
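The run-selection rule just described (at least 100 coronavirus hits, Illumina platform, average read length of at least 75 base pairs) can be sketched as a simple predicate. This is purely an illustration of the stated criteria; the function and argument names here are hypothetical and not part of any NCBI tool:

```python
# Hypothetical sketch of the run-selection rule described above:
# Illumina runs with >= 100 coronavirus k-mer hits and an average
# read length of >= 75 bp are picked for guided assembly.

def selected_for_assembly(platform: str, coronavirus_hits: int,
                          avg_read_length: float) -> bool:
    return (
        platform.upper() == "ILLUMINA"
        and coronavirus_hits >= 100
        and avg_read_length >= 75
    )

# An Illumina run with 250 hits and 150 bp reads qualifies;
# a long-read run does not, regardless of hit count.
print(selected_for_assembly("ILLUMINA", 250, 150.0))         # True
print(selected_for_assembly("OXFORD_NANOPORE", 500, 900.0))  # False
```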
In addition, if contigs are successfully created, the reads are mapped back to those contigs, and you can access these raw reads themselves in FASTA or FASTQ format, or you can dump them out aligned to the contigs in SAM format. One thing I will note is that dumping in SAM format doesn't work right now in Galaxy; we hope to have that working in a future update. But if you're running the toolkit locally on your own system, you can dump SAM out of these objects.

And once that's complete, taxonomy is assessed via two methods. First, the contigs are assessed again using the STAT detection tool, and those results are made available for search in our metadata tables. And second, the contigs are checked via megablast against the nucleotide BLAST database, and the results of that BLAST analysis are also made searchable and available in our metadata tables. I'll be talking more about that in a little bit. Finally, variants are called and then annotated with VIGOR3. This results in VCF files, which you can also download, and we're going to show you how to do that. I'm not talking about the specifics of how these VCF files are made, but if you want to know that, we've got the specific pipeline and the parameters that are used on our website. You can find that through a link in the references section, again, of the written portion of the tutorial.

So the SRA aligned read format is really a set of coronavirus contigs that have been constructed from SRA data, organized by run accession. These are compatible with the SRA toolkit; they're compressed data objects that can be operated on by all the normal SRA tools. They allow you to access the reads themselves in FASTQ or FASTA format, the contigs in just FASTA format if you only want the contigs, or the reads aligned to the contigs in SAM format, whichever option works best for your personal use case.
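If you're running the toolkit locally, the dump options just described map onto SRA toolkit commands such as `fasterq-dump` (FASTQ or, with `--fasta`, FASTA) and `sam-dump` (reads aligned to the contigs). The sketch below only assembles and prints the command lines against a placeholder accession rather than running them; check the written tutorial and your local install for exact usage:

```python
# Sketch: command lines for pulling each format out of an SRA aligned
# read object with a local SRA toolkit install. The accession is a
# placeholder; we only print the commands here instead of running them.
ACC = "SRR0000000"  # placeholder run accession

fastq_cmd = ["fasterq-dump", ACC]             # reads as FASTQ (placeholder qualities)
fasta_cmd = ["fasterq-dump", "--fasta", ACC]  # reads as FASTA
sam_cmd = ["sam-dump", ACC]                   # reads aligned to contigs, SAM format

for cmd in (fastq_cmd, fasta_cmd, sam_cmd):
    print(" ".join(cmd))
```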
In addition, these have a smaller size than standard SRA format. This is both because we're using compression against the reference, and these reads align quite well, but in addition to that, they have no quality scores; the original submitted quality scores have been removed. So this makes them very small, quick to work with, quick to dump. However, if you dump FASTQ, you still get standard four-line FASTQ format, but at each quality position you have a placeholder quality. If you need access to the original qualities, those haven't been lost; they can still be accessed from the normal SRA data using the SRA toolkit. So you can always dump FASTQ with the original quality scores using the standard toolkit on the original run.

So we've got the contigs and the reads; those are the aligned read objects. We also have available metadata that surrounds these objects, and I'm going to give you a brief overview of that here. There's user-submitted metadata: this is the biological sample and sequencing library information that was supplied at the time of sequence submission. We've got contig metadata, which is stats about the contigs: coverage, taxonomy ID, contig length, that sort of thing. And we've got the taxonomic content analysis of those contigs, which lets you do things like search for records based on k-mer hits at your taxonomy level of interest. We've got the BLAST results for those contigs; I mentioned that they were blasted, so you can search these results by hit accession, hit length, percent identity, bit score, that kind of thing. Variant calls are also available and searchable, as well as that VIGOR3 annotation information. There's also peptide information, so the details of annotated peptides, including the sequence, if that's the way you want to go. And there's more than one way to access this metadata; it's available in the cloud.
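To make the placeholder-quality point concrete: dumped records are still standard four-line FASTQ, so any FASTQ parser works, you just shouldn't interpret the quality line. A minimal parsing sketch (the `?` placeholder character and read names here are made up for illustration, not necessarily what the toolkit emits):

```python
# Minimal FASTQ parsing sketch: the dumped records are standard
# four-line FASTQ, but every quality character is a placeholder
# (shown here as '?', purely for illustration).
from io import StringIO

sample = StringIO(
    "@read1\nACGTACGT\n+\n????????\n"
    "@read2\nTTGGCCAA\n+\n????????\n"
)

def read_fastq(handle):
    """Yield (name, sequence) pairs, ignoring the placeholder qualities."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # placeholder quality line, discarded
        yield header.lstrip("@"), seq

reads = list(read_fastq(sample))
print(reads)  # [('read1', 'ACGTACGT'), ('read2', 'TTGGCCAA')]
```

Remember that if you do need real quality scores, you dump from the original run with the standard toolkit instead.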
It was originally designed specifically for querying using cloud services. You can do this using Google's BigQuery or Amazon's Athena. That's one way. This allows you to analyze the SRA metadata directly using standard SQL syntax, and it allows searches that just aren't possible in Entrez and are quite fast. One advantage of this method is that there's no need to manually transform the data for querying; that's already been done, so you can be up and running in just a few minutes. And there's built-in support for complex data fields. A few of the fields in these metadata tables are essentially nested arrays, which don't have a clean analog in a TSV format or a standard SQL database, so you can't easily query these in Galaxy. But in the cloud there's built-in support for that, so you can un-nest these and query them with standard syntax. One downside is that you do pay for the data that's scanned, although some free options exist. BigQuery, for instance, gives you some free data every month that you can use to run queries, and Amazon's Athena has some credit systems for research and education through which you can get some free time as well. So that's one way: you can access it natively in the cloud.

However, the underlying metadata files are also available for download at no cost. This is an anonymous download, and there are no egress charges. The data is in JSON format, so you can download it and transform it according to your preferences, however you like. And that's what we're going to show you in this training coming up: you can import these directly into Galaxy and query them there.

Some popular use cases. What are some types of things you can do with this metadata? You can do quite a bit, but for instance, you can find SARS-CoV-2 contigs with specific coverage and length characteristics. We're going to show that in one of our examples.
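As a plain-Python sketch of that kind of coverage-and-length filter on the downloaded JSON metadata: the records below and the field names (`run`, `coverage`, `length`) are invented for illustration, so check the real table schema before adapting this:

```python
# Sketch of filtering downloaded contig metadata (newline-delimited
# JSON). The records and field names are made up for illustration.
import json

ndjson = """\
{"run": "SRR0000001", "name": "contig_1", "coverage": 812.5, "length": 29801}
{"run": "SRR0000002", "name": "contig_1", "coverage": 12.3, "length": 4512}
{"run": "SRR0000003", "name": "contig_1", "coverage": 455.0, "length": 29903}
"""

records = [json.loads(line) for line in ndjson.splitlines()]

# Keep runs with near-full-length, well-covered contigs
hits = [r["run"] for r in records
        if r["coverage"] >= 100 and r["length"] >= 29000]
print(hits)  # ['SRR0000001', 'SRR0000003']
```

The same filter can be expressed in SQL against the cloud tables, or done inside Galaxy after importing the metadata, which is what the tutorial demonstrates.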
You can filter SARS-CoV-2 contigs based on the geographic location where the sample was collected. You can filter via BLAST hits for contigs from a specific geographic location; so if you care about one location in particular, you can look at what type of BLAST hits the samples from that location get, and how that changes over time, for instance. So if you wanted to filter for all variants called from samples collected in the United States or somewhere else, after a certain date or between certain dates, or you wanted to compare different time points over time, you can do that. In addition, you can also filter for specific mutations; say I want to filter for all runs with the specific E484K mutation. You can do that too. So you can mix and match these options to get a variety of filtering and search abilities that you just don't have in Entrez.

And so in summary, I'm going to briefly reiterate the types of data that are available in the cloud right now. We have this COVID-focused SRA data set that's available both in Google and in Amazon. In the tutorial, we're going to do some downloading from Google and some from Amazon, but the data set is available in both places. We have the searchable metadata with the different tables I talked about. We have the SRA aligned read files themselves; these are the files that have contigs and reads aligned to those contigs. There are the VCF files, which are the variant calls based on those contigs. We have the full SRA runs; this is what we would call normal or regular SRA data, and it has the full submitted quality scores. And then we also have the original-format submitted files, so the FASTQ or BAM file that was actually sent to us before we transformed it into our format. You can download those original files too if you want to.
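The mix-and-match filtering described above, location plus date range, can be sketched like this. The field names loosely follow common BioSample attribute names (`geo_loc_name`, `collection_date`) but are illustrative, and the records are made up:

```python
# Sketch: filter sample records by collection location and date range,
# the kind of combined query that's hard to do in Entrez. Records and
# field names here are illustrative only.
from datetime import date

samples = [
    {"acc": "SRR0000010", "geo_loc_name": "USA",
     "collection_date": date(2021, 1, 15)},
    {"acc": "SRR0000011", "geo_loc_name": "USA",
     "collection_date": date(2020, 6, 2)},
    {"acc": "SRR0000012", "geo_loc_name": "Germany",
     "collection_date": date(2021, 2, 1)},
]

start, end = date(2021, 1, 1), date(2021, 3, 31)
usa_2021 = [s["acc"] for s in samples
            if s["geo_loc_name"] == "USA"
            and start <= s["collection_date"] <= end]
print(usa_2021)  # ['SRR0000010']
```

Adding a mutation filter is just one more condition joined against the variant-call table for those accessions.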
One thing that I'll note about these contigs is that they're not intended to be the final word in assembly, or even the most optimal for all samples, since that's not really possible. It's more based around the idea that the first major step of using SRA data is finding the data that you are interested in, since it's a big heap of data, and the metadata is designed to help you do that. And the second major step is often assembling it. So we're doing that step for you, not because this assembly will definitely be best for your analysis, but to give you a sense of what's in there and what can be extracted out of it beforehand, because it can be pretty costly, both in terms of time and resources, to do these assemblies. So we're trying to make it a little bit easier to figure out which of these runs are of interest to you, so you can decide what's in them before doing your own work or your own better assembly that's more specific to your analysis.

And the benefit of that is that you have access to this large volume of COVID-19 raw sequence data, and there's no throttling and no cost to you. All of this data that I'm talking about here is available in the cloud. These are free, open buckets that can be worked on in the cloud. Of course, if you have data sets there, you can combine them and work on them in the cloud, or you can download them to local storage at no cost to you as well. And importantly, you can also import them into Galaxy for use in your existing workflows, and that's what we're going to show you in the upcoming tutorial video. So please stick around.