Hello everyone, my name is Josper Auerkerk, and in this video I will be guiding you through the Galaxy tutorial on trio analysis using synthetic datasets from RD-Connect GPAP. To open a tutorial, you can click on this icon, which shows all the tutorials available in Galaxy. We go to Variant Analysis and scroll down to find our tutorial, "Trio analysis using synthetic datasets from RD-Connect GPAP". In this video and tutorial we will answer the following questions: How do we import data from the EGA? How do we download files with htsget in Galaxy? How do we preprocess VCFs? And how do we identify causative variants? The objectives are: requesting DAC access and importing data from the EGA, preprocessing VCFs using regular expressions, and using annotations and phenotype information to find causative variants. It is useful to first follow the Introduction to Galaxy Analyses and the Sequence Analysis tutorials before doing this one. The tutorial starts with some introduction on variants and inheritance in family trios: there are different kinds of patterns, such as autosomal dominant, autosomal recessive and de novo. If you want to know more about this, you can read the introduction; I won't go through it in this video. We will make use of the htsget protocol, a protocol for downloading data securely, via the EGA Download Client in Galaxy. We will not start our analysis from scratch, because we will directly use the VCF files, the variant calling files. If you want to know more about doing variant calling from the start, you can follow the link here to the tutorial "Exome sequencing data analysis for diagnosing a genetic disease". So first we have to get the data. In this tutorial we are using case 5 from the RD-Connect GPAP synthetic datasets. The datasets have been generated from a real family trio, which originates from the Illumina Platinum Genomes initiative and is described in Eberle et al. 2017.
If you are interested, you can follow this link and read their paper. This data was made available by the HubMap project. In the dataset that we are using, a real pathogenic variant, which should cause breast cancer, was manually spiked in; it has been synthetically introduced in the mother and the daughter. Our goal here is to identify the genetic variant responsible for the disease. So first we want to download the data. For this, you will need to go to the EGA archive and request DAC access. This takes only one working day and gives you access to all the RD-Connect GPAP synthetic datasets. However, if you don't have the time for that, you can also download the data from Zenodo by clicking on this button, and we can import the data from there. To import the VCFs, you copy the links, go to your Galaxy, click Upload Data, then Paste/Fetch data, paste the links in the text box here, and click Start. This will download the three VCFs: case5_F, which is the father, case5_IC, which is the index case, and case5_M, the mother. However, we will request DAC access via the EGA archive instead. To request DAC access, you go to the EGA archive website, ega-archive.org, and first find your dataset. In this case we will be using dataset EGAD00001008392, which we can search for here. Then we go to Datasets and click on this link. This dataset consists of 18 samples, because there are three VCFs per case, one for the mother, one for the father and one for the index case, for each of the six cases. To get access to this dataset, you need to request it, and we can find here how to do that. The contact for this specific dataset is the EGA helpdesk, found here. So you just need to email helpdesk@ega-archive.org and request the dataset. I have already done this before.
So under My Account, My Datasets, you can see the EGA dataset that is available to me, which is this one. If you don't have an account yet, the EGA will make one for you, and then you can access your data. To link your EGA account to Galaxy, you have to change some settings: you go to User, Preferences, Manage Information, and then you add the email address of your EGA account to the Your EGA Account block, together with the password of your EGA account. Then everything should be set up properly. To test this, we are going to use the EGA Download Client. In this step of the tutorial we check the login and the authorized datasets by listing your authorized datasets with the EGA Download Client. So we go to the tool, keep the default action, and just run it. This can take some time, because we are also still downloading the Zenodo files and the tool needs to connect to the EGA servers, so I'll be right back. After some time, the request has finished properly, and we can have a look at the output. You should see your dataset listed here, and this is the corresponding dataset. Our VCF files from the Zenodo links have also been downloaded successfully; however, I will not use those, as we will continue with the EGA access route to download the files. If we go back to the tutorial, we can start the next task: downloading the list of files in the dataset. Each dataset has, of course, multiple files, and first we have to request the list of files to see what we actually want to use. So you click on the EGA Download Client again, choose "List files in a dataset", and add your accession ID, which we can find in the listing here. We copy it, select "List files in a dataset", paste the dataset accession, and run the tool. This might take some time again, depending on the amount of activity on the EGA archive side. The list of files has been downloaded properly, and we can look at the result: here you can see a list of the files that are available in the dataset.
So here, in this column you see the file IDs, and here you see the file names. On the first row we see case 3, the proband, and the forward reads. However, we are only interested in the VCFs of case 5. To find these files, we can use a filter tool, as described in the tutorial. So we go back to the tutorial: we have downloaded the list of files, and now we want to filter them. We have to search the text file for the pattern of the VCF files, so we go to "Search in textfiles". We open the file list of the EGA dataset, and then we match a regular expression, or regex, to extract the lines with the files that we want. There are different formats of regular expressions, but in this case we use extended (egrep). Then we type in the regular expression described in the tutorial; we copy and paste it from here. We try to find lines in the file which contain case5, the case that we're interested in, then a dot and a plus, which matches any characters until we find 17, then we match any characters again until we find the extension vcf.gz. The dollar sign means that the line has to end there, because the file names are at the end of each line, so we have to match them at the end. We make sure that case sensitivity is set, and then we run the tool. This will output the list of files that we want. OK, the file has been filtered, and we can check the results: we now see the three rows of files that we want to download, the case, the mother and the father. We go back to the tutorial. Now we want to download the files listed here, so we go to the EGA Download Client again and click "Download multiple files based on file id".
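As a quick illustration of this filtering step, here is a minimal Python sketch of the same egrep-style pattern applied to a few file-list lines. The file IDs and file names below are made up for illustration; the real list comes from the EGA Download Client.

```python
import re

# Extended regex from the tutorial: lines containing "case5", then
# anything, then "17", then anything, ending in ".vcf.gz".
pattern = re.compile(r"case5.+17.+vcf\.gz$")

# Hypothetical file-list lines (file ID <tab> file name).
lines = [
    "EGAF00000000001\tcase3_proband_R1.fastq.gz",
    "EGAF00000000002\tcase5_F.17.vcf.gz",
    "EGAF00000000003\tcase5_IC.17.vcf.gz",
    "EGAF00000000004\tcase5_M.17.vcf.gz",
]

# Keep only the lines that match, as the Search in textfiles tool does.
matches = [line for line in lines if pattern.search(line)]
print(len(matches))  # the three case-5 VCFs
```

The `$` anchor is what prevents matching index or checksum files that merely mention a VCF name somewhere in the middle of the line.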
You could also choose "Download a file", but then we could only download one file at a time and would have to do it three times; this way we can do it in one go, which is a lot easier. So you select that, then you select the file list that you want to use, and you select the column containing the file IDs, which is column 1. We could also request a specific genomic range; however, the files are already split by chromosome, and in this case we are only interested in chromosome 17, so we don't have to fill in anything here. This is very useful if you want to download BAMs, for example. We run the tool, and then we wait for the VCFs to be downloaded. The files have been downloaded and are listed here: our father, our case and our mother VCFs. I also just realized I have never given a name to my history, so I will do that now; it is always good to name your history. I will call it "Trio Analysis using EGA". The next step in the tutorial is to decompress our VCF files, because we want to do some preprocessing with text tools. For now they are bgzipped VCFs, so we cannot use tools that work on text files; therefore we have to decompress them. We can do this by going to our dataset collection, clicking the edit button, then Convert, choosing convert compressed file to uncompressed file, and converting our collection. This creates a new collection with our converted VCFs. The files have been converted, as you can see: before they were vcf.gz, and now they are plain VCFs. Now we can preprocess our files, so we go back to the tutorial. First we have to add the chromosome prefix to the VCFs, because they don't contain it yet; the chromosome column just shows 1, 2, 3, 4, 5 and so on. We have to add this prefix because the tools we use later expect it. So we go to the tool "Column Regex Find And Replace", and then we copy and paste the regular expressions displayed here.
So we go to this tool and select our files, which is a collection, so we click on this button and choose the uncompressed VCF files. We add our first check: we click Insert Check, and then we go to the tutorial again. We put in this regex, so that the tool finds this pattern and replaces it with that pattern. What this does is: it looks for lines that start with 0-9, M, Y or X, which are the chromosome names, and replaces them by adding the chromosome prefix. So it adds chr, and \1 is the pattern that it found; it doesn't change the matched text, it just adds the prefix in front. Then we add another check, which is aimed at the header lines of the VCF. Here we want to fix the header lines that describe the contigs, the chromosomes used in the VCF, by also adding the prefix there. So we add another check, paste the regex, and add the replacement. What this does: it first matches this pattern, which has to start with hashtag, hashtag, then contig, then an equals sign and some ID; then it finds the same pattern as before, which is basically the chromosome name, and it inserts the chr prefix between these two matches. Then we run the tool. I can show you what exactly changed in the VCF file: you click on the collection, click on the eyeball, the file loads, and we scroll down to the contig header lines. The header lines are marked by the hashtags, and we see contig, then ID equals chromosome 1; before there was only the 1 here, and now the chr has been added. We can also go further down the file and see that the chr prefix has been added to the chromosome column here. While this is running, we can go to the next step. If you're not completely sure about the regex, you can always find more resources on it.
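To make the two checks concrete, here is a small Python sketch of equivalent substitutions on two example VCF lines. The exact regexes are given in the tutorial; the patterns below are illustrative approximations of the same idea.

```python
import re

# Check 1: prefix the chromosome name on data lines with "chr".
# Matches a leading chromosome name (digits, X, Y or M) followed by a tab.
data_line = "17\t302\t.\tT\tTA\t50\tPASS\t."
data_fixed = re.sub(r"^([0-9XYM]+)\t", r"chr\1\t", data_line)

# Check 2: prefix the contig ID inside the ##contig header lines.
header_line = "##contig=<ID=17,length=81195210>"
header_fixed = re.sub(r"^(##contig=<ID=)([0-9XYM]+)", r"\1chr\2", header_line)

print(data_fixed)    # the data line now starts with chr17
print(header_fixed)  # the header now reads ##contig=<ID=chr17,...>
```

In both cases the backreferences (`\1`, `\2`) keep the matched text unchanged; only the `chr` prefix is inserted, exactly as described above.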
This is also explained very elaborately in the tutorial here. The next step is normalizing the VCF files by left-aligning the indels and normalizing them. According to the article cited here (Tan et al.), an indel is left-aligned and normalized if and only if it is no longer possible to shift its position to the left while keeping the length of all its alleles constant, and if it is represented in as few nucleotides as possible. Basically, there are many ways to write down the same variant. Here, for example, there is a CA deletion relative to the reference: this can be written as CA to nothing, or as CAC changing to C, or as GCACA to GCA. So there are many ways to represent it. However, we have to make sure that all variants are represented in the same way, because we will compare them between the samples, and in the databases they are only stored in the normalized form. To normalize our data, we use the bcftools norm tool. We use the collection where we replaced the prefixes, we use the built-in genome hg19, and we want to left-align and normalize indels. We don't do any deduplication, and we want to split multiallelic sites into biallelic records, for both SNPs and indels. What this step does is split up multiallelic variants: if there are two different alternative alleles at the same position, they will be represented as two records instead of one. Then we output our files as uncompressed VCF and run the tool. Now that the normalization step has finished, we can go to the next step in the tutorial, which is filtering out NON_REF sites. In our VCFs there are some <NON_REF> tags in the ALT column, the column that represents the alternative alleles, and these tags stand for any possible alternative allele at that location.
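As a rough illustration of what normalization means, here is a Python sketch that trims redundant shared context from REF/ALT alleles so the same deletion is always written with as few nucleotides as possible. This is a simplified toy, not the actual bcftools implementation; real tools also left-align against the reference genome sequence.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared between REF and ALT so the variant is
    represented parsimoniously. Simplified sketch of one part of
    normalization (left-alignment against the genome is omitted)."""
    # Trim the common suffix, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim the common prefix, shifting the position to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# The CA deletion from the example, written with extra context:
print(trim_alleles(100, "GCACA", "GCA"))  # -> (100, 'GCA', 'G')
```

After trimming, two bloated representations of the same deletion collapse to the same minimal form, which is what makes comparison across samples and database lookups reliable.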
So they are a kind of placeholder for potential variants; however, we are not really interested in them, and they slow down our analysis quite a lot. To filter them out, we use the Filter tool. We select the dataset collection with the normalized VCFs. We filter on the ALT column, which is column 5, enter the <NON_REF> tag, and change the equals sign to an exclamation mark plus equals sign (!=), so the condition means that the column should not be equal to <NON_REF>. We skip 142 header lines, because we don't want to filter out any header lines, and then we run the tool. The filtering of the NON_REF sites is done, and we can look at the effect. For the father's VCF there are now about 140,000 lines left; we only kept 4.9%, as shown here. So we have filtered out a lot, which speeds up the downstream analysis. The next step is to merge our VCFs into one dataset. We take the three separate files of the father, the mother and the case and merge them into a single VCF. Each row will again show a variant, but after the variant information you see three columns for our samples, case, mother and father, indicating whether or not each of them carries the variant; if a sample doesn't have the variant, this is indicated by a missing-genotype pattern. We can merge the VCFs using the bcftools merge tool by clicking here, then selecting the dataset collection with the filtered VCFs. We don't restrict it to any regions. For merging, multiallelic records could again be created, but we don't want any new ones; we just want to output them as multiple records instead. There are no header options, and we output the result as a VCF file again, and then we run the tool. The VCFs have been merged, and now we can have a look at the result.
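The filter condition itself is simple; here is a minimal Python sketch of the same idea applied to a few made-up VCF data lines (the Galaxy Filter tool evaluates an expression like this, with `c5` for column 5, on each non-header row).

```python
# Keep only rows whose ALT column (field 5, index 4) is not "<NON_REF>".
# Made-up minimal VCF body lines for illustration.
rows = [
    "chr17\t100\t.\tA\tG\t.\tPASS\t.",
    "chr17\t200\t.\tC\t<NON_REF>\t.\tPASS\t.",
    "chr17\t302\t.\tT\tTA\t.\tPASS\t.",
]

kept = [row for row in rows if row.split("\t")[4] != "<NON_REF>"]
print(len(kept))  # 2 of the 3 rows survive the filter
```

In the real data this condition removes the vast majority of rows, which is exactly the ~95% reduction reported above.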
You open the merged VCF by clicking the eyeball and scroll down to the records. We see the chromosome of this variant, position 302, the reference allele, and then the alternative allele with an extra A, so it's an insertion of an A; then you see some variant information. Then we see the samples, and you can see which one carries the variant: in this case all family members have it, but in the fourth row you can see that the mother doesn't have the variant, because that is indicated by the empty dots. This makes it very useful for algorithms to check whether variants are shared, because the inheritance pattern is important when doing a trio analysis. For the next step in the tutorial, we go to the annotation of the VCF. To annotate our data we will use SnpEff on the merged VCF: we click on SnpEff here, input the merged VCF, and output it as VCF as well. We use the locally installed SnpEff database, and we use hg19, because that's the reference we're using. We keep everything else the same, using the defaults, nothing special, and then we just run the tool. The annotation of the VCF is done; it has created the annotated VCF file and also an HTML file with some results, so let's have a look at that. Here you see a general summary of the file and the results: the number of SNPs found, about 150,000, the insertions and deletions, and the number of effects by impact. There are fewer high-impact variants than low- or moderate-impact ones, which is to be expected. Then the effects by functional class, i.e. whether a variant is a missense, nonsense or silent mutation, and the locations of the variants, shown in the table above and in a bar plot; we see that the variants are mostly in introns. Here you see some more statistics that I will not go into, but you can have a look at them yourself. For the next step in the tutorial, we go to the GEMINI analysis, i.e. the trio analysis.
Before we can run this trio analysis, we have to create a pedigree file describing our family trio. You can see the pedigree file here, and you can import it into Galaxy by pasting and fetching: we copy the pedigree file, go to Upload Data, choose Paste/Fetch data, give it the name "pedigree", click Start, and it will be imported. Now let me explain the pedigree file. It has a family ID, indicating which family a row belongs to, and the name of the sample, so we have 6M, 6F and 6C. As you might remember, we use case 5, not case 6; however, in the VCFs the sample names are case 6, so we also have to name them like this in the pedigree. Then we have the paternal ID, the sample name of the father, and the maternal ID, the sample name of the mother, which for our case sample are 6F and 6M. Then the sex of the person: 2 is female, 1 is male, so we have the mother and daughter as female. Then the phenotype, whether or not the person is affected by the disease: 2 is affected and 1 is unaffected. In our case the mother and daughter both had breast cancer, so they are both affected, and the father is not. The next step is loading our pedigree file and VCF file into the GEMINI database, which we do with the GEMINI load tool. We drag and drop our VCF file here, scroll down, select our pedigree, run the tool, and wait until the database is loaded. The database has been loaded, so now we can analyze it using the GEMINI tools. However, before we can start with that, we need to know what kind of inheritance pattern we are interested in. We know that both the mother and the daughter are affected and the father is not; this makes it less likely that the inheritance pattern is recessive. It is still possible, but an autosomal dominant pattern, which you can also apply here, is much more likely. The same reasoning applies to the de novo pattern.
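A pedigree (PED) file with the columns described above might look as follows, tab-separated: family ID, sample name, paternal ID, maternal ID, sex (1 = male, 2 = female) and phenotype (1 = unaffected, 2 = affected), with 0 marking an unknown parent. The family ID shown here is a placeholder; the sample names follow the VCF headers as explained in the video.

```
#family_id  name  paternal_id  maternal_id  sex  phenotype
FAM         6F    0            0            1    1
FAM         6M    0            0            2    2
FAM         6C    6F           6M           2    2
```

Note how the affected daughter (6C) points to both parents, while the mother's affected status (phenotype 2) is what later lets GEMINI favor an autosomal dominant pattern over a de novo one.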
Since the mother is affected, it is again more likely that the pattern is autosomal dominant rather than de novo; with a de novo pattern you would typically see that both parents are unaffected while the proband, the daughter or son, is affected. To analyze the trio, we use the GEMINI inheritance pattern tool. We select our database and our assumed inheritance pattern, which is autosomal dominant, and then we add an extra constraint: we require that the impact severity of the variant is not equal to low. A low impact severity means the variant has no impact on protein function, which makes it unlikely to be causing the disease. So we put that in there. We keep the additional criteria at their defaults, so it analyzes all variants from all included families. Then we want to output a custom report by adding additional columns, separated by commas. We go back to the tutorial, copy and paste this line here, run the tool, and wait until the analysis is done. We will get a list of variants that are potentially causative for the disease. Now the analysis has finished, and we can look at the generated list of variants that could be the cause of the breast cancer. In this file you can see a lot of information on the variants: each row is a variant, and we see the impact, the gene it is on, the clinical significance according to ClinVar, the disease name, the gene phenotype, some rs IDs and variant IDs, and then some information on the trio. You also see the samples, and they all relate to case 6C and 6M, because we searched for the autosomal dominant pattern. To see if we can find anything related to our phenotype, we can search for "breast", for example. We find two rows: this one here with the BRCA1 gene, and DAX14. However, for DAX14 the ClinVar significance, this column, says none, and there is also no disease name.
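Conceptually, the autosomal dominant test applied to each variant can be sketched as follows. This is a simplified illustration of the genotype pattern, not GEMINI's actual implementation: every affected sample must carry at least one alternate allele, and every unaffected sample must carry none.

```python
def fits_autosomal_dominant(genotypes, affected):
    """genotypes: dict sample -> number of ALT alleles (0, 1 or 2).
    affected:  dict sample -> True if the sample is affected.
    Simplified check: affected samples are carriers (usually
    heterozygous), unaffected samples are homozygous reference."""
    return all(
        (genotypes[s] >= 1) if affected[s] else (genotypes[s] == 0)
        for s in genotypes
    )

# The spiked-in variant: the affected mother and daughter are
# heterozygous carriers, the unaffected father is homozygous reference.
status = {"6C": True, "6M": True, "6F": False}
print(fits_autosomal_dominant({"6C": 1, "6M": 1, "6F": 0}, status))  # True

# A variant also carried by the unaffected father does not fit.
print(fits_autosomal_dominant({"6C": 1, "6M": 1, "6F": 1}, status))  # False
```

This is why the report rows all involve 6C and 6M: only variants shared by the two affected samples and absent from the father survive the autosomal dominant filter.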
However, when we look at BRCA1, we see that it is pathogenic and also related to breast cancer. So this is the variant most likely to be causing the breast cancer, and it actually is the variant that we were looking for. So with this tutorial, we have found the variant of interest. We downloaded datasets from the EGA using the htsget protocol, and we were able to find the variant by preprocessing and annotating the variants using SnpEff and GEMINI. You can extend this work by downloading any trio dataset from the EGA using Galaxy and the EGA Download Client and doing the analysis yourself with this workflow, which is listed here. I hope you find this video useful for your future work and that it makes it easier for you to do trio analysis in Galaxy.