Hi, my name is Alex Ostrovsky. My partner Delphine Larivière and I will be presenting a talk today about some advancements in the field of genome assembly, as well as an exciting open-data venture from a group known as the Vertebrate Genomes Project. Although de novo genome assembly has historically been a large effort within the genomics community, it has also had a very high barrier to entry: it requires a great amount of technical expertise, computational resources that are not available to most people, and high-quality data with which to actually perform the assembly. This effort resonates with the Galaxy community, especially considering that one of our chief goals is to promote accessibility in computational research. We are proud to have provided tools to researchers for the past decade, enabling them to perform their own assemblies, and we have continued to expand our available toolkit as new tools are developed. With these goals in mind, we are excited to discuss with you an ongoing effort in open data and genomics. Launched in February of 2017, the Vertebrate Genomes Project (VGP) is an effort from the G10K Consortium, made up of over 50 institutions, with the stated goal of creating error-free reference genomes for more than 70,000 vertebrate species. This goal is organized into four phases sorted by taxonomic hierarchy, with each phase representing a more specific taxonomic rank. The VGP is currently in its first phase, with 268 species representing 260 orders planned for assembly. As these assemblies are generated, they are posted to the VGP's GenomeArk repository, which currently holds 112 completed assemblies, with several more in progress. In generating these assemblies, a best-practice pipeline has been successfully developed that integrates Hi-C and Bionano optical map data. Although the figures shown here were recently published, the pipelines have been further developed since their release.
Delphine will be getting into the specifics of these pipelines very soon. And now Delphine will discuss the new collaboration between the Galaxy Project and the Vertebrate Genomes Project. The collaboration between the VGP and the Galaxy team focuses on the second edition of the VGP assembly pipeline. The first part of the pipeline is based on PacBio data: the long-read assembly is performed using hifiasm, and the purge_dups package is used to separate haplotypes. The pipeline then uses Bionano genome mapping data to perform scaffolding of the previously assembled genome. Finally, we provide a pipeline to perform hybrid assembly using Hi-C data, adding an additional level of scaffolding to reconstruct whole chromosomes. At each step, the pipeline includes quality control tools to determine the parameters to use for the assembly and to evaluate the quality of the assembly. The VGP pipeline has been implemented in Galaxy as six distinct workflows. We are presenting the full pipeline today, but we divided the analysis into several workflows so that users can combine them according to the available data. The first part of the VGP pipeline is the generation of a phased assembly using PacBio long reads. This assembly is generated using the hifiasm tool. In this figure, the orange and blue bars represent reads with heterozygous alleles that carry local phasing information; green bars come from homozygous regions without any heterozygous alleles. In the phased string graph, a vertex corresponds to a HiFi read, and an edge between two vertices indicates that the corresponding reads overlap each other. hifiasm first performs a haplotype-aware error correction, correcting errors in the sequence while keeping heterozygous alleles. It then builds a phased assembly graph with local phasing information from the corrected reads. Only reads coming from the same haplotype are connected in the phased assembly graph.
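To make the phased string graph idea concrete, here is a minimal Python sketch (not hifiasm's actual code): reads are vertices, suffix-prefix overlaps are edges, and reads tagged with different haplotypes are never connected. The function names, the tiny sequences, and the haplotype tags are all illustrative assumptions.

```python
# Toy phased string graph: vertices are reads, edges are suffix-prefix
# overlaps, and reads carrying different heterozygous alleles (here,
# different haplotype tags) are never connected. Illustrative only.

def overlap_len(a, b, min_len=3):
    """Longest suffix of a that is a prefix of b (at least min_len long)."""
    for l in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-l:] == b[:l]:
            return l
    return 0

def phased_string_graph(reads, min_len=3):
    """reads: dict name -> (sequence, haplotype_tag); tag None = homozygous."""
    edges = []
    for na, (sa, ha) in reads.items():
        for nb, (sb, hb) in reads.items():
            if na == nb:
                continue
            # only connect reads from the same haplotype (or unphased reads)
            if ha is not None and hb is not None and ha != hb:
                continue
            l = overlap_len(sa, sb, min_len)
            if l:
                edges.append((na, nb, l))
    return edges

reads = {
    "r1": ("ACGTAC", "hapA"),
    "r2": ("TACGGA", "hapA"),  # overlaps r1 by "TAC", same haplotype: edge
    "r3": ("TACTTA", "hapB"),  # same overlap, other haplotype: no edge
}
edges = phased_string_graph(reads)
```

With these toy reads, only r1 and r2 end up connected, even though r3 has the same 3-base overlap with r1, because r3 carries the other haplotype tag.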
If provided with complementary data, hifiasm generates a completely phased assembly for each haplotype from the graph. Once we obtain the phased assembly sequences generated by hifiasm, we use Bionano data to assemble scaffolds. Bionano technology creates optical images of long DNA molecules in their native state, preserving long-range genomic structural information: structural variations are observed directly instead of computationally inferred, as in sequencing approaches. These long labeled molecules are assembled into physical maps spanning the whole genome. The resulting maps can be used for scaffolding NGS contigs and for detecting structural variants. In this picture, you can see that the Bionano map, in dark blue, has been used to create a hybrid scaffold from the PacBio contigs. The VGP pipeline also uses Hi-C data to generate scaffolds from the phased assembly, or to improve the scaffolds obtained after Bionano scaffolding. In parts B and C of the figure, the arrows represent the contigs. In part B, the arcs between arrows represent the linkage information obtained from the alignment of Hi-C reads to the assembly; the thickness of an arc denotes the weight of the Hi-C edge implied by the Hi-C reads. In part C, the arcs represent the overlap between contigs. From these two types of data, a graph is generated, as shown in part D, with solid edges representing linkage between contigs and dotted edges indicating the links between the two ends of the same contig. From this graph, the algorithm selects the maximal-weight edges, as shown in part E, such that each node is linked to at most one solid edge. Once this maximal weighted matching has been calculated, the edges between the ends of the same contigs are added back to the matching to obtain the final scaffolds. The first Galaxy workflow is the generation of the meryl database from the PacBio reads. This database is used across the entire assembly pipeline for quality control.
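The Hi-C scaffolding logic just described (a weighted matching over contig ends, with same-contig edges added back to read off scaffolds) can be sketched in a few lines of Python. This is a simplified illustration of the idea, not SALSA2's actual implementation; all names and the example weights are made up.

```python
# Sketch of Hi-C scaffolding as weighted matching over contig ends.
# Solid edges = Hi-C links between ends of different contigs;
# dotted edges = the two ends of one contig, re-added to walk out paths.

def greedy_matching(edges):
    """edges: (end_a, end_b, weight); greedily keep the heaviest
    compatible links so each contig end is matched at most once."""
    used, match = set(), []
    for a, b, w in sorted(edges, key=lambda e: -e[2]):
        if a not in used and b not in used:
            match.append((a, b))
            used.update((a, b))
    return match

def scaffolds(contigs, hic_edges):
    """contigs: names; each has two ends, name+'.L' and name+'.R'.
    Circular scaffolds are ignored in this sketch."""
    link = {}
    for a, b in greedy_matching(hic_edges):
        link[a], link[b] = b, a                              # solid edges
    mate = {}
    for c in contigs:
        mate[c + ".L"], mate[c + ".R"] = c + ".R", c + ".L"  # dotted edges
    seen, result = set(), []
    for c in contigs:
        for start in (c + ".L", c + ".R"):
            if start in seen or start in link:
                continue            # only start walking at unmatched ends
            path, end = [], start
            while True:
                path.append(end.rsplit(".", 1)[0])
                far = mate[end]
                seen.update((end, far))
                if far not in link:
                    break
                end = link[far]     # cross the Hi-C link to the next contig
            result.append(path)
    return result

contigs = ["c1", "c2", "c3"]
hic = [("c1.R", "c2.L", 9), ("c2.R", "c3.L", 7), ("c1.R", "c3.L", 3)]
```

Here the weak c1-c3 link loses to the heavier c1-c2 link, so the three contigs chain into a single scaffold in the order c1, c2, c3.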
It is also used to generate the k-mer distribution histograms from which GenomeScope profiles are built. The GenomeScope profile provides profiling of polyploid genomes. It estimates the length of the genome and the coverage of heterozygous k-mers, described here by the kcov variable. The four k-mer peaks marked by dotted lines in the graph correspond to the mean coverage levels of, respectively, unique heterozygous, unique homozygous, repetitive heterozygous, and repetitive homozygous sequences. The variables identified by the model are then used downstream as parameters for the next workflows. The second workflow is dedicated to the long-read genome assembly. The long-read assembly is performed using hifiasm, and minimap2 is used for the mapping steps. This workflow actually contains three distinct parts, which you can see on the slide. At the top, in gray, we use parsing tools to extract the model parameters from the text output of GenomeScope, which allows us to automatically use them as input parameters for other tools. The orange and blue rectangles indicate the use of the purge_dups tool on the primary and alternate hifiasm assemblies, respectively. hifiasm is an assembly tool that emits partially phased assemblies. purge_dups is dedicated to the removal of haplotigs and contig overlaps from a de novo genome assembly, to further improve the quality of the phased assemblies. The purge_dups package contains several steps that we separated in the Galaxy workflow to allow for parameter adjustment. This need for parameter adjustment led to the creation of two sub-workflows, which allow the user to continue running the rest of the long-read assembly workflow after two key steps of the analysis.
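Going back to the k-mer side of the pipeline for a moment: the meryl database boils down to counting canonical k-mers in the reads and tabulating how many k-mers occur once, twice, and so on; GenomeScope then fits its mixture model to that multiplicity histogram. Here is a toy version of the counting step, with an unrealistically small k and made-up reads (meryl itself works very differently at scale).

```python
# Toy k-mer histogram, the data structure behind GenomeScope profiles:
# count canonical k-mers (min of k-mer and its reverse complement),
# then tabulate multiplicity -> number of distinct k-mers.
from collections import Counter

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def kmer_histogram(reads, k=4):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            kmer = r[i:i + k]
            counts[min(kmer, revcomp(kmer))] += 1  # canonical form
    # histogram: multiplicity -> how many distinct k-mers occur that often
    return dict(sorted(Counter(counts.values()).items()))

hist = kmer_histogram(["ACGTACGT", "ACGTACGT"], k=4)
```

On real HiFi data the peaks of this histogram sit at the heterozygous and homozygous coverage levels (kcov, 2x kcov, and their repetitive copies), which is exactly what the model reads off.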
If the cutoffs automatically calculated in the full workflow are not satisfactory, the user can modify the parameters manually and run the sub-workflows to resume the analysis. Sub-workflows are available for the cutoff calculation of the haplotig purge, for both the primary and the alternate assembly. Three tools are used for quality control at each step of the genome assembly. QUAST provides basic quality information on the assemblies, including the assembly length, the number of contigs or scaffolds, and the N50. BUSCO evaluates the presence of near-universal single-copy orthologues in the assembly to assess its completeness. Finally, Merqury provides copy number spectrum analysis. After the long-read assembly has been performed, the third workflow is dedicated to the scaffolding step using Bionano genome mapping. Integrating Bionano data into an assembly makes it possible to order and orient the sequence fragments, identify potential chimeric joins in the sequence assembly, and estimate the gap sizes between adjacent sequences. The scaffolding is performed using the Bionano Solve tool dedicated to Bionano data, and the workflow includes assembly quality control with QUAST, Merqury, and BUSCO. Last but not least, the final workflow performs a hybrid assembly using either the long-read assembly or the scaffolded assembly in combination with Hi-C data. Hi-C data provides long-range sequence information to improve scaffolding and reach chromosome-spanning assemblies. The hybrid assembly is performed using the SALSA2 tool. In addition to the quality control tools used in the rest of the pipeline (QUAST, BUSCO, and Merqury), the final workflow contains the PretextMap tool, which generates genomic contact maps from the Hi-C data. To illustrate the pipeline, we used this set of workflows to assemble the genome of the chicken.
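Since the N50 reported by QUAST is the headline contiguity metric in what follows, here is a reminder of how it is defined, as a small generic Python function (this is the standard definition, not QUAST's code; the example lengths are made up).

```python
# N50: the length L such that contigs of length >= L together cover
# at least half of the total assembly length.

def n50(lengths):
    total = sum(lengths)
    running = 0
    for l in sorted(lengths, reverse=True):
        running += l
        if running * 2 >= total:  # reached half the assembly
            return l

# Example: lengths sum to 250; 80 + 70 = 150 >= 125, so N50 = 70.
example = n50([80, 70, 50, 30, 20])
```

Fewer, longer contigs push the N50 up, which is why each scaffolding step in the pipeline is expected to raise it.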
The genome is 1.05 gigabase pairs in length, and the hybrid assembly using PacBio, Bionano, and Hi-C data runs in approximately 4 to 5 days on usegalaxy.eu. You can see here the evolution of the assembly quality as reported by QUAST. The first assembly, generated by hifiasm, was made of about 1,200 contigs with an N50 of 6 Mb. The latest assembly, including the Bionano and Hi-C scaffolding generated by SALSA2, is made of about 350 contigs with an N50 of 87 Mb. On this slide, you can see the quality evolution as reported by the BUSCO and Merqury tools. We can see that the completeness of the assembly increases at each step, from 88% complete and single-copy BUSCOs for the hifiasm assembly to 95% for the SALSA assembly. We can also see the disappearance of duplicated BUSCOs between the hifiasm assembly and the same assembly after purge_dups. This disappearance is also visible in the copy number spectrum graph provided by Merqury: the blue curve corresponding to duplicated k-mers disappears after the purge_dups step. Finally, the PretextMap tool provides us with the genomic contact map built from the Hi-C data mapped onto the assembly generated by SALSA2. We can see in this contact map that the assembly shows a generally clean scaffolding based on the Hi-C data, with contacts restricted to, and continuous within, individual scaffolds. There are two small exceptions at the spots marked by the letters A and B, showing respectively a disruption and external contacts between contigs. I would like to specify that we developed these workflows on the European instance of Galaxy, but they will eventually be deployed on the public Galaxy instances in the US and Australia. Thank you very much, Delphine. And to facilitate all of that, we are very excited to announce a brand new workbench for Galaxy: assembly.usegalaxy.eu. This site has been built from the ground up to allow researchers from around the world to contribute to the VGP effort. Let's take a look at how.
On the homepage, the site gives a very basic primer on the VGP project, as well as on Galaxy itself. It also links out to several workflows and trainings that might be applicable. For users wishing for a quick start to assembly, we have direct links to all of the VGP workflows and datasets. For ease of data acquisition, we have also added GenomeArk as one of the built-in remote repositories in the "Choose remote files" section of the Galaxy data uploader: simply click the upload data button, go to remote data, and select GenomeArk. There you can find all of the raw data for all species currently published on GenomeArk. We also intend to have extensive shared data within this workbench, both the workflows and the completed assemblies as they are run within Galaxy. The workflows will be clearly labeled and easy to find, and the histories will follow an as-yet-undecided naming convention so that they can be more easily discovered by users. As stated before, one of the factors that makes the partnership between Galaxy and the VGP such a big deal is the democratization of compute resources. In order to maximize the benefits that Galaxy can provide, we have consulted with the VGP to set default resources on the back end for each of the tools involved in the VGP pipeline, for maximum efficiency in analysis. The VGP and Galaxy collaboration is just getting started, and we have big future plans. Soon, we are planning to add automated upload at the end of our VGP pipelines, which will test for sufficiently high-quality output datasets and automate submission to GenomeArk. We are also looking to add conditionals to our workflows, which would allow users to run a single workflow and, based on their input data types or the output quality of their files, run additional steps, run alternative steps, or cancel the workflow outright should the assembly quality not meet the standards.
We will be adding a VGP group to the Galaxy server to give them more control over the data Galaxy publishes to their repositories, as well as over their official workflows. And of course, we plan to continuously update these workflows as they improve over time with new versions of the tools. We would like to thank the Galaxy team, especially the tools working group, who worked extremely hard to get these pipelines up and running as quickly as possible, and the VGP team, especially Giulio and James, whose rapid responses have allowed us to optimize all parameters and resources for the entire pipeline and who keep us updated on changes to the workflows and tools as they occur. Thank you for your time. We would appreciate it if you would also come take a look at our poster after the session. And with that, we would love to answer any questions you might have about this collaboration or the pipeline. Thank you.