Hello, and welcome to the workshop on EncyclopeDIA software at the Galaxy Community Conference 2021. In this workshop, Emma Leith from the Galaxy team will cover the EncyclopeDIA workflow for data-independent acquisition analysis. I'm Pratik Jagtap, and I'm here to offer an overview of data-independent acquisition in general, and the software EncyclopeDIA in particular. This software was developed by Brian Searle when he was at the University of Washington and was implemented in Galaxy by James Johnson from the Minnesota Supercomputing Institute. We'll also have a short interview with James Johnson, also known as JJ, on the benefits of using EncyclopeDIA through Galaxy.

In traditional data-dependent acquisition, also called DDA, a proteomic sample is digested into peptides, ionized, and then analyzed by mass spectrometry. Peptides with precursor intensities above the noise threshold are selected for fragmentation, generating tandem mass spectra, termed MS/MS spectra. These MS/MS spectra can then be matched to peptide sequences in a database. Although this approach is powerful, because of the stochastic nature of data acquisition during mass spectrometry, the mass spectrometer samples peptides for fragmentation with a bias towards those with the strongest signal. Thus, DDA presents a challenge in reproducibly quantifying low-abundance peptides.

In data-independent acquisition, also called DIA for short, all peptides within a defined mass range are subjected to fragmentation, and this analysis is repeated over the full mass-to-charge range. This results in accurate peptide quantification without a bias towards predefined peptides of interest. As you can see in this figure, DIA-MS continuously collects fragment-ion intensities for all eluting peptides by using wider isolation windows, such as 10 Daltons. All the ions within this comparatively wide isolation window are isolated and simultaneously fragmented.
Shown here is the schedule of MS1 spectra in black and the isolation windows of MS2 spectra in red. The simultaneous isolation and fragmentation of multiple peptides results in a complex MS2 spectrum consisting of ions from all isolated peptides. Fragment-ion intensities, along with the MS1 intensities, can be used for quantification. This is possible because MS2 ion intensity is available throughout the entire elution profile. As a result, data-independent acquisition provides a broader dynamic range and improved reproducibility for both identification and quantification; it also produces fewer missing values, which is extremely valuable for statistical analysis, and offers better sensitivity and accuracy in quantification. For example, the peptide shown here in red might not be detected by data-dependent acquisition, but can be detected thanks to the approach taken by data-independent acquisition.

Because of the complexity of the data, in particular the mixed ion information from multiple peptides and the need to parse out quantitative information from the continuous acquisition across the elution profile, many bioinformatics tools have been developed. Software tools for DIA data analysis can be broadly divided into two classes, depending on whether they require a spectral library or take a library-free approach. These include software such as OpenSWATH, Spectronaut, EncyclopeDIA, DIA-NN, DIA-Umpire, and PECAN. For a more comprehensive and current understanding of the state of bioinformatics for DIA analysis, readers are encouraged to consult the following manuscripts and video materials.

Turning our attention to the EncyclopeDIA software: it was developed in Michael MacCoss's lab by Brian Searle, and here is the recommended acquisition strategy for chromatogram library data collection.
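To make the windowing idea above concrete, here is a small sketch of how wide-window DIA cycles and narrow-window gas-phase fractionation (GPF) injections tile the precursor m/z range. The m/z range, window widths, and injection count are illustrative example values, not the exact instrument method used in this workshop.

```python
# Illustrative sketch of DIA isolation-window scheduling.
# All numbers (400-1000 m/z range, 10 m/z and 4 m/z windows,
# six GPF injections) are example values for illustration only.

def tile_windows(mz_start, mz_end, width):
    """Return (low, high) isolation windows tiling [mz_start, mz_end]."""
    windows = []
    low = mz_start
    while low < mz_end:
        windows.append((low, min(low + width, mz_end)))
        low += width
    return windows

# Wide-window DIA: one cycle covers the whole range in 10 m/z windows.
dia_windows = tile_windows(400, 1000, 10)
print(len(dia_windows))          # 60 windows per cycle

# Gas-phase fractionation: each injection covers a narrower 100 m/z
# slice using narrow 4 m/z windows; six injections together cover
# the same 400-1000 m/z range with much higher selectivity.
gpf_injections = [tile_windows(s, s + 100, 4) for s in range(400, 1000, 100)]
print(len(gpf_injections))       # 6 injections
print(len(gpf_injections[0]))    # 25 narrow windows per injection
```

The trade-off shown here is the core of the chromatogram-library strategy: the narrow GPF windows give cleaner, more selective spectra for building the library, while the wide windows keep the quantitative runs fast enough to cover the full range every cycle.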
To start, we have experiment groups, shown here in orange and blue; these could be control and disease, or any other two conditions that one would like to compare. The orange and blue samples are pooled and then run using gas-phase fractionated (GPF) DIA analysis, after at least six injections of the same matrix to ensure consistent chromatography within the GPF runs. Next, a chromatogram library is generated from the pooled samples, and this library is then used to interpret quantitative results for each biological replicate.

The inputs required for the EncyclopeDIA workflow are a protein FASTA file; a spectral library, which could be generated from your data-dependent acquisition data; and gas-phase fractionated DIA data acquired with narrow windows. These go into EncyclopeDIA so that it can generate a chromatogram library, which is then used as a template to search the experimental DIA data and produce an output of proteins and peptides along with their associated quantitative data.

One variation that has been introduced recently is that, instead of using the spectral library generated from DDA data, one can take the protein FASTA file and run it through the deep-learning-based Prosit software, which produces a predicted peptide library with predicted fragmentation as well as predicted retention times, among other features. This predicted library can then be used as the input template for EncyclopeDIA. So you have your protein FASTA file, a predicted peptide library generated from it, and the gas-phase fractionated DIA data that go into EncyclopeDIA to generate the chromatogram library. This is the approach that we will be discussing in this tutorial. For more information about EncyclopeDIA, readers are encouraged to visit the following website or consult the manuscripts and video materials.

The dataset used in this tutorial is a mixture of proteins from T4 bacteriophage infecting its host E.
coli, as well as non-host species such as Salmonella typhimurium and Bacillus subtilis. This dataset will be referred to as the iPRG dataset throughout the workshop, since a DDA dataset of these samples was used for the Proteome Informatics Research Group (iPRG) metaproteomics study of 2020. The materials comprise a protein FASTA file, the Prosit-generated spectral library that we discussed earlier, six gas-phase fractionated raw files, and four biological replicates of bacteriophage-infected samples.

Lastly, we would like to acknowledge some of the researchers who have been instrumental in making this tutorial possible. We'd like to start with Susan Weintraub from the University of Texas Health Science Center at San Antonio, who generously provided the dataset that we have used as input here. Secondly, we'd like to thank Brian Searle, who is currently at Ohio State University; his input on bringing the software he developed into Galaxy has been very important. We'd also like to thank Matt Chambers, who helped us implement msconvert in Galaxy so that raw files could be converted into the appropriate format using appropriate parameters. We'd also like to thank Tim Griffin and Subina Mehta for their help, and Emma Leith, who has worked extensively on making this tutorial. Lastly, we'd like to thank James Johnson from the Minnesota Supercomputing Institute, and acknowledge the support of Saskia Hiltemann and Björn Grüning from Galaxy Europe.

Now we will have a short interview with James Johnson from the University of Minnesota. He is a senior developer at the Minnesota Supercomputing Institute, and we'll ask him about the benefits of using EncyclopeDIA in Galaxy, as well as the challenges he faced while implementing EncyclopeDIA in Galaxy. So we have James Johnson, senior developer at the Minnesota Supercomputing Institute. Hello, JJ. Hello.
So JJ, as we are aware, we are talking about the EncyclopeDIA implementation within Galaxy as part of this tutorial. I wanted to get a developer's perspective on this implementation. You're aware that there is a desktop application for EncyclopeDIA available to users; in your opinion, what is the advantage of using EncyclopeDIA within Galaxy?

Well, the desktop application is very convenient for somebody who is running a single analysis that's not too large, so that it will actually fit on their personal computer. But the advantage Galaxy gives is that it allows you to scale up your work, and it allows the tool to integrate with other things that you do. So let's consider scaling first. If you have a lot of large files that you are going to process with EncyclopeDIA, you would have to move them all to the personal computer you're running the desktop application on, and then you would have to run them one at a time, waiting and clicking things at the appropriate moment to make the application work. Most of us, once we get our process down, would like it to be more automatic than that, and Galaxy provides that: one can submit multiple jobs at the same time and let Galaxy run them wherever and whenever it can, with as much memory and disk space as needed.

The second aspect is that Galaxy provides nice integration with the other applications that you might run. For instance, in processing EncyclopeDIA data, you usually need to convert the files first from the raw format coming from the mass spectrometer and get them into the right format so that they're ready for EncyclopeDIA, which then has to run a couple of different stages: first, it needs to generate a spectral library, and second, it has to run the searches for peptides and proteins on the individual datasets.
Right. So yes, I can definitely see the advantages: you can start with the mass spec data and end up with your results without having to transfer your data between various machines or applications. That sounds great. The other question I have: you have implemented quite a few software tools within Galaxy. What were the major challenges you faced in implementing EncyclopeDIA in particular?

Well, first I'm going to pay a small compliment to the application's developer. Often, when people make desktop tools, they make assumptions about how the code will be run, and in doing so they sometimes intermix the code for the user interface with the actual computation. In this case, EncyclopeDIA has a nice clean break between the computation code and how you access it, whether through the desktop graphical user interface or via the command line. That made it fairly easy for it to be run as a command-line application in Galaxy. The second thing the developer did was not assume that only one user would be running one instance of the application at a time. That assumption often shows up in desktop applications, which may manipulate user preference files every time somebody runs something; in Galaxy, if you're submitting 100 jobs simultaneously, they can't all be trying to manipulate the same common file. All of this is kept separate in EncyclopeDIA, which is great.

However, it did take some effort to figure out how EncyclopeDIA actually conducts a workflow when you use the desktop application. There, you give it multiple files and click on the search-to-library function, I believe; then later you give it some other files and tell it to go ahead and search for your proteins, making use of that spectral library.
What you don't see, or what I couldn't see looking at the command-line interface, were the assumptions built into the desktop application about how files are named and which intermediate files need to be kept between stages of the process. It took me a while to discover that, both by talking with the developer and by inspecting the desktop application's code very carefully. One other aspect that comes up with a new application, of course, is that there are new data formats that have to be passed between tools. That involved adding a couple of special spectral-library datatypes to Galaxy. In this case, one is called DLIB, the spectral library produced by the search-to-library step from the gas-phase fractionated files. Then, when you do the final protein search, it produces an ELIB, which is an extension of the DLIB format. It was important to include these as new datatypes within Galaxy so that it could handle the handoff between the various tools within Galaxy.

Excellent. Thank you very much, JJ, for giving us a developer's perspective on the EncyclopeDIA workflow. As a user of Galaxy, I'd really like to thank you for doing that, and I'm sure the users in this workshop will appreciate the work you've put into it. So let's move on to Emma, who will now take us through using the EncyclopeDIA tool within Galaxy. Thanks, JJ.
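As a brief note on the DLIB and ELIB datatypes JJ described: EncyclopeDIA's library files are SQLite databases, so they can be opened with standard SQLite tooling. The sketch below shows the general idea on a throwaway file; the file name and table schema here are illustrative stand-ins, not the real EncyclopeDIA schema.

```python
# EncyclopeDIA's .dlib/.elib library files are SQLite databases and can
# be inspected with Python's built-in sqlite3 module. The demo file and
# its 'entries' table below are hypothetical stand-ins, not the actual
# EncyclopeDIA schema.
import os
import sqlite3
import tempfile

def list_tables(path):
    """Return the table names in a SQLite file such as a .dlib/.elib."""
    con = sqlite3.connect(path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [r[0] for r in rows]

# Create a throwaway file standing in for a chromatogram library.
path = os.path.join(tempfile.mkdtemp(), "demo.dlib")
con = sqlite3.connect(path)
con.execute("CREATE TABLE entries (PeptideSeq TEXT)")
con.commit()
con.close()

print(list_tables(path))  # ['entries']
```

Pointing the same `list_tables` helper at a real .dlib or .elib downloaded from a Galaxy history is a quick way to sanity-check a library file outside of Galaxy.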