Hello, my name is Tim Griffin. I am going to give the introduction to our workshop for GCC 2021 on pandemics research using mass spectrometry. A quick overview of the objectives of our workshop: we are going to introduce Galaxy as a framework for data analysis across the omics domains, focusing on MS-based proteomics. We hope to provide some hands-on experience using Galaxy in these applications; demonstrate how Galaxy can be used for mass spectrometry-based proteomics, as well as a little multi-omic analysis, using COVID-19 studies as representative data; and lay a foundation so that you can take this away and use Galaxy in your own work or, if you are a developer, contribute your tools to the Galaxy community. So, who are we? We're at the University of Minnesota. The other instructors you will hear from today are Pratik Jagtap, Subina Mehta, Andrew Rajczewski, and An Nguyen. We are part of the Galaxy for Proteomics, or Galaxy-P, team at the University of Minnesota. It is very much a global effort: a number of other contributors have made this possible, some of whom I've listed here, along with a few of the funders that have helped support this work over several years. As for the topics of the workshop, we'll mainly be focusing on mass spectrometry-based proteomic informatics data. We'll apply this to COVID-19 research, obviously a very timely application, and you'll also see a little of the idea of metaproteomics, a more multi-omic approach. In the end, what you're going to see is how we use a mass spectrometer to determine the linear sequence of the proteins present in a sample. We're going to determine the sequences of amino acids, ultimately from peptides that match the proteins, and that is how we're going to determine which proteins are present in these samples.
So, as a bit of a primer before we move on, I'm going to give a little background on mass spectrometry-based proteomics, how it works, and the data type you're going to be dealing with today. When we talk about mass spectrometry for proteins, there are a number of ways we can measure the molecular weight, or mass, of proteins. We can look at the intact protein, or we can take those proteins, which are, again, linear chains of amino acids that are covalently bonded together, and use trypsin, which finds arginines and lysines within that sequence and breaks the protein down into peptides: smaller amino acid chains that come from that protein. We then detect these peptides, fragment them, and sequence the amino acid sequences of these peptides within the mass spectrometer. So we're going to break proteins into peptides and then build them back up to determine proteins. How do we do this? The general approach is to take a mixture of proteins from any given source and, again, digest these into peptides using trypsin, then fractionate those peptides using liquid chromatography, which acts as a sort of molecular turnstile. Similar to taking a crowd through a turnstile at a large event, we take this crowd of peptides through chromatography, which simplifies the mixture and separates out the peptides, so that they are introduced through an ionization method, electrospray ionization, into a mass spectrometer. The ionization method puts a charge on the peptides and puts them in the gas phase. These would all be different peptides of different sequences and different sizes that are detected at any given moment in the mass spectrometer. They are isolated, the isolated peptides are then fragmented, and the masses of the fragments that came from the starting peptide are recorded.
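As an aside (not part of the spoken workshop), the tryptic digestion rule described above can be sketched in a few lines of Python. This is a toy illustration, not one of the workshop's tools; the example protein sequence is invented.

```python
# A minimal sketch of in-silico tryptic digestion: trypsin cleaves
# C-terminal to lysine (K) and arginine (R), with the common convention
# of skipping cleavage when the next residue is proline (P).

def trypsin_digest(protein, missed_cleavages=0):
    """Return the tryptic peptides of a protein sequence."""
    # Cleavage points: after K or R, unless followed by P.
    sites = [i + 1 for i, aa in enumerate(protein)
             if aa in "KR" and protein[i + 1:i + 2] != "P"]
    bounds = [0] + sites + ([len(protein)]
                            if (not sites or sites[-1] != len(protein)) else [])
    peptides = []
    for m in range(missed_cleavages + 1):       # allow m missed cleavages
        for j in range(len(bounds) - 1 - m):
            peptides.append(protein[bounds[j]:bounds[j + 1 + m]])
    return peptides

print(trypsin_digest("MKWVTFISLLR"))  # ['MK', 'WVTFISLLR']
```

Real digestion tools add further rules (minimum peptide length, semi-tryptic peptides), but the cleavage logic is the core idea.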
This happens very quickly, in a matter of tens of microseconds, such that the mass spectrometer records the fragmentation pattern for one peptide, remembers the peptides that were there a moment ago, goes and gets the next peptide, and fragments that peptide to create another MS/MS, or tandem, mass spectrum. These fragmentation spectra are unique to the amino acid sequence of the peptide that was selected. So how do you then put an amino acid sequence to this fragmentation spectrum? The unannotated MS/MS, or fragmentation, spectrum is put through a sequence database searching program. This program takes the proteins within a database, which could be predicted or known proteins that have been sequenced, creates theoretical fragmentation spectra for all those potential sequences, and then does a match, so that each observed spectrum is matched to the theoretical fragmentation pattern that best fits it. When this works, as we're showing here, what we get is an annotated mass spectrum, where amino acid sequences are mapped to all these fragment pieces, and that gives us the full sequence of whatever peptide created this tandem mass spectrum. Once we match a peptide sequence to our MS/MS spectrum, we can relate that back to the protein, because we started with proteins and digested them to peptides. So, if we desire, we can take the peptides we've identified and infer a protein. Usually, if a protein is present, we will detect and sequence more than one peptide from that protein, which gives us more confidence that the protein is actually present. So you'll usually see multiple identified peptides that map back and help you infer the identity and presence of proteins. As the topic of this workshop suggests, this is very much a bioinformatics-driven analysis of the data.
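To make the "theoretical spectrum" idea concrete, here is a small Python sketch (again illustrative, not a workshop tool) of how a search engine computes singly charged b-ion (N-terminal prefix) and y-ion (C-terminal suffix) masses for a candidate peptide; the peptide used is arbitrary.

```python
# Build the theoretical fragment ladder a database search engine would
# compare against an observed MS/MS spectrum.

RESIDUE = {  # monoisotopic residue masses (Da)
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.00728, 18.01056

def theoretical_ions(peptide):
    """Return (b_ions, y_ions) m/z lists for singly charged fragments."""
    b, y = [], []
    running = 0.0
    for aa in peptide[:-1]:           # b1 .. b(n-1): prefix + proton
        running += RESIDUE[aa]
        b.append(round(running + PROTON, 4))
    running = 0.0
    for aa in reversed(peptide[1:]):  # y1 .. y(n-1): suffix + water + proton
        running += RESIDUE[aa]
        y.append(round(running + WATER + PROTON, 4))
    return b, y

b_ions, y_ions = theoretical_ions("PEPTIDE")
```

Matching these predicted m/z values against the observed fragment peaks, within a tolerance, is what annotates the spectrum.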
What you're going to learn about today is that a number of tools are necessary to do this identification of peptides and inference of proteins. And we'll take it a step further: you're going to see an application where we also need to identify which organism a particular peptide sequence comes from, and how we validate that these are actually confident matches. So it's very much a multi-step bioinformatics challenge. How do you approach that? The way we have approached it is by extending Galaxy. Without going into too much depth on Galaxy, what it gives us in this application area is a very nice, unified environment to take in data; to integrate and implement different software tools so that they work together; and to run all this on powerful, scalable computing resources, whether in the cloud or locally. This all comes together within the Galaxy environment, driven by a user interface that makes it usable by non-expert programmers, or through the API, where it can be operated by those with a bit more programming expertise. So it brings all of this together as a really nice platform to carry out these complex bioinformatics and multi-omic data analysis workflows. One thing you're going to hear a fair amount about today is the idea of defining workflows, and also histories. As I've alluded to, one of the key pieces here is that as these analyses become more complex, they go from a single software tool to needing multiple software tools. Those software tools then need to be chained together, or integrated, so that the output from one software tool acts as the input for a subsequent tool. Ideally, you would like to develop an integrated workflow where one can put in input data and, in an automated way, have that data go through the workflow.
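The chaining idea above can be reduced to a toy Python sketch: each "tool" is a function whose output feeds the next tool, and a workflow is just their ordered composition. The tool names here are invented stand-ins, not actual Galaxy tools.

```python
# A workflow as function composition: output of one step is input to the next.
from functools import reduce

def run_workflow(steps, data):
    """Apply each step, in order, to the output of the previous one."""
    return reduce(lambda d, step: step(d), steps, data)

# Toy stand-ins for database search -> filtering -> reporting.
search  = lambda spectra: [{"peptide": s.upper(), "score": len(s)} for s in spectra]
filter_ = lambda psms: [p for p in psms if p["score"] >= 5]
report  = lambda psms: sorted(p["peptide"] for p in psms)

workflow = [search, filter_, report]
print(run_workflow(workflow, ["peptide", "tag"]))  # ['PEPTIDE']
```

Galaxy's contribution is doing this at scale with real tools, while recording every parameter and intermediate dataset.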
Each subsequent tool takes in an input and produces an output that may serve as the input to the next tool, until you ultimately get to an end result, the desired results file, which one can utilize and export off of Galaxy, or even visualize on Galaxy. That's the idea of the workflows you're going to see here: different tools that have been plugged together and work in an integrated way. In Galaxy, we can save those workflows, with all the various parameters of each tool in an optimized state, so that one can share and access them, making the analysis much more reproducible. We will also see histories. A history is basically a record of a workflow acted upon data: an archived record of an analysis, in which one has analyzed data using that workflow. It saves all of the input data, all the intermediate data, and the final output, so you can see what happened and, again, share that with others, helping these analyses to be much more reproducible as well as transparent. Those are a couple of the key things you're going to learn about today. With that, I'm going to end the introduction section; the next thing you'll hear about is the specifics of applying these workflows and tools to COVID-19 data. Now that we have learned about the mass spectrometry methods that are used to detect peptides and proteins, let us start looking into the Galaxy workflows that we have developed for the analysis of COVID-19 mass spectrometry datasets. On the website here, you'll find Galaxy workflows that we have developed for COVID-19 mass spectrometry dataset analysis. These workflows were developed to detect peptides from both clinical and cell culture samples, and using them, we have been able to detect peptides spanning the entire SARS-CoV-2 proteome.
We've also developed metaproteomics analysis workflows that can be used to detect potential co-infecting microorganisms in COVID-19 patient samples. To understand a little more about COVID-19 detection, let us go back to the methods that were used in the early pandemic to detect COVID-19 in patients. Swab samples were collected and used to extract RNA. This RNA was converted to DNA, and the DNA was then amplified with SARS-CoV-2-specific primers using the polymerase chain reaction. The PCR results were used to assess the severity of infection. This was primarily used for active infections. For prior infections, the tests used were basically antibody tests, wherein blood samples were collected from individuals and the presence of antibodies from a prior infection was used to interpret whether the individual had been exposed to SARS-CoV-2. During these early times of the pandemic, mass spectrometry researchers also looked into the possibility of using mass spectrometry as a method to detect peptides in clinical samples. As you can see here, multiple labs from around the world collected samples from multiple body locations, including gargling solutions from the upper respiratory tract, nasopharyngeal samples, and, in some cases, deep lung tissue samples from deceased patients. Using these samples, as well as cell culture samples, we were able to construct a panel of peptides that could be used for data analysis. In cell culture samples, for example, one could look into the deeper proteome of the SARS-CoV-2 virus, since one could infect cell lines with the virus and collect samples at various time points. So, to generate this peptide panel, we started with some clinical samples, as shown here, as well as some studies that used in silico approaches to predict the proteins present from the genome that was available in the early days.
These methods were basically used to detect any potential candidates for diagnosis or for vaccination. As I mentioned earlier, we also looked at some cell culture datasets, and these helped us get a deeper understanding of the proteome of SARS-CoV-2. Using these multiple datasets, we were able to generate a peptide panel: in the first step, we used these datasets, either cell culture or clinical, to generate a 639-peptide panel using database search workflows. In the second step, we used the 639 peptides to interrogate whether these peptides were present in clinical samples. As a result, we were able to look at the entire proteome, shown here in the outer circle, with each section showing a protein that is potentially expressed from the SARS-CoV-2 genome. In the inner circle you can see the peptides we detected from our studies; this is the 639-peptide panel I mentioned earlier. We could also go back and look at our clinical datasets, and shown here in the second innermost circle are the peptides that were specific to clinical samples. Using this analysis, which was published in Clinical Proteomics this year, we were able to find peptides that could be used for detection in clinical samples. The other part of workflow development we were interested in was finding out whether we could detect any co-infections in COVID-19 patients. The reason we became interested in this is that in many cases, especially in patients who have recovered from COVID-19 infection, there are secondary infections that sometimes linger because of a weakened immune system or weakened lung function. For example, the patient could be affected by a prior infection, or could become infected during hospitalization. These infections, called nosocomial infections, can also affect antibiotic treatment plans, since some of these organisms can be antibiotic resistant.
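The two-step panel strategy described above can be sketched schematically in Python. The datasets, peptide sequences, and counts below are invented for illustration; the real panel came from the published clinical and cell culture studies.

```python
# Step 1: pool peptides detected across datasets into a panel.
# Step 2: keep only panel peptides also observed in clinical samples.

def build_panel(datasets):
    """Union of peptides detected in any input dataset."""
    panel = set()
    for peptides in datasets.values():
        panel |= set(peptides)
    return panel

def clinical_subset(panel, clinical_hits):
    """Panel peptides confirmed in clinical samples."""
    return panel & set(clinical_hits)

datasets = {"cell_culture": ["AVLQSGFR", "GFYAEGSR"],
            "in_silico":    ["AVLQSGFR", "IRQGTDYK"]}
panel = build_panel(datasets)                     # 3 unique peptides
confirmed = clinical_subset(panel, ["AVLQSGFR"])  # {'AVLQSGFR'}
```

The real analysis works the same way at a larger scale: 639 panel peptides, then the subset detectable in clinical runs.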
In most cases, culture-based methods are used to detect the presence of any secondary infection. However, that can take time and may allow progression of the disease. There have been quite a few cases of fungal infections after COVID treatment in India during the recent pandemic wave it experienced. So, to study whether there are any other microorganisms present, we looked at the gargling solution, oropharyngeal, and respiratory tract samples that were previously published and initially used to detect COVID-19 peptides, and we used a workflow developed within the Galaxy framework to detect any microorganisms present. This was published in the Journal of Proteome Research this year. In our analysis, we found Streptococcus pneumoniae, which is known to cause pneumonia and respiratory tract infections. We also found some other organisms, such as Lactobacillus, Pseudomonas species, and Acinetobacter species. As a result of this work, we now have Galaxy workflows available that can be used not only on published data, but also on any new datasets that become available from various parts of the world. We have also published these workflows in the two manuscripts I mentioned earlier. And now it is time for Subina, Andrew, and An to take you through the hands-on session for detecting peptides from SARS-CoV-2-positive samples, as well as detecting microbial or secondary infections that could be caused by co-infections during the pandemic. Thank you. Hello, now is the time for the practical portion of this workshop, where we will take you through the detection of COVID-19 peptides in patient data. This section will be presented by a trio of researchers on the Galaxy-P team at the University of Minnesota: myself, Subina Mehta, Andrew Rajczewski, and An Nguyen.
To begin, you'll need to create an account at usegalaxy.eu using a valid email address. Once you have registered, you can use usegalaxy.eu to access the materials you will need for the workshop. Within the usegalaxy.eu webpage, bring your cursor to the toolbar at the top of the screen and click on Shared Data, then Histories, and then COVID-19 inputs, which will bring you to a history containing all the data necessary to begin your analysis. Click on the plus sign to import the history into your account. Now go ahead and click on Shared Data again; this time click on Workflows and then the COVID workflow. This is the automated workflow we'll be discussing. As before, import the workflow so you can start using it. You should now be looking at the workflow in your account. To get the ball rolling, click on the run (play) icon. Now you'll see the workflow in your center pane. Select the appropriate inputs for running this workflow: choose the sample one dataset as the raw file in the drop-down menu, and look at all the tools that have been queued up. Once you have done that, run the workflow. Your tools will queue up in the right-hand pane. So, I hope you were able to run the workflow on the dataset provided; if not, we can go over that again. Let me give you some background information on what we just did. The data we used today was from a group which performed quantitative proteomic analysis on oro-nasopharyngeal swabs used for COVID-19 diagnosis. They had patient data from both positive and negative samples. The question we are trying to answer is whether COVID-19, as well as any other co-infecting organism, is present. Here is the Galaxy view of the workflow you just ran. It might look a bit complicated, but we have just used a few text manipulation tools to make the tools compatible with each other. Here is a somewhat easier version. Let me go step by step.
Firstly, the main inputs of the workflow are the protein FASTA database, comprising human proteins, contaminants, and the COVID-19 proteins, and the MS/MS data. The first step is to perform database searching, for which we use the SearchGUI/PeptideShaker combo. We will go into each of these tools later on. First, we use the peptide report from PeptideShaker to extract confident peptides, on which we perform Unipept analysis to get taxonomy. We look at the organisms that have at least two peptides, select those peptides, and perform PepQuery analysis on them. PepQuery is the validation tool we use: we eliminate the presence of human and contaminant proteins by providing that database. Once we get the output, we perform a filtering step in which we extract peptides based on p-value and confidence. These peptides are then subjected to another step of validation, spectral validation, for which we use the mzIdentML file, a PeptideShaker output, to create an SQLite database with the mzToSQLite tool. This SQLite database can then be viewed with the help of the Multi-omics Visualization Platform (MVP) within Galaxy. The platform has built-in Lorikeet visualization to visualize the spectra. We then manually examine the spectra, extract those peptides, and perform Unipept analysis to determine the presence of COVID-19 as well as other organisms. Having done all that, we dare say you have noticed that the workflow is not finished. This is typical, as considerable time is generally needed for these tools to run. In the meantime, we would like you to be able to analyze some data for yourself. Go back up to the toolbar at the top of the screen, click on Shared Data and then Histories, and finally open up the COVID sample one history. Import this history by clicking on the plus sign. We will interrogate this data shortly.
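The "at least two peptides per organism" selection described above can be sketched in Python. The peptide-to-organism assignments below are invented; in the real workflow they come from the Unipept taxonomy output.

```python
# Group peptide -> organism assignments and keep organisms that meet
# the minimum-peptide cutoff used in the workflow.
from collections import defaultdict

def organisms_with_min_peptides(assignments, min_peptides=2):
    """assignments: iterable of (peptide, organism) pairs."""
    by_org = defaultdict(set)
    for peptide, organism in assignments:
        by_org[organism].add(peptide)
    return {org: sorted(peps) for org, peps in by_org.items()
            if len(peps) >= min_peptides}

assignments = [("AVLQSGFR", "SARS-CoV-2"),
               ("GFYAEGSR", "SARS-CoV-2"),
               ("IRQGTDYK", "Homo sapiens")]
print(organisms_with_min_peptides(assignments))
# {'SARS-CoV-2': ['AVLQSGFR', 'GFYAEGSR']}
```

Only the peptides of organisms passing this cutoff are forwarded to PepQuery for validation.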
Hello, my name is Andrew Rajczewski, and I'll be getting into some of the main tools of the workflow with you. First, there is the paired duo of SearchGUI and PeptideShaker. SearchGUI is a collection of peptide search engines that run together, with eight to nine built-in algorithms such as X!Tandem, Comet, MS-GF+, et cetera. We provide it with all the identification parameters for database searching, such as the mass spectrometer resolution, fixed modifications, variable modifications, et cetera. SearchGUI then generates a report. PeptideShaker takes the SearchGUI outputs and ranks and filters the peptide-spectrum matches (PSMs) based on their quality, selecting those PSMs that are better than a 1% false discovery rate. PeptideShaker then creates tabular peptide report files, as well as an mzIdentML file. We use the peptide report and the mzIdentML file for downstream processing. The next major tool is called Unipept, an open-source application that was purpose-built for metaproteomic analyses. It can perform both functional and taxonomic annotation, but for this tutorial and workflow we are performing only taxonomic annotation. We select organisms with at least two peptides, as this is a general standard for confidently assigning peptides to a target in protein mass spectrometry. Generally, this strategy is used to assign peptides to a protein, but in our case we assign peptides to an organism in the community. We then extract those peptides for peptide validation analysis. There are two main steps we use for validation. First is the PepQuery search engine, where we eliminate the presence of host proteins and extract validated peptides for our identified organisms. With that accomplished, we then manually assess the spectral quality of these validated peptides. This is necessary because peptides which pass PepQuery do not always have the highest-quality spectra, making them ill-suited for reliable automated detection.
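The 1% FDR filtering that PeptideShaker performs is typically estimated with a target-decoy approach. Here is an illustrative Python sketch of that idea (the scores and decoy labels are invented, and PeptideShaker's actual scoring is more sophisticated): walking down the score-sorted list, the FDR at a cutoff is estimated as the ratio of accepted decoys to accepted targets.

```python
# Target-decoy FDR filtering: accept PSMs from the top of the score-ranked
# list until the estimated FDR (decoys / targets) exceeds the threshold.

def filter_at_fdr(psms, max_fdr=0.01):
    """psms: list of (score, is_decoy). Return target scores passing the cutoff."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    accepted, decoys, targets = [], 0, 0
    for score, is_decoy in ranked:
        decoys += is_decoy
        targets += not is_decoy
        if targets and decoys / targets > max_fdr:
            break  # estimated FDR now too high; stop accepting
        if not is_decoy:
            accepted.append(score)
    return accepted

psms = [(95, False), (90, False), (80, True), (70, False)]
print(filter_at_fdr(psms))  # [95, 90]
```

The decoy hit at score 80 pushes the estimated FDR above 1%, so everything at or below that score is rejected.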
This figure summarizes how the PepQuery search engine works; it was generated by the creators of the tool, the Zhang Lab at Baylor College of Medicine. First, our peptides of interest are compared with the MS/MS data to generate PSMs with matching spectra. These are then compared to a reference database and scored based on whether there are better matches to the PSM spectra among the reference peptides. Then the PSMs are compared to a library of randomly generated peptides to see whether a match to the spectra can be found at random, increasing the robustness of the match. Finally, if enabled, PepQuery performs what is called unrestricted modification searching, to see whether the reference proteome can form a better match to our PSM spectra when post-translational modifications are taken into account. Peptides of interest with PSMs that have no match to the reference proteome, with or without post-translational modifications, and no matches to random peptides, are considered to have passed. When PepQuery is finished, it returns a number of tabular files, including the PSM rank file. We can then filter the contents of this output to eliminate those peptides that show too close a match to the reference database, which in our case is the human host proteome supplemented with common mass spectrometry contaminants. Theoretically, peptides with a p-value less than 0.05 correspond to peptides which did not have a match in the reference database. When the more stringent unrestricted modification search is performed, there is an additional column called confidence, which designates the peptides as yes or no. We will filter the PSM rank file for those peptides designated as yes. These are our validated peptides. However, there is one last step to validation, and that is manual evaluation of the peptide spectra. Hello, my name is Andrew, and I will take you through the final validation step of the COVID-19 data.
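The filtering step just described can be sketched in Python over a made-up, PepQuery-style tab-separated table (the column names and rows here are illustrative, not PepQuery's exact output format): keep rows with p-value below 0.05 and, when the unrestricted modification search adds a confidence column, a confidence of "Yes".

```python
# Filter a PSM rank-style table on p-value and confidence.
import csv, io

TABLE = """peptide\tpvalue\tconfidence
AVLQSGFR\t0.001\tYes
GFYAEGSR\t0.200\tYes
IRQGTDYK\t0.010\tNo
"""

def validated_peptides(tsv_text, alpha=0.05):
    """Peptides passing both the p-value and confidence filters."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [r["peptide"] for r in rows
            if float(r["pvalue"]) < alpha and r.get("confidence") == "Yes"]

print(validated_peptides(TABLE))  # ['AVLQSGFR']
```

In the Galaxy workflow this is done with generic tabular filter tools rather than custom code, but the logic is the same.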
Spectra of peptides that pass the PepQuery validation are manually annotated using MVP, the Multi-omics Visualization Platform, which includes the Lorikeet visualization tool. Lorikeet utilizes the SQLite conversion of the mzIdentML file generated earlier in the workflow. We first manually look at the signal-to-noise ratio of the product ions in the MS/MS spectra; spectra containing product ions at least three-fold higher in intensity than the noise peaks are retained. Next, we inspect the degree of completeness of the b and y ion series, with passing spectra required to have at least three consecutive b or y ions. Peptides with spectra fulfilling these two requirements are considered valid targets for the detection of SARS-CoV-2. Here we present two examples of spectra. The first spectrum doesn't have three consecutive b or y ions, which is why we rejected it. The second spectrum, on the other hand, has more than three consecutive y ions, and its product ions are noticeably more intense than the noise peaks, so we accept it. Next, let's visualize the process and see how it works. In the completed history, click on the mzToSQLite dataset (dataset 10) to expand it, and then click on the visualization icon to visualize the data. From there, you can click on the MVP window, which will open the larger visualization window. Rather than going through all of this data, you can select "load from Galaxy", select the list of validated peptides, and then click "use for filtering". This will narrow down the list of peptides from several thousand to a handful. From there, you can select a peptide of interest, click on PSMs for selected peptides, and then inspect the spectra presented below. With all of this background presented, we encourage you to run the workflow for yourself: take a look at the PepQuery PSM rank file, visualize the mzToSQLite output using MVP, and annotate the passing peptides yourself. Then decide what you think the COVID and co-infection status of this sample is.
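The "at least three consecutive b or y ions" criterion used in the manual check above can be expressed as a small Python helper. This is a toy version of a judgment made by eye in MVP; the matched ion indices below are invented.

```python
# Check the ion-series completeness criterion: does the spectrum contain a
# run of at least three consecutive b or y ions (e.g. y3, y4, y5)?

def longest_consecutive(matched_indices):
    """Length of the longest run of consecutive ion numbers."""
    best, run, prev = 0, 0, None
    for i in sorted(set(matched_indices)):
        run = run + 1 if prev is not None and i == prev + 1 else 1
        best = max(best, run)
        prev = i
    return best

def passes_series_check(matched_b, matched_y, min_run=3):
    return (longest_consecutive(matched_b) >= min_run or
            longest_consecutive(matched_y) >= min_run)

print(passes_series_check(matched_b=[1, 4], matched_y=[2, 3, 4, 6]))  # True
print(passes_series_check(matched_b=[1, 3], matched_y=[2, 5]))        # False
```

The signal-to-noise criterion would be checked analogously, comparing matched product-ion intensities against the surrounding noise peaks.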
With this, we conclude our presentation. We would like to thank you for attending and engaging in this workshop at the 2021 Galaxy Community Conference. Below is a collection of links that you may find useful, including our contact information, a specialized Galaxy proteomics platform hosted by Galaxy Europe, repositories of our workflows, training materials for the use of the Galaxy platform, and, finally, publications detailing our COVID-19 proteomics research. We hope you enjoyed the workshop.