Introduction to proteomics, protein identification, quantification and statistical modeling. Before diving into this slide deck, we recommend you have a look at the following.

Proteins are macromolecules that perform many important functions in a cell. Protein-coding genes are transcribed into mRNA, which is translated into a chain of amino acids. The amino acid chain folds into secondary, tertiary and quaternary structures to form a functional protein. One gene may generate different proteins due to alternative splicing and post-translational modifications. Therefore, the proteome shows a higher molecular complexity. The proteome is defined as the entirety of proteins expressed by a genome or by a cell or tissue at a given time. The study of proteins is important because their identity and abundance are only partially predictable from DNA and mRNA information. This is due to alternative splicing, post-translational modifications, protein turnover and subcellular localization.

Mass spectrometry is the standard method for proteomic analyses of complex samples. In the classical bottom-up approach, proteins are enzymatically digested into peptides. Peptides can be analyzed with high sensitivity and throughput in a mass spectrometer (MS). The peptide mass is measured as a mass-to-charge (m/z) ratio. Only charged peptides can be measured in the MS. Tens of thousands of mass spectra are generated per sample. Each spectrum consists of many m/z and intensity pairs.

Different mass spectrometry-based proteomic techniques exist. The most common one is explorative or shotgun proteomics. It aims to identify as many proteins as possible from a sample. It comes in two flavors: data-dependent acquisition (DDA) and data-independent acquisition (DIA). A second technique is targeted proteomics. It measures only a predefined set of proteins with very accurate quantification. A third technique is mass spectrometry imaging. It measures the spatial distribution of peptides or proteins in thin tissue sections.
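To make the m/z measurement concrete, here is a minimal Python sketch that computes the m/z of a positively charged peptide from standard monoisotopic residue masses. It is an illustration only, not a validated mass calculator:

```python
# Monoisotopic residue masses in daltons (standard values; illustrative
# sketch only, not a validated mass calculator).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added once per peptide (the two termini)
PROTON = 1.00728   # mass of one proton, added per positive charge

def peptide_mz(sequence: str, charge: int) -> float:
    """m/z of a peptide ion: (monoisotopic mass + z protons) / z."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (mass + charge * PROTON) / charge

# The same peptide gives different m/z values at different charge states.
print(peptide_mz("PEPTIDE", 2))  # doubly charged, roughly 400.69
print(peptide_mz("PEPTIDE", 3))  # triply charged, roughly 267.46
```

This is why one peptide can appear at several m/z positions in a spectrum: each charge state has its own m/z.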
Plenty of software for DDA, DIA and mass spectrometry imaging is available in Galaxy. Here is an overview of all proteomics tools installed on the European Galaxy server. Other public Galaxy servers offer a similar or complementary proteomics toolkit. The Galaxy proteomics tools and training materials are constantly expanding and improving. This presentation will focus on explorative proteomics via the traditional DDA approach.

Proteomics experiments consist of three main steps: the sample is prepared for the analysis in the MS, then the sample is measured in the MS, and last, the obtained data is analyzed. Typical sample preparation steps include protein extraction, reduction and alkylation, tryptic digestion and desalting. Before tryptic digestion, disulfide bridges are reduced and cysteines are alkylated. This ensures that tryptic peptides are separated from each other and allows their mass-based identification. Trypsin cleaves the amino acid sequence C-terminal of arginine and lysine. Desalting is a clean-up step to protect the instrument from contamination and clogging.

Sample measurement in an MS consists of different steps. A high-performance liquid chromatography (HPLC) system is attached in front of the MS; it separates the injected peptide mixture according to the peptides' hydrophobicity. The peptides elute from the LC column into the MS within several minutes to hours. This reduces sample complexity and gives the MS more time for the measurement. The acidic LC buffer charges the peptides positively at their N-terminus and at the basic lysine or arginine amino acids on the C-terminus. The LC column is directly connected with the ion source needle. There, high voltage and heat are applied to evaporate the ionized peptides into the gas phase. This process is called electrospray ionization. Inside the MS, the mass analyzer separates peptides based on their m/z. The detector detects the peptide ions.
Typically, explorative proteomics is performed via liquid chromatography tandem mass spectrometry (LC-MS/MS). While the sample elutes from the LC column, thousands of mass spectra are acquired. First, a mass spectrum of all peptides present at this time point is measured. These mass spectra are called MS1 spectra. From these spectra, the N most abundant peptide peaks are determined. These top N peptides get fragmented. N is typically between 3 and 20; this example shows a top 3 method. The filter unit of the MS, a quadrupole, allows only these peptides to pass. One after the other is selected in the filter unit and then fragmented by collision with neutral gas molecules. This fragmentation breaks the peptide bonds and generates peptide fragments. The peptide fragments are measured again via the mass analyzer and detector. These spectra are called MS2 spectra. After all top N peptides have been fragmented and measured, another full MS1 spectrum is acquired. MS1 and MS2 spectra are acquired in this way throughout the elution of the sample from the LC.

The analysis of the acquired mass spectra comprises several steps. First, peptides are identified via their MS2 fragmentation spectra. From these peptide identities, the corresponding proteins are assembled. The MS1 spectra are used for peptide quantification. Peptide quantities are summarized into protein quantities. The information about protein identity and quantity allows subsequent statistical analyses. This was the overview of a typical explorative tandem mass spectrometry workflow. Now we will dive into more details of the data analysis part.

Many tryptic peptides of an organism have the same or similar masses. Therefore, MS1 spectra don't allow reliable peptide sequence identification. MS2 spectra allow peptide identification via the generated peptide fragments. The N-terminal fragments are called b-ions and the C-terminal fragments y-ions. The differences between the fragment masses correspond to the masses of single amino acids.
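The b- and y-ion series can be generated in silico from a peptide sequence. A minimal sketch, assuming singly charged fragments and listing only the residue masses actually used (a real tool covers all amino acids, modifications and higher charge states):

```python
# Monoisotopic residue masses (Da) for the residues used below; water and
# proton masses are physical constants. Illustrative sketch only.
RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "K": 128.09496}
WATER, PROTON = 18.01056, 1.00728

def fragment_ions(sequence):
    """Singly charged b- and y-ion m/z values of a peptide.
    b_i = first i residues + one proton; y_i = last i residues + water + proton."""
    b = [sum(RESIDUE_MASS[a] for a in sequence[:i]) + PROTON
         for i in range(1, len(sequence))]
    y = [sum(RESIDUE_MASS[a] for a in sequence[-i:]) + WATER + PROTON
         for i in range(1, len(sequence))]
    return b, y

b_ions, y_ions = fragment_ions("PEK")
# Consecutive b-ion differences recover single residue masses (here E):
print(b_ions[1] - b_ions[0])
```

Reading those mass differences off the spectrum is exactly how manual or de novo interpretation reconstructs the sequence.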
This allows manual interpretation of the spectra. However, this is a tricky procedure, because in reality the MS2 spectra contain more noise and side-product peaks than shown here. Also, in explorative proteomics tens of thousands of spectra are acquired, which makes manual interpretation unfeasible. The manual interpretation process is automated with so-called de novo sequencing software. These algorithms have improved in recent years. No information about potential protein sequences in the sample is needed.

The default software for peptide identification are so-called search engines. They require information about all protein sequences of the analyzed organism as a FASTA database. From this they generate in silico spectra, which are then matched to the measured mass spectra. This process is often called peptide spectrum matching. It starts with a protein sequence database of all protein sequences of the analyzed organism. Analogous to the procedure in the sample, the protein sequences are digested in silico. This means that the sequences are cut after each lysine and arginine. These in silico tryptic peptide sequences are then fragmented in silico. All amino acid bonds may potentially break and generate peptide fragments. Therefore, all possible fragments are generated in silico. For each in silico peptide fragment, the m/z is calculated. In case of amino acid modifications, the mass of the modification is added accordingly. Fixed modifications are added to each occurrence of the amino acid on which they occur. Variable modifications may not occur on every amino acid, and therefore two m/z values, with and without the modification, are calculated. These in silico generated m/z values are matched to the m/z values from each measured MS2 spectrum. A matching score allows finding the best identification for each MS2 spectrum. Potentially false matches may occur; therefore, the false positive rate is controlled.
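The in silico digestion step can be sketched in a few lines of Python. This is an illustrative simplification: it includes the common "no cleavage before proline" rule (an assumption not stated above), and a `missed_cleavages` parameter mimicking how search engines tolerate incompletely digested peptides; real engines additionally handle semi-tryptic peptides and length limits:

```python
import re

def trypsin_digest(protein, missed_cleavages=0):
    """In silico trypsin digestion: cut C-terminal of K or R, but not
    before proline (a common rule; real engines make this configurable).
    Returns peptides with up to `missed_cleavages` skipped cut sites."""
    # Split after K/R not followed by P; drop empty trailing fragments.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = []
    for i in range(len(fragments)):
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides

print(trypsin_digest("MKWVTFISLLFR"))     # ['MK', 'WVTFISLLFR']
print(trypsin_digest("MKWVTFISLLFR", 1))  # adds the missed-cleavage peptide
```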
This is done by adding decoy sequences to the protein sequence database. These sequences are generated by reversing or shuffling the real sequences and will not exist in the sample. In case such a sequence is considered a good match to an MS2 spectrum, this is a false match. One option to control the number of false positive matches is via a false discovery rate (FDR) cutoff, which includes the best matching scores with only 1% decoy matches included.

The protein sequence database is stored in a FASTA file. This is a text-based format to store DNA, RNA or protein sequences in a single-letter code. Each entry contains a header line and the sequence. The header line starts with a greater-than sign and is followed by a unique identifier. A proteome is the set of proteins thought to be expressed by an organism. Only proteins that are present in the FASTA file can be identified. But the more proteins are present in the FASTA file, the higher the chances for false identifications and the longer the computation time. Sources for proteome FASTA files include UniProt, neXtProt, NCBI, or DNA and RNA sequencing data.

After peptides have been identified, they need to be reconstructed into proteins. This step is called protein inference and is not trivial. Unique or proteotypic peptide sequences belong to exactly one protein, but other peptide sequences may belong to several proteins. These peptides are called shared or razor peptides. Most protein inference algorithms assign them to the protein that has the most other peptides. In the depicted example, peptide 2 would be matched to protein 1, which is certainly present in the sample because it has one unique peptide.

Different quantification approaches exist in MS-based proteomics. In explorative proteomic approaches, relative quantification methods are applied. They compare the amount of proteins between different samples. Label-free and label-based methods exist.
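The target-decoy FDR control can be sketched as follows. Assuming each peptide spectrum match carries a score and a flag marking whether it hit a decoy (e.g. a reversed sequence, `seq[::-1]`), the estimated FDR above a score threshold is the number of decoy matches divided by the number of target matches. This is a minimal sketch; real tools compute q-values rather than a single cutoff:

```python
def fdr_threshold(matches, fdr=0.01):
    """Walk from the best score downwards and return the lowest score
    threshold at which decoys / targets still stays <= fdr.
    matches: iterable of (score, is_decoy) pairs."""
    best, targets, decoys = None, 0, 0
    for score, is_decoy in sorted(matches, key=lambda m: m[0], reverse=True):
        if is_decoy:
            decoys += 1   # an accepted decoy signals a false positive
        else:
            targets += 1
        if targets and decoys / targets <= fdr:
            best = score  # this score still satisfies the FDR cutoff
    return best
```

With `fdr=0.01`, all matches scoring at or above the returned threshold are accepted, and about 1% of them are expected to be wrong.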
Labels add specific mass tags to the peptides of different samples via metabolic or chemical means. In label-free approaches, every sample is measured separately. Afterwards, the protein amounts are compared between measurements. Chemical labeling techniques add a mass label to the digested peptides. Afterwards, the samples are mixed and measured in one run. The different masses of the added labels allow distinguishing the origin of the proteins during data analysis. Depending on the labeling technique, up to 16 different labels exist. In metabolic labeling, amino acids with heavy isotopes are added to the cell culture medium of one condition. During cell growth, these amino acids get incorporated into proteins. Thus, proteins of the heavy condition can be distinguished from their normal counterparts via a fixed mass shift.

For label-free quantification, all peak areas in the MS1 spectra are integrated. Peptide abundances are summarized into protein quantifications. This requires decisions about which peptides to include in the summarization: only unique peptides? Only peptides with or without modifications? Last, the protein abundance may be computed by taking the median, mean, weighted mean or sum of all its peptides.

MaxQuant is the most popular non-commercial software for quantitative proteomics experiments. It performs protein identification via its Andromeda search engine. Protein quantification for label-free and many label-based methods is supported as well. MaxQuant accepts raw data in vendor-specific formats. Typical follow-up analyses include visualization, network and GO enrichment analysis. Finding differentially abundant proteins between different groups requires statistical analyses. MSstats is an open-source software for statistical modeling of quantitative proteomics data. It is compatible with complex designs of label-free and isobarically labeled quantification experiments. First, MSstats performs several processing steps.
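The peptide-to-protein summarization choices (median, mean or sum) can be sketched like this; the peptide sequences and intensity values below are hypothetical:

```python
from statistics import median

def summarize_protein(peptide_intensities, method="median"):
    """Collapse one protein's peptide intensities into a single abundance.
    peptide_intensities: dict mapping peptide sequence -> intensity."""
    values = list(peptide_intensities.values())
    if method == "median":
        return median(values)
    if method == "mean":
        return sum(values) / len(values)
    if method == "sum":
        return sum(values)
    raise ValueError(f"unknown method: {method}")

# Hypothetical peptides of one protein with their integrated MS1 peak areas.
peptides = {"PEPTIDEK": 1.0e6, "SAMPLER": 3.0e6, "TESTK": 2.0e6}
print(summarize_protein(peptides, "median"))
```

Which peptides end up in `peptide_intensities` is exactly the decision discussed above: unique peptides only, modified peptides included or not.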
Then it applies flexible linear models to detect differentially abundant proteins. MSstats takes identified and quantified spectral peaks from common proteomics software such as MaxQuant as input. For DDA data, MSstats starts with peptide-level data. It applies several feature selection and processing steps in order to account for proteomics-specific data properties. Afterwards, MSstats calculates new protein abundances and performs statistical modeling on them.

In addition to the results of the proteomics software, an annotation file is needed as input. In this file, the experimental design is described. It specifies conditions and biological and technical replicates. In case MaxQuant results are used as MSstats input, an additional column with the label type is needed. In a DDA experiment, the value is L for all conditions.

First, the input data is converted into an MSstats-compatible table. For this step, several parameters to filter and adjust the input data can be selected. Log transformation brings the peptide intensities into a close-to-normal distribution. Normalization aims to make the intensities of different runs more comparable to each other. The default normalization method is called equalize medians. It assumes that the majority of proteins do not change across runs. It shifts all intensities of a run by a constant to obtain equal median intensities across runs. Feature selection allows the use of either all or only the most abundant peptides for protein summarization.

The table represents the intensity values for the peptides of one protein. The dark gray fields represent missing intensity values of peptides. Missing values and noisy peptides with outliers are typical in label-free DDA datasets but influence protein summarization. For a reliable and robust statistical analysis, missing value imputation is recommended. Missing values in MaxQuant output are assumed to be missing because they were below the limit of detection.
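The equalize medians idea can be sketched in a few lines: compute each run's median log-intensity and shift every run by a constant so all runs end up at the common grand median. A minimal sketch with hypothetical run names and intensities, not the MSstats implementation itself:

```python
from statistics import median

def equalize_medians(runs):
    """runs: dict mapping run name -> list of log intensities.
    Shift each run by a constant so all run medians become equal."""
    grand = median(median(vals) for vals in runs.values())
    return {name: [x + (grand - median(vals)) for x in vals]
            for name, vals in runs.items()}

# Hypothetical log2 intensities: run2 is systematically 2 units higher.
normalized = equalize_medians({"run1": [1.0, 2.0, 3.0],
                               "run2": [3.0, 4.0, 5.0]})
print(normalized)
```

After the shift both runs share the same median, while the intensity differences within each run are untouched.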
This means the values are not missing at random, but for the reason of low abundance. Therefore, the values are only partially known and are called censored values. This may also be the case for very low intensity values, which might not be reliable and can be imputed. Censored intensity values are imputed via an accelerated failure time model. Alternatively, they may be replaced with a value obtained from the other measured intensities for the peptide and/or the run. Protein summarization is by default performed via Tukey's median polish for robust parameter estimation, with medians taken across rows and columns.

The calculated run-level protein summaries are used for statistical group comparison. Any two conditions can be compared to find differentially abundant proteins between them. MSstats uses a family of linear mixed models for this. The model is automatically adjusted for the comparison type according to the information in the annotation file. This means MSstats accounts for technical replicates, paired designs or time-course experiments automatically. Thank you for watching.
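Tukey's median polish fits an additive model (overall level + run effect + peptide effect) by alternately sweeping medians out of the rows and columns of the residual matrix. A minimal sketch on a runs-by-peptides matrix, using a fixed iteration count instead of a convergence check, and not claiming to match the MSstats implementation in detail:

```python
from statistics import median

def median_polish(matrix, iterations=10):
    """Tukey's median polish on a runs x peptides matrix of log intensities.
    Returns (overall, run_effects); the run-level protein abundance is
    overall + run_effects[i]."""
    rows, cols = len(matrix), len(matrix[0])
    resid = [row[:] for row in matrix]
    overall, row_eff, col_eff = 0.0, [0.0] * rows, [0.0] * cols
    for _ in range(iterations):
        # Sweep row (run) medians out of the residuals.
        for i in range(rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        # Sweep column (peptide) medians out of the residuals.
        for j in range(cols):
            m = median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
    return overall, row_eff

# Two runs, three peptides; run 2 is consistently higher than run 1.
overall, run_eff = median_polish([[1.0, 2.0, 1.5], [3.0, 4.0, 3.5]])
print(overall + run_eff[0], overall + run_eff[1])
```

The medians make the fit robust: a single outlier peptide shifts the run-level summary far less than it would with a mean-based fit.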