Hello, my name is Witold Wolski, and I work as a statistician and research scientist at the Functional Genomics Center Zurich and in the computational mass spectrometry group. I will be presenting the R package prolfqua, a comprehensive R package for protein differential expression analysis.

First, I will introduce protein quantification with mass spectrometry. Then I will mention some challenges we faced when implementing the R package, introduce its functionality and implementation, and show how we use tidy data, R6 classes, and interfaces to make the package easier to use. After that I will show usage examples and, on this occasion, talk about input formats and how to analyze data at the PSM, precursor, peptidoform, or protein level. Finally, I will briefly summarize the results of benchmarking our package against other packages and give an outlook on the further developments we plan.

What is protein quantification with mass spectrometry? It is a mass spectrometric method for determining the abundance of proteins in a sample relative to other, similar samples. Typically you start with biological samples such as cell cultures or tissue samples. You can treat cell cultures and compare them to a control, or compare healthy and diseased tissue samples. The proteins are extracted from the samples and then digested. The resulting peptides are separated using a chromatographic method. They are then ionized and fragmented in the mass spectrometer, and the masses of the fragments are recorded. Peptides and proteins are then identified and quantified. Basically, the recorded fragment masses are compared to theoretical peptide fragment masses computed in silico from protein sequence databases.

We have already talked about biological samples. I would now like to talk briefly about the experiments that can be analyzed using these methods.
As already mentioned, you can determine relative differences in protein abundance and their statistical significance. To do this, we report fold changes and false discovery rates. A simple design for such an experiment compares a treatment group to a control group of samples, but more complex designs are possible. One could also ask: does the treatment effect depend on whether the cells are wild type or carry some knockout? To answer this, we can test differences of differences. In the first comparison, we compare the treatment group to the control group in the wild-type cell lines. In the second, we compare the treatment group to the control group in the knockout cell lines. In the end, we compare the difference obtained in the knockout to the difference obtained in the wild type, which answers the question of whether the treatment effect depends on the cell line.

I want to mention one challenge we face when analyzing protein abundance data. Sometimes there is a large proportion of missing observations, specifically in plasma samples. For instance, a different set of proteins is identified and quantified in each sample, so only a few proteins are consistently measured in all samples, and there are big groups of proteins that are unique to some subgroups of samples. A second, more technical challenge, and actually a good thing, is that there is a large variety of software for identifying and quantifying precursors, peptides, and proteins, and these tools produce an even larger variety of output formats.

To address these problems, and also to organize the relatively wide range of functionality implemented in our package, we group the methods in R6 classes. We have classes that group methods for plotting and methods for aggregating precursor data to peptide-level data and further on. We have methods for abundance transformation grouped in the transformer class.
We have a summary class, which groups methods for data summaries such as computing the proportion of missing data or the number of peptides or proteins per sample. And to address the variety of software, we provide a configuration for each upstream software tool, which makes it possible to support their different output formats. When it comes to modeling, our package supports and implements several models, such as the ROPECA model and linear models with variance moderation. We also provide adapters to external packages, for instance to the proDA package, which implements a probabilistic dropout model, and the adapters to all these different models implement a contrast interface. Furthermore, we provide various model-building strategies that allow fitting linear models or linear mixed models to the data.

A typical usage example starts with the output of a quantification software, for instance FragPipe. The data is converted into a tidy data table, and the annotation, which contains the explanatory variables that will later be used in the modeling, is added to the output of the analysis software. All the data ends up in one large data frame, which the methods in prolfqua know how to handle thanks to the configuration. The data can then be aggregated, for instance from precursor to peptide level and from peptide to protein level, and at each stage diagnostic plots can be generated. For instance, here we summarize the data and plot the number of proteins and the number of peptides in each sample. Statistics can also be computed, such as the CV of each protein in each group, and we can show a violin plot of the distribution of the CVs of all proteins, or generate a heat map visualizing the distribution of missing values in the samples.
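To make the tidy-table idea above concrete, here is a minimal sketch in plain Python rather than R. It does not use the package's API; the table layout, the column names (`protein`, `sample`, `group`, `intensity`), the example values, and the helper functions are all hypothetical and chosen only for illustration. It reshapes a small wide table into one long table with the annotation attached, then computes two of the summaries mentioned: the fraction of missing values per sample and the CV of each protein within each group.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical wide export: one row per protein, one intensity per sample.
# None marks a protein that was not quantified in that sample.
wide = {
    "P1": {"s1": 100.0, "s2": 110.0, "s3": 200.0, "s4": 220.0},
    "P2": {"s1": 50.0, "s2": None, "s3": 55.0, "s4": 45.0},
}

# Hypothetical annotation: the explanatory variable for each sample.
annotation = {"s1": "control", "s2": "control", "s3": "treated", "s4": "treated"}


def to_tidy(wide_table, sample_annotation):
    """Return one record per (protein, sample) with the group attached."""
    rows = []
    for protein, samples in wide_table.items():
        for sample, intensity in samples.items():
            rows.append({
                "protein": protein,
                "sample": sample,
                "group": sample_annotation[sample],
                "intensity": intensity,
            })
    return rows


def missing_fraction_per_sample(tidy_rows):
    """Fraction of records without an intensity, for each sample."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in tidy_rows:
        total[row["sample"]] += 1
        if row["intensity"] is None:
            missing[row["sample"]] += 1
    return {sample: missing[sample] / total[sample] for sample in total}


def cv_per_protein_and_group(tidy_rows):
    """CV in percent per (protein, group); None with fewer than two values."""
    values = defaultdict(list)
    for row in tidy_rows:
        if row["intensity"] is not None:
            values[(row["protein"], row["group"])].append(row["intensity"])
    result = {}
    for key, observed in values.items():
        if len(observed) < 2:
            result[key] = None
        else:
            result[key] = 100.0 * stdev(observed) / mean(observed)
    return result


tidy = to_tidy(wide, annotation)
missing = missing_fraction_per_sample(tidy)
cv = cv_per_protein_and_group(tidy)
```

Keeping everything in one long table means each summary is just a grouping over columns, which is the reason a single configuration naming those columns is enough for downstream methods.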
Finally, to specify the model one uses our formula interface, fits the model, specifies the comparisons among the groups one wants to test, and then runs the tests. After running the tests we obtain an instance of the contrast class, which implements methods for plotting and visualizing the results, for instance a volcano plot, which shows the false discovery rate as a function of the differences among groups.

We also benchmarked the methods implemented in our package using benchmark data sets where the ground truth is known, and we compared them with methods implemented in other packages such as proDA, msqrob2, or MSstats. These results were published this year in the Journal of Proteome Research, and we showed in the publication that the methods implemented in our package perform similarly to, and sometimes better than, the methods in the other packages. It was also important to us that all the benchmarks can be reproduced and followed up. So on GitHub we have the prolfquabenchmark package, with vignettes that show how, starting from a benchmark data set, one can run the differential expression analysis using the MSstats package, the proDA package, or our package.
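As a rough illustration of how a benchmark with known ground truth can score a method, the following sketch is written in plain Python and is not taken from the package or its benchmark vignettes. It adjusts p-values with the Benjamini-Hochberg procedure to obtain false discovery rates, the quantity a volcano plot would show, and then compares the resulting calls to ground-truth labels. The p-values, the truth labels, the function names, and the choice of scoring metrics are assumptions made for this example; the talk does not state which metrics the published benchmark used.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def score_against_truth(fdr, is_true_change, threshold=0.05):
    """Compare calls at an FDR threshold with known ground truth."""
    called = [value <= threshold for value in fdr]
    true_pos = sum(1 for c, t in zip(called, is_true_change) if c and t)
    false_pos = sum(1 for c, t in zip(called, is_true_change) if c and not t)
    n_called = true_pos + false_pos
    n_true = sum(1 for t in is_true_change if t)
    return {
        "called": n_called,
        # False discovery proportion: share of calls that are wrong.
        "fdp": false_pos / n_called if n_called else 0.0,
        # True positive rate: share of real changes that were found.
        "tpr": true_pos / n_true if n_true else 0.0,
    }


# Hypothetical p-values for five proteins and made-up ground truth
# (True = the protein's abundance really differs between groups).
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
truth = [True, False, True, False, False]

fdr = bh_adjust(pvalues)
score = score_against_truth(fdr, truth, threshold=0.05)
```

Comparing the observed false discovery proportion with the nominal threshold is one simple way to check whether the FDR a method reports is trustworthy on a data set where the truth is known.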
Using the methods in our package one can also generate HTML reports describing the differential expression analysis experiment. These HTML reports consist of several sections. One such report is shown here at the top right of the slide. It contains project-related information, introduces differential expression analysis, sums up the design of the experiment, summarizes the number of protein identifications and the quantification results, discusses missingness, and shows coefficient of variation plots and a clustering of the data. Furthermore, it shows the differential expression analysis results with volcano plots and tables, which are interactive, so one can filter the data for a protein of interest. Finally, it explains the output formats and gives pointers to follow-up analyses such as GSEA or over-representation analysis. This package is also available on GitHub.

Now, to summarize: we use prolfqua to analyze LFQ, TMT, and PTM data, and we use it to create QC reports and to estimate sample sizes. We generate differential expression analysis reports in HTML and also export the results of the analysis, for instance to Excel format but also to SummarizedExperiment. We use the prolfqua package for teaching differential expression analysis in our protein informatics course, and we also use it for benchmarking new methods for protein quantification or identification.

As an outlook, we are working on improving the test coverage of our package; currently it is already above 80 percent. We want to review the examples and the code examples in our package. We want to move to simulated data for testing in order to reduce the R package size. And finally, we want to work on Bayesian modeling of protein quantification data to improve estimates and simplify the handling of missing data. What is also important is that our package is available on GitHub: it can be downloaded, and contributions are very welcome.
Last but not least, I would like to thank my colleagues from the Proteome Informatics Computational Mass Spectrometry group, Jonas, Christian, and Maria, as well as the entire proteomics team at the FGCZ, and the Technology Platform Fund of the University of Zurich, which provided some funding to develop the package. Thank you for your attention.