Right, a very good afternoon, morning, evening or night, depending on where you are in the world. Welcome to this webinar — this joint webinar, I should say — of EPIC-XS and the European Proteomics Association. This afternoon your chairs will be myself and my colleague Harald Barsnes. I will briefly introduce the EPIC-XS consortium as well as EuPA, and then without further ado we'll get started with the speakers. While you're all here: this session is being recorded, so we will make it available afterwards; if you would like to share these talks with other people, that will then be easy to do. We also had a workshop on ionbot earlier today, and that session has been recorded as well. In case you missed it and would like to see it, you will be able to, and we will release it at the same time as this session recording.

First and foremost, EPIC-XS. What is this? EPIC-XS is a European Proteomics Infrastructure Consortium funded by the Horizon 2020 programme of the European Commission. The idea is to provide access to proteomics analyses, and it consists of a consortium of 18 institutes, which you can see on the map of Europe there, that provide access to their proteomics facilities to all life-science researchers in the EU. What does EPIC-XS provide? It provides protein identification, quantitative analysis, PTM mapping, computational proteomics, and training courses and meetings — and of course you're in one of those right now. The access sites are in 12 different locations in Europe, and access, if you are interested in obtaining it, is handled by filing a proposal on the EPIC-XS website. You can see the URL at the bottom; it's easy: epic-xs.eu. That proposal will then be reviewed by people external to the project, and we select the proposals that go forward to actually being analyzed. One of the other things EPIC-XS does, by the way, is research, and these research activities are focused on improving the service provision. Today's speakers have all been selected from the project and will talk a little about the research they have been doing in EPIC-XS to advance the capabilities of proteomics informatics. So that's another aspect of EPIC-XS.

Then, what is EuPA? EuPA is the European Proteomics Association. I am actually the president of the European Proteomics Association — I'm Lennart Martens, by the way; it says Young Proteomics Investigators Club under my name because we're using the EuPA YPIC license for this webinar. The European Proteomics Association brings together all proteomics scientists in Europe, and it has several components, if you like — initiatives and working groups — but I've highlighted the two that are probably most relevant for today. First and foremost there is EuBIC, the European Bioinformatics Community, which is EuPA's bioinformatics initiative. It's a very open group that you can join very easily.
If you Google EuPA and bioinformatics or proteomics, you'll find them immediately. It's a really nice group to be a part of, they're very active, and I warmly recommend that you seek them out and contribute. Then there is also YPIC, the Young Proteomics Investigators Club — see also my label — and this is our early-career researcher group, which is also extremely active, extremely open, and always happy to accommodate anybody who wishes to contribute to EuPA, to the community, and to do some fun stuff together. If you would like to know more about EuPA, I've put the EuPA URL there. It is currently still a very old-fashioned-looking 90s website; we will very soon upgrade it drastically, so if you want to see the retro version you still have a short period of time in which to go and look at it, but afterwards it will be up to date. So don't hesitate to go and behold our not-so-very-beautiful website right now, soon to be revamped.

EPIC-XS organizes other webinars as well, and you can see a list of upcoming ones here; note that all times are in Central European Time. We've got one coming up on the 19th of May about proteomics and genomics integration, one on September 8th on top-down proteomics — which is also the topic of our first talk today (whoops, that went fast) — and then on the 3rd of November there is one on plasma proteomics, an evergreen topic in our field. Finally, we have a live, in-person workshop from the 26th to the 28th of September about new and advanced proteomics technologies, which is open to anyone interested. It will be in Tartu in Estonia, and registration will be handled through the EPIC-XS website on the events page, very much like this one. It is organized by the University of Cambridge together with the University of Tartu, and you can see a very nice picture of Tartu below. We look forward to welcoming anyone interested there to hear more about what EPIC-XS is doing and all of the research happening within EPIC-XS, which, to be quite honest, is expansive and interesting.

So with that I think we are done with the introduction, and I will give the word to my colleague Harald, who will chair the first session. The one thing I still need to tell you, though — sorry, Harald — is that during the presentations we will collect all the questions in the chat, and at the end of the two talks in the first session we will handle the questions for both speakers. In order to identify which speaker your question is addressed to, we ask you to put an @ followed by the first name of the speaker — you can of course see it in their label — so that we know whom you're asking. And if we have too many questions in the chat to answer right after the presentations, it will then be easy for the speakers to find the questions addressed to them, and they will reply to you in the chat afterwards as well. Okay, so don't forget to put the @ and then the first name of the speaker you wish to ask a question of, and now I will shut up and let Harald take over.

Thank you, Lennart.
Yes, so this first session is on the topic of a deeper look at the proteome at the smallest and largest scale, and as Lennart mentioned, we have two speakers, both giving half-hour presentations. The first speaker is Kyowon from Oliver Kohlbacher's group, where he is a postdoc, and his talk will be on bioinformatics advances for top-down, intact proteomics. So please go ahead, Kyowon.

Okay, thanks for the introduction, and hello everyone, I'm Kyowon Jeong. As introduced, I'm a postdoc in the Kohlbacher lab at the University of Tübingen in Germany. Today's seminar is about computational proteomics, and I believe we have lots of breakthroughs in bottom-up proteomics, which will be covered in the following great talks. In this talk, however, I would like to present some exciting advances in top-down proteomics.

Okay, let me quickly review top-down proteomics as compared with bottom-up proteomics. This sausage link illustrates an intact protein molecule, and each sausage a peptide. (Kyowon, I think you stopped your screen sharing. — Oops, sorry. Now it's back. Okay, now you see the sausage, right? Great.) In conventional bottom-up proteomics, the protein is digested into short peptides, and from each peptide an MS2 spectrum is generated, one by one like this. On the other hand, in top-down proteomics or TDP, the intact protein is analyzed as is, without digestion. Top-down proteomics is therefore well suited to studying proteoforms: the different forms of a protein produced from a gene, carrying genetic variations or post-translational modifications. Since proteoforms provide higher-resolution and more phenotype-proximal information than proteins, they are gaining more attention in medical and clinical research.

For us computational proteomics people, the major hurdle in top-down or intact-protein data analysis is the complex signal structure of proteoform ions. Let's begin with peptide ion signals first: due to their relatively small masses, peptide ions often have narrow charge ranges, and they usually come with small numbers of isotopes. Proteoforms, in contrast, generally have wide charge ranges and many isotopes, so a single kind of proteoform ion results in more than 100 peaks in a spectrum. This illustration shows only three charge states, but we often observe tens of charges from a single kind of proteoform. How does this distinctive nature of proteoform ions complicate data analysis? Assume the LC features of the same color here represent features from the same proteoform. They often coalesce and have very close m/z values, and in real spectra there is of course no color, so it is very hard to distinguish which features come from the same analyte. When we look at a specific MS1 full scan, the peaks are often very dense and redundant, and the MS2 spectrum from these highly charged precursors is quite complex and contains many peaks of different charge states. These complexities at different levels complicate various proteoform analyses, including identification and quantification. So how do we resolve this complexity? For example, could we turn these entangled features into simple monoisotopic mass features, could we convert this complex MS1 spectrum into a simple one consisting of monoisotopic mass peaks, and could we do the same for MS2?
This disentangling step is called mass deconvolution, and it is crucial for most top-down MS data analysis, because the downstream analyses, such as proteoform identification and quantification, depend on it: errors in the deconvolution propagate into the following analyses, which I will discuss a little more in the later part of this talk. Many excellent deconvolution tools have been introduced over the last decades; this list shows only a few of them. THRASH is, to the best of my knowledge, the first automated mass deconvolution tool. It was published more than twenty years ago, but it has been continuously improved, re-implemented, and used widely in many top-down software suites. Xtract is the most widely used in many top-down software suites, like ProSight PC/PD, and works well for isotopically resolved high-resolution spectra, while ReSpect is optimized for isotopically unresolved spectra. TopFD is the deconvolution tool in the TopPIC suite and works especially well for the deconvolution of MS2 spectra. ProMex is a feature-level deconvolution tool used with MSPathFinder from PNNL. UniDec is also gaining a lot of popularity, especially in analyses of native MS or CDMS datasets. Many vendor deconvolution tools have been introduced as well. They all come with their own strengths and specific applications, and certainly also with limits.

One of the major limits of deconvolution has been the running time: deconvolution itself often takes much more time than, for example, identification. To reduce the running time, many tools restrict the range of possible masses or charges, which leads to reduced sensitivity. To address such problems, we introduced FLASHDeconv, an ultra-fast feature and spectrum deconvolution tool that is part of the OpenMS software. FLASHDeconv is based on a simple preprocessing of the spectra that accelerates the decharging step, and it delivers not only very fast but also accurate deconvolution results. The speed-up comes from the decharging step, for which we use a simple log m/z transformation. Coming back to this signal illustration: decharging means determining the charge states of peaks. It can be done by measuring the distance between consecutive isotopes or by measuring the distance between consecutive charges; FLASHDeconv basically takes the second approach. If we calculate the distances between consecutive charge states, they are given by these blue formulas, and obviously the distance depends on the mass value m. When there are multiple masses, we get different charge patterns, which usually complicates efficient charge determination of peaks and, in turn, efficient deconvolution. But if we simply take the logarithm of the m/z values of all peaks, then the distance between consecutive charges becomes independent of the mass value — there is no m left in the formula — and the charge pattern is the same for different masses. So charge determination becomes a simple and very fast pattern search in this log-transformed spectrum; the arithmetic is sketched below. Thanks to this trick, FLASHDeconv showed orders-of-magnitude faster run times than commercial software like Xtract or ReSpect. The left side compares the run times of different deconvolution tools on a complex mouse sample: FLASHDeconv took only seven minutes to analyze the whole dataset, and for both MS1 and MS2 it typically achieves less than 20 milliseconds of processing time per spectrum. FLASHDeconv also outputs far fewer mass artifacts than the others, as shown on the right side.
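To make the decharging trick concrete, here is a short reconstruction of the arithmetic — my own sketch of the "blue formulas" on the slide, using m for the neutral mass, z for the charge, and m_p for the proton mass:

```latex
% Observed position of charge state z of a neutral mass m (proton adducts):
x_z \;=\; \frac{m + z\,m_p}{z} \;=\; \frac{m}{z} + m_p

% Spacing between consecutive charge states (after subtracting m_p):
\Delta(z) \;=\; \frac{m}{z} - \frac{m}{z+1} \;=\; \frac{m}{z(z+1)}
\qquad \text{(depends on } m\text{)}

% After a log transform of the proton-corrected positions:
\log\frac{m}{z} - \log\frac{m}{z+1} \;=\; \log\frac{z+1}{z}
\qquad \text{(independent of } m\text{)}
```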
The blue bars represent genuine masses, and other bars represent masses interpreted as artifacts. Convolution is certainly still the most important problem in top-down or intact ms data processing, but we also have different problems as in bottom-up. From the deconvolve of the features, proteoform quantification problem should be solved. And using deconvolved ms1 spectrum, more advanced data acquisition scheme could be deployed. And with convolved ms1 and ms2, proteoform ID and characterization method could be developed. Let me briefly go through each problem one by one. While feature level free quantification allows for accurate and experimentally simple quantification of proteoforms, not many dedicated quantification tools are found. We have a commercial software ProCyte PC in Proteus Discoverer, PCPD, to support label free quantification. And a free software called iTopQ is presented a few years ago, but it has not been maintained after publication. So we recently developed Flask DeconvQ, an open source tool for proteoform label free quantification. Flask DeconvQ takes four steps for quantification. It starts with detection of individual ion chromatograms from peaks in centroid ms1 data set. Using Flask Deconv, detected ion chromatograms from the same mass are grouped, and then the coiluted chromatograms are resolved by solving a simple least square problem. And lastly, the quantities of masses are calculated by summing up the area of the grouped result chromatograms. Let's look a little bit closer. Assume there are two proteoform chromatograms, color-coded with green and orange. The red chromatogram is the overlapping one. Flask DeconvQ takes the non-overlapping chromatograms, and from them generates theoretical shapes. And with these shapes, the overlapping red chromatogram is reconstructed by solving a non-negative least square problem. This shows an actual observed signal from an E. coli lysate data set. Green and orange features are from different proteoforms, and the red one is overlapping one. And the right side shows how Flask DeconvQ resolved collusion. And this resolving step gives an accurate quantity ratio as well. When we benchmarked Flask DeconvQ against just Flask Deconv, respect and I tap Q, with E. coli lysate, we observed Flask DeconvQ found the most jointly detected proteoforms, and also showed high portion of overlap. And most importantly, we looked at the quantification accuracy. The green boxes showed the coefficient of variation or CV values of low abundance masses, and orange boxes are the CV values of high abundance. For both cases, Flask DeconvQ showed minimum CV values, showing Flask DeconvQ is accurate, and also works well for low abundance proteoforms. And next, data acquisition is also an important issue in top-down proteomics. Let's think about a simple DDA acquisition in this MS1 spectrum. As shown before, red and green peaks represent the same proteoforms respectively. For instance, if we choose the most intense four peaks for fragmentation, they are the selected precursors, and they are all from the same proteoform. Different from the bottom-up cases, this criteria often ends up with the selection of redundant and often low-quality precursors. To tackle this problem, intelligent data acquisition methods like Autopilot and MetaDrive have been presented. 
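As a brief aside, here is a minimal sketch of the overlap-resolution step described above, using SciPy's non-negative least squares. The shapes, names, and data are illustrative only; this is not FLASHDeconvQ's actual implementation.

```python
import numpy as np
from scipy.optimize import nnls

# Illustrative elution profiles (intensity per retention-time point).
rt = np.linspace(0, 10, 200)
shape_green = np.exp(-0.5 * ((rt - 4.0) / 0.6) ** 2)    # non-overlapping template 1
shape_orange = np.exp(-0.5 * ((rt - 6.0) / 0.6) ** 2)    # non-overlapping template 2

# The "red" chromatogram is observed where both proteoforms co-elute:
observed = 3.0 * shape_green + 1.5 * shape_orange + np.random.normal(0, 0.02, rt.size)

# Solve observed ~= A @ x with x >= 0, where the columns of A are the templates.
A = np.column_stack([shape_green, shape_orange])
x, residual = nnls(A, observed)

# x[i] is the contribution of proteoform i; the summed areas give the final quantities.
areas = x * A.sum(axis=0)
print("estimated contributions:", x, "areas:", areas)
```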
Autopilot performs real-time deconvolution and identification to optimize precursor selection, and MetaDrive performs real-time deconvolution to select multiple precursors for fragmentation and boost MS2 spectrum quality. But both methods were limited by rather long processing times, as they have to fit within the duty cycle of the instrument. By leveraging the fast deconvolution of FLASHDeconv, we developed FLASHIda, an intelligent data acquisition scheme to boost proteoform identification sensitivity. FLASHDeconv runs alongside the Thermo instrument in real time: the MS1 spectrum (this red arrow) is provided in real time through the Thermo IAPI to FLASHDeconv, which performs real-time deconvolution. The deconvolved spectrum is then further processed and converted into a mass/quality spectrum from which high-quality masses are selected. The selected masses are converted back to m/z and fed back to the instrument through the IAPI. FLASHIda avoids coeluting precursors and controls the isolation window size in real time to select high-quality precursors. With this relatively simple workflow, we were able to achieve a large boost in the number of proteoform identifications. We compared FLASHIda runs (FI datasets) against standard data acquisition runs (ST datasets) from E. coli lysate single runs; the digits 30 and 90 specify the retention-time gradients. In the bar graph, the blueish bars are from FLASHIda and the reddish bars from standard acquisition. FLASHIda yielded almost twice as many proteoforms for runs with the same gradient, or, alternatively, it achieved a proteoform count similar to standard acquisition with about a third of the instrument time.

The next topic is identification. We actually have many excellent software tools for proteoform identification — very popular tools like ProSight PD and TopPIC, which are widely used and actively maintained. We also have pTop and MSPathFinderT; while they are less actively maintained, they provide sensitive proteoform identification complementary to ProSight PD and TopPIC. Professor David Tabb is preparing a paper with us on a comparative study of top-down proteomics software, and this diagram shows the analysis flow, in which different identification tools are evaluated on the same samples. Please look out for this informative paper, in which a detailed analysis of these tools will be provided. Now, given these brilliant tools, the question is: should we develop another one? Well, let me give you some explanation of why we may need one. Another analysis we are performing with David Tabb for the paper is to compare different deconvolution tools: we use a single identification tool, TopPIC, shown at the bottom, with different deconvolution algorithms. Let me present a simple proteoform ID overlap analysis, which will lead to an unexpected surprise concerning proteoform FDR estimation — and this may convince you of the need for another identification tool, or another post-processor, for proper FDR control. Let's say two proteins from two different runs overlap if their protein accessions are the same — pretty simple — and two proteoforms overlap if their protein accessions, their sequences, and their precursor masses are all the same. I rarely saw cases where the first and third conditions were met but the second was not, so the most decisive factors are the protein accession and the precursor mass. The evaluation metric was simply the overlap coefficient.
As this Venn diagram shows, the overlap coefficient is the number of intersecting elements divided by the smaller of the two set sizes: overlap(A, B) = |A ∩ B| / min(|A|, |B|). When it is zero percent, no overlap is observed; when it is 100 percent, we have perfect overlap. To compare the deconvolution tools, I minimized the variance coming from the sample or the identification tool: from the same E. coli Orbitrap sample, we used FLASHDeconv and Xtract for deconvolution, coupled with the same identification tool, TopPIC, at a proteoform-level FDR of one percent, and looked at the overlap coefficients. At the protein level we have 92 percent overlap — well, not so bad. At the proteoform level, though, it was quite low: 51 percent. This doesn't look very good. But maybe this is a case where FLASHDeconv and Xtract complement each other — a win-win. So I removed this variance as well by only taking the proteoforms identified from the same MS2 spectrum; these should be the same, because they come from the same MS2. At the protein level it was 96 percent — a bit too low; I would have wished for 98 or 99 percent, because the FDR is one percent. And at the proteoform level? It was only 49 percent, even lower than before. So this explicitly shows that the precursor masses from the two tools seriously disagree, which may be a very serious problem for both FLASHDeconv and Xtract. But what is even worse is that this shows that proteoform-level FDR control is failing very badly.

Okay, then how about we introduce precursor mass errors intentionally? I took only FLASHDeconv and designed the following analysis: to this typical deconvolution-plus-identification pipeline I added a "devil block", in which arbitrary masses between 10 and 20 are added to or subtracted from the deconvolved precursor masses. For example, when the reported precursor masses are here, some random masses are added or subtracted, and the result is used for TopPIC identification at one percent FDR. For this search, the number of proteoform-spectrum matches decreased by only 10 percent — not by 99 percent. Even worse, the number of proteoforms increased by 200 percent. This is somewhat expected, because all the different random, false precursor masses are interpreted as novel proteoforms. So one can guess that most false positives in TDP searches come from precursor mass errors. This becomes especially serious when blind modifications are allowed, and when many modifications are allowed per proteoform identification, the same problem occurs. Then why can our great target-decoy database approach not control this source of error? A simple way to see whether the decoy works is to compare the score distributions of decoy hits and of actual false positives. To obtain false positives, we searched this E. coli dataset against a human protein database — this is FP1, which simulates protein sequence errors. We also took the false positives with incorrect precursor masses from the previous devil-block pipeline — this is FP2. And lastly, we have decoy hits from a decoy-only database search. This is the match score distribution of the decoy hits (the higher the score, the better), and the second plot shows the distribution for FP1. They are quite similar, so the decoys do represent the false positives in FP1, in other words protein sequence errors. And this one shows the distribution for FP2: looking at the x-axis scale, we can easily see that the decoys and FP2 have very different distributions.
Not only does this show that decoys cannot simulate false positives arising from precursor mass errors, it also indicates that score cutoffs which do not take precursor deconvolution quality into account will have the same issue. Then what is the solution? We are trying an approach similar to the tool Quandenser from Lukas Käll's group, which used the notion of decoy MS1 features to control the MS1 feature match error rate: by using decoy MS1 features with Percolator, they showed less than 5% feature-level FDR. Likewise, we could extract features from the precursor deconvolution and use decoy precursor masses with Percolator. This is an ongoing project in our group. So far, we have considered how to take precursor mass errors into account for accurate FDR estimation and identification. Another way to resolve this FDR issue is simply to remove all precursor mass errors — going back, again, to the original deconvolution problem. This may be too dreamy, but we are trying to improve deconvolution quality using deep learning, to reduce and hopefully remove deconvolution errors. To measure deconvolution quality with deep learning, we encode the observed peaks so that they can be used in RNN- and CNN-based classifiers.

Let's revisit this signal from a proteoform. These peaks of different charges are moved into a three-dimensional space with mass, charge, and intensity axes. The first blue peak moves like this: it ends up here, corresponding to the monoisotopic mass, and the others follow one by one. The peaks of the next charge appear on the next charge axis, and the purple peaks on the one after that. Now we encode these peaks into a sequence of tokens: the blue peaks form the first token, the green peaks the second, and the purple peaks the third. Note that the tokens contain similar signal shapes. When the signal changes — for example, the isotope patterns change and there are more charge states — we get more tokens with different signal shapes. For very small masses, the sequences become very short. Lastly, for noise, we usually have very low correlation between tokens. So overall, the inputs have variable lengths, the signal has high correlation between tokens, and noise has low correlation — so an RNN might work very well for this kind of input. The second way to look at the signal is simply to view it from above; then the signal looks like this, with a checkerboard pattern. If the signal changes, we get a differently sized checkerboard, and for smaller masses the checkerboard is very small. Noise, lastly, produces rather scattered dots. So overall, the target image size varies and signal and noise have different patterns — so a CNN also looks like a good method for this problem. A CNN is in fact already used for deconvolution in the TopPIC suite, under the name EnvCNN.

For training, validation, and testing, we prepared semi-simulated datasets, with three classes instead of two. Correct-class masses are obtained by collecting deconvolved MS1 masses with very high signal-to-noise ratios; the noise masses are obtained by shifting the signals of the correct masses within the raw spectrum; and lastly we also have a mass-artifact class that represents charge assignment errors. These are the RNN results. We used three different RNN models — a simple RNN, an LSTM, and a GRU — and ran each three times with different data splits; all showed very high accuracy, exceeding 0.95. The LSTM showed the best performance, followed by the GRU (a small sketch of this kind of classifier follows below). One interesting observation is that the RNNs work much better for larger masses.
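As an illustration of the kind of model described here, below is a minimal sketch of a three-class LSTM classifier over variable-length token sequences. The feature dimensions, names, and toy data are hypothetical; this is not the actual model used in the talk.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class EnvelopeClassifier(nn.Module):
    """Classify a sequence of per-charge 'tokens' as correct / noise / artifact."""
    def __init__(self, token_dim=32, hidden_dim=64, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(token_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, tokens, lengths):
        # tokens: (batch, max_len, token_dim); lengths: true sequence lengths
        packed = pack_padded_sequence(tokens, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)        # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])              # class logits per sequence

# Toy usage: 4 envelopes, up to 10 charge-state tokens of 32 binned intensities each.
model = EnvelopeClassifier()
tokens = torch.randn(4, 10, 32)
lengths = torch.tensor([10, 7, 3, 5])
logits = model(tokens, lengths)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 0]))
loss.backward()
```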
That the RNNs do better for larger masses makes sense, because larger masses often give longer input sequences, and an RNN can effectively exploit the information in longer sequences. We then tested a simple CNN model. For the CNN, we tested two maximum charge ranges, 10 and 50, and the results from the CNN were actually better than from the RNNs: the average accuracy was almost 0.97 when 50 charges were used. The CNN also worked better with more charge states, but the difference between the narrow and wide charge ranges was not that large. We also tried to take advantage of other people's work. One relatively easy way to do so is the so-called transfer learning technique. We took an architecture called ResNet, which has been trained to classify pictures from ImageNet — a huge image database containing more than 14 million images — and trained it on our training dataset with fine-tuning of the architecture. When we evaluated the performance against the simple CNN, we always observed a small performance boost, in every single test. This shows that pre-trained architectures like ResNet can help improve deconvolution quality filtering, because their ability to detect elementary shapes is very advanced. Even though this is still a preliminary result, we are quite excited to see that information from pictures of, for example, cats and dogs can actually boost MS data analysis. And if we can do this with ordinary pictures, then we could, for instance, use information from bottom-up data for top-down: pre-train on abundant bottom-up data and use transfer learning for top-down, where we suffer from a lack of data. That is my hope.

Here we show a few examples of correctly classified masses. They are all precursor masses deconvolved by TopFD and identified by TopPIC. Blue bars are correctly positioned peaks and red bars are incorrectly positioned peaks; the peaks are binned and the intensities within each bin are summed. The blue bars form nice isotope-pattern shapes. And these are the noise-class masses: we see lots of red noise peaks, and the blue peaks do not form good shapes. Note that these are still identified precursor masses at 1% FDR by TopFD and TopPIC. And this is another noise-class example. It looks like a perfect shape, so it seems to be a misclassification — but we found that the peaks are actually mislocated by two Daltons. So we observed that deep learning can even find small monoisotopic mass errors quite easily. Lastly, we revisited the overlap analysis from before. The proteoform-level overlap coefficient for the same MS2 spectra was only 49%. We post-processed the precursors using our deep learning filter, and then we had 82% proteoform overlap. This is very nice — although we also lost 18% of the overlapping proteoforms. But considering that 72% of the Xtract and 48% of the FLASHDeconv proteoforms were filtered out, this result still does not look so bad.

So, to sum up: in the top-down field, many software tools have been introduced for different aspects of data analysis. Beyond single tools, we have commercial and free software suites for more comprehensive analysis of top-down data, including ProSight PC/PD, the TopPIC suite, MASH Explorer, and MetaMorpheus. We contributed our recently developed FLASH tools, and hopefully in the near future we can release our FLASH software suite as well. With that, I would like to thank my colleagues in the Kohlbacher lab and all the collaborators.
I also thank all the funding sources, and I deeply thank the audience for your attention. Thank you very much. Now we will take any questions. Thank you.

Thank you, Kyowon. As Lennart mentioned, we'll handle the questions at the end of both talks in this first session, but I urge anybody who has a question to put it in the chat now, before you forget it. We have now looked at the large scale — intact proteins — and we'll move on to the smaller scale of immunopeptidomics, with a talk by Arthur, who is a PhD student in Lennart Martens' group. He will talk to us about MS2Rescore for improved immunopeptidomics. So Arthur, when you're ready, just go ahead.

Yes, hello everyone — let me just grab my pointer before we start. My name is Arthur, I'm a PhD student in the CompOmics lab, and today I will talk about MS2Rescore and its application to immunopeptide identification: how we can get much more out of immunopeptide identification by using machine learning and deep learning tools. To give you a quick overview of what I'm going to talk about today: first I'll start with the specific challenges in immunopeptidomics, so we know why we need new tools to raise the identification rates. Then I'll talk about leveraging the peak intensity predictions from MS2PIP and the retention time predictions from DeepLC, and how we use those in MS2Rescore. I'll then show the applications of MS2Rescore and its dramatic improvements in immunopeptidomics, go into a more detailed analysis of its effects, and finish with some generic peptidomics applications.

So, to start off, the specific challenges in immunopeptidomics. What we want to do in immunopeptidomics is identify the immunopeptides presented by MHC molecules — these are the peptides shown here on the slide. First we extract the MHC molecules with the peptides still attached, then we elute these peptides and measure them by mass spectrometry. But when we try to identify these peptides with standard search engines, we have to take into account that these peptides are basically non-tryptic. Previously, what we would normally do is a tryptic digest, so we ended up with only tryptic peptides in our search space; immunopeptides, however, are basically non-specific, or non-tryptic. Furthermore, we have to take into account the variable length of these peptides: HLA class I immunopeptides range from 8 to 12 amino acids, but HLA class II peptides get much longer, from 12 to 26. To give you an example of what this does to the search space: if you take a protein of 1,000 amino acids and do a tryptic digest, you get about 115 tryptic peptides in your search space. If you do a non-specific digestion of that same 1,000-amino-acid protein, you end up with 991 non-tryptic peptides — and that is only for one specific length, peptides of 9 amino acids. If you then account for the variable length of immunopeptides, from 8 to 12 amino acids, you end up with a search space of around 15,000 peptides. And if you did the same for HLA class II peptides, it would get even bigger.
The result of this increased search space is that we ultimately end up with less confident matches of our peptides against the spectra measured by LC-MS, and so, with FDR control, we lose a lot of our identifications. With basic or standard search engines we therefore end up with far fewer identifications, or PSMs, in the end. So how do we improve the identification rate in immunopeptidomics? We use the machine learning and deep learning tools that were presented in the earlier workshop, MS2PIP and DeepLC, to provide peak intensity predictions and retention time predictions. And what does MS2Rescore do? MS2Rescore is a post-processing tool — which is very important — that takes all of the peptide-spectrum matches from the search engine you use and then calculates three sets of features for all of the PSMs. First, the search engine features: these are the standard features that the search engine itself uses to score PSMs. We additionally add DeepLC features based on retention time predictions — for instance, the retention time error — and we also add peak intensity predictions from MS2PIP and calculate a whole range of features from those as well. All of these features are then used to re-score the PSMs identified in our search, to ultimately end up with a lot more identifications.

But before we could do that for immunopeptidomics, we essentially had to retrain MS2PIP specifically for non-tryptic peptides. We saw that DeepLC was quite robust at predicting the retention times of non-tryptic peptides, but for MS2PIP, in the case of peak intensity predictions, we saw that non-tryptic peptides had quite different peak intensities, so we had to train a new model that got much better at predicting these non-tryptic peptides. What you see here are all the models we trained and tested; in yellow is the current tryptic MS2PIP model, and you can see that the newly trained models do a much better job at predicting peak intensities for HLA class I and HLA class II data. What was more striking is that by adding immunopeptides to the training set, we also increased the Pearson correlations between observed and predicted peak intensities for the tryptic data, which was quite satisfying to see: we now have models that generalize much better for both tryptic and non-tryptic peptides. Finally, we also tried this on chymotrypsin-digested data, and there we saw that a model trained solely on immunopeptides was not good at predicting the peak intensities of chymotrypsin-digested peptides. We actually had to add chymotrypsin-digested data to the training data before we could predict peak intensities for chymotrypsin data well. This shows that even though immunopeptides are non-tryptic in essence, there is still quite a big difference with chymotrypsin, which is also considered to generate non-tryptic peptides — so there are big differences even among non-tryptic peptides. We then provide all of these features, as mentioned before, from MS2Rescore to Percolator: in blue we have highlighted the MS2PIP features, in green the DeepLC features, and in yellow the search engine features.
We can see that all of these features get quite a bit of weight. The reason there are so many MS2PIP features is that MS2PIP predicts many more data points, from which we can calculate many more features. But as you can see here, while the retention time prediction is only a single data point — we simply compare the predicted retention time with the observed retention time — it is a very important feature when re-scoring the data. What was also very nice to see is that when we re-score HLA class I peptides, which are fairly restricted in length, the peptide length is also a very important feature for Percolator, which makes sense. If we take a more in-depth look at how Percolator uses these features: on the left we see the search engine score against the retention time error, in the middle the search engine score against the Pearson correlation, and on the right the retention time error against the Pearson correlation. What we see is that by providing all of these different features — and keep in mind that these are only three of the roughly 100 features we provide to Percolator — Percolator is really able to separate the accepted targets from the rejected targets. If you look at the centre plot: previously, if we had used only the search engine score and accepted everything above a certain threshold, we would have accepted a lot of invalid target PSMs. Now that we also provide the Pearson correlation coefficient, we can separate the accepted from the rejected targets much better. And if you look at the distributions of the decoys, the rejected targets, and the accepted targets, you can see very nicely that the distribution of the decoys matches the distribution of the rejected targets very well, while there is a huge difference for the accepted targets — both for the retention time error and for the Pearson correlations. This shows very well why it is such an advantage to provide these features to Percolator for re-scoring: it is able to separate your targets very nicely.

Now, to show you some of the results we got from MS2Rescore post-processing of immunopeptidomics data: on the y-axis you see the number of identified spectra — the higher, the more identifications — and along the x-axis the FDR threshold used, with more lenient thresholds towards the left. If you look at 1% FDR and compare MS2Rescore with standard Percolator re-scoring (re-scoring with only search engine features), you can see that MS2Rescore gives a very nice increase in the number of identified spectra, which was very nice to see. But if you look at 0.1% FDR: previously, even with re-scoring on search engine features only, we were not able to get many identifications out of the data, whereas now we see a huge increase in the confidence we have in the matches as well as a huge increase in the number of identified spectra. And if you look at where we are here, at 0.1% FDR, nearly 80% of the peptides identified at 1% FDR were already identified at 0.1% FDR, which is quite a nice sight to see.
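To make the rescoring idea concrete, here is a minimal sketch of how prediction-based features can be assembled into a Percolator-style feature table. The column names, toy data, and exact feature set are illustrative assumptions, not MS2Rescore's actual implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Toy PSM table: observed values plus MS2PIP-/DeepLC-style predictions.
psms = pd.DataFrame({
    "spec_id": ["scan_101", "scan_102"],
    "peptide": ["KLSDEFVNR", "GILGFVFTL"],
    "search_engine_score": [32.5, 18.1],
    "observed_rt": [41.2, 55.8],          # minutes
    "predicted_rt": [40.6, 61.3],         # e.g. from a DeepLC-like model
    "observed_intensities": [np.array([0.1, 0.7, 0.9, 0.3]),
                             np.array([0.8, 0.2, 0.1, 0.6])],
    "predicted_intensities": [np.array([0.2, 0.6, 0.8, 0.4]),
                              np.array([0.1, 0.7, 0.9, 0.2])],
})

# Rescoring features: retention-time error and observed/predicted spectrum correlation.
psms["rt_error"] = (psms["observed_rt"] - psms["predicted_rt"]).abs()
psms["intensity_pearson"] = [
    pearsonr(obs, pred)[0]
    for obs, pred in zip(psms["observed_intensities"], psms["predicted_intensities"])
]

# Write a simplified, Percolator-style tab-separated feature table.
features = psms[["spec_id", "search_engine_score", "rt_error",
                 "intensity_pearson", "peptide"]]
features.to_csv("rescoring_features.tsv", sep="\t", index=False)
```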
If you then look at the uniquely identified peptides — the most important thing in immunopeptidomics, because you want to find new immunopeptide sequences — then again, compared with re-scoring on search engine features only, we get a very nice increase of around 40% in uniquely identified peptides at 1% FDR, and nearly a 300% increase in identified peptides at 0.1% FDR. So we really increase the specificity we have in these samples and identify a lot more peptides. We now offer the possibility not only of more identifications at 1% FDR, but also of using 0.1% FDR if you want to be more sure of the identifications you have.

To look at what the gained peptides look like: in immunopeptidomics, for a given HLA allele we really see specific binding motifs in the uniquely identified peptides. On the left-hand side is the motif reported in the original paper for the HLA-C*12:03 allele. When we look at the peptides gained by MS2Rescore — that is, the peptides that MS2Rescore identified over and above search engine re-scoring — we see that this motif is nicely represented by all of the gained peptides. And when we look at the lost peptides, the ones that MS2Rescore throws away, we can see that these show essentially much more random patterns, and the information content (in bits) of these motifs is much lower. So essentially, what we throw away is more likely to be false positives, and we gain a lot of nice peptides that really fit the HLA motif.

We also did this for PEAKS, a state-of-the-art search engine for immunopeptidomics. Again, we see a nice increase in identifications at 1% FDR, although slightly lower than, for instance, when MS2Rescore was applied to MaxQuant data — and that is because PEAKS is already much better suited to immunopeptidomics. But when we look at 0.1% FDR, we again see a very nice, roughly twofold increase in the number of identifications compared with before re-scoring. So re-scoring really can get a lot more out of your data.

To show you a little more of what MS2Rescore is capable of beyond raw identification numbers: we see that for all of the HLA alleles, MS2Rescore delivers consistent improvements in identification rate. What is very nice to see is that for HLA alleles for which we previously had no identifications at all — for instance this one at the bottom, where in blue you see the results before re-scoring, so basically what comes out of the search engine — applying MS2Rescore gets us to nearly a 10% identification rate at 1% FDR, where we previously had no identifications. So even for the harder-to-identify HLA alleles, we can now really start to map the binding motifs and get a better overview of what is in there. We did the same for different collision energies: we split up the identification rates and the gains of MS2Rescore by collision energy, and what we saw is that the overall identification rate dropped when suboptimal collision energy values were used — in this case 35 for immunopeptides — but the gain from using MS2Rescore is the biggest there, nearly 60%.
This is explained a little better at the bottom left here: with higher collision energies we get a lower explained ion current, which means that our spectra become a little worse in a sense. MS2Rescore accounts for this — it learns that the spectra acquired at a collision energy of 35 are less informative and shifts its weight more and more towards the DeepLC features; you can see that the weights of the DeepLC features, here in green, get much bigger. So by shifting its weights, MS2Rescore can still get a lot out of your data even when circumstances are suboptimal, in this case for the collision energy values, and this is a really nice way of recovering a lot of data that would previously have been lost. The same is true for abundance. We looked at the MS1 precursor peak intensities and divided them into 10 bins. For identifications with high MS1 precursor intensities, the gains were not that big, although still nice; the biggest gains were actually seen for the low-abundance peptides — peptides that previously would not have been identified. So here, again, we can recover a lot of the identifications that would otherwise have been lost because of these low-abundance MS1 precursors. MS2Rescore can really recover these peptides from the data as well, and shows the biggest gains in these suboptimal ranges and suboptimal circumstances.

Then, my final points about MS2Rescore. MS2Rescore is not limited to immunopeptidomics: with the new models, which, as I showed you in the beginning, were also trained on chymotrypsin data (another non-tryptic enzyme), we applied MS2Rescore to generic peptidomics data as well. Here again we saw nice increases in the number of identified spectra, for Arabidopsis thaliana biopeptides, which are also non-tryptic, and even for the human urine peptidome — we still got increases of 60 to 150 percent in PSMs at 1% FDR. And the most striking result is that almost 90 percent of the data identified at 1% FDR was already identified at 0.1% FDR, again showing the huge gain in sensitivity and specificity we get by re-scoring with the peak intensity predictions from MS2PIP and the retention time predictions from DeepLC. So, to conclude: MS2Rescore can substantially increase the number of identifications in immunopeptidomics; it enables the use of much stricter FDR thresholds; it can really get the most out of your data, even under suboptimal conditions, and can recover low-abundance peptides as well as somewhat poorer spectra; and it is not limited to immunopeptidomics but can be extended to peptidomics data as well, with really nice gains for those datasets too. And with that I conclude my talk about MS2Rescore, and I would like to thank everyone in the group who has helped with this very nice tool. Thank you.

Thank you, Arthur. You have almost 10 minutes left, if there's more you want to say. — Yes, I talked a bit too fast. — There are not many questions yet, so if you have any questions, please be sure to put them in the chat. There is one question that has already been answered in the chat, but I think we can repeat it for the people who haven't been following the chat. And that one is for Kyowon.
It's from Animesh: how clean does the sample have to be for FLASHDeconv to work? He wonders whether it can pick up antibodies from serum directly.

I think this is really a question about the dynamic range of the data itself. If there are peaks, FLASHDeconv — or other tools, for that matter — can probably pick up the signal. But if the antibody is at such low abundance that the dynamic range is something like 10^5, then there is no way any deconvolution tool can pick the signal up. If that's not the case, maybe we can try. Usually when people analyze antibodies they have very pure samples, right? So for now I cannot answer that before I actually see the dataset, but if the complexity is just about the number of peaks, then FLASHDeconv can pick up the signal quite efficiently.

Thank you. And now there's also a new question here for Arthur: "I missed the workshop this afternoon — how do you train the software for immunopeptidomics? Do you use the peptides published in the literature?"

Yes, we used publicly available immunopeptidomics datasets to retrain the MS2PIP models, so that we had better peak intensity predictions for immunopeptides. We also looked at DeepLC for the retention time predictions, which already did quite a good job at predicting non-tryptic peptides, so we did not have to train any additional models for DeepLC. MS2Rescore then basically just provides all of these features to Percolator and re-scores everything.

Thank you. I don't see any other questions. Anything I missed, Lennart? — No, only people saying thank you for your presentation, which is also good. — We are way ahead of schedule, though. So what do we do now, Lennart? — I propose that we simply continue. — I think that is a good idea. On the other hand, since we're ahead of schedule and have been at it for an hour now, maybe it's not a bad idea to have a short bio break of about five minutes so that people can get a drink or visit the bathroom for a moment. I propose that we'll be back at 25 past four and then continue. A short break before we continue.

So, welcome back, everyone. We'll continue this webinar with the last two presentations, which are actually nicely matched: they both deal with the large-scale, if you like, interpretation and re-use of proteomics data. Our two speakers are Matthias Wilhelm from the Technical University of Munich and Juan Antonio Vizcaíno from EMBL-EBI, and both will talk about interesting resources and ways of re-using data. The first of these will be Matthias — Matthias, you can start sharing your screen meanwhile. Yeah, super, there we go. He will talk about ProteomicsDB and the Prosit tool suite that have been developed at the Technical University of Munich, and how these both disseminate and build on FAIR data in various ways. So Matthias, without further ado, I'll give the word to you.

Thank you for the introduction, and thanks for the invitation — I'm happy to be here. As introduced, I'm intending to talk a bit about FAIR data and how FAIR data fosters proteomics research, and to exemplify this with ProteomicsDB and Prosit, two bigger projects we have been working on over the past years. For ProteomicsDB in particular — it somewhat, not necessarily painfully, but it did open my eyes when I made these slides a couple of days ago — the development of ProteomicsDB is by now ten years old, essentially.
So about ten years ago we started developing ProteomicsDB, which is an in-memory database for hosting, initially, just large amounts of proteomics data. We used it initially as a mechanism to disseminate data we had acquired in the context of figuring out which proteins are likely present in which organ of the human body. Since then, plenty of different research projects have happened in ProteomicsDB, and we have disseminated different tools and analyses with it. ProteomicsDB typically follows two streams: one where we focus largely on data processing and dissemination, and another where we focus on data integration and utilization, and I will go into both aspects throughout this talk. Briefly, ProteomicsDB started in 2012 with this project of figuring out which proteins may be expressed, in which quantities, in which parts of the human body. We then developed different tool chains, looking into protein FDR but also into how we can extract the most information out of expression profiles. That was almost always coupled with extensions on the data integration and utilization side, where we integrated different data streams into ProteomicsDB, looking for example at the target spaces of drugs, here on the bottom left-hand side, or at how proteins are expressed in other organisms.

All of this would not have been possible without prior research; ProteomicsDB in particular is built on the shoulders of giants. Without access to FAIR data, and FAIR software in the same regard, ProteomicsDB and similar efforts would not have been possible. We rely heavily on the availability of standards, ontologies, and annotations, largely collected in knowledge bases such as UniProt and ChEMBL, but also Gene Ontology, KEGG, and STRING, which we really rely on to understand our proteomics data. We also benefited hugely from the efforts of the ProteomeXchange consortium, particularly PRIDE, which by now provides access to, I guess, petabytes of proteomics data, and from similar efforts in transcriptomics and phenomics. All of this is obviously also coupled with the availability of open-source, or at least free-to-use, software such as MaxQuant, which we used heavily for processing data and bringing it into ProteomicsDB, accompanied by standards for how we represent peptides, how we represent modifications, and so on. So without the work of many, many people in this realm, an effort like ProteomicsDB would really not have been possible.

And today, because of that, ProteomicsDB sits on quite an extensive amount of data. It covers on the order of 300 human tissues, fluids, and cell lines. By now we have also extended ProteomicsDB to present not only data of human origin: it also covers about 40 Arabidopsis tissues and cell lines and about 40 mouse tissues and cell lines, and soon there will also be a rice proteome available in ProteomicsDB. In the beginning we focused largely on pure protein expression data across those tissues, cell lines, and body fluids, but in the meantime we also store other information about proteins, such as how they behave when you try to melt them — this is in the context of CETSA experiments. ProteomicsDB also supports the storage of turnover measurements: how quickly proteins are synthesized or degraded in a system.
That is also particularly interesting these days for developing novel drugs such as PROTACs. We also store a large amount of dose-response data, measuring precisely at which concentration of a drug a cell line will or will not respond, and that obviously also comes with a lot of protein expression and transcript expression data. The database schema depicted on the right-hand side has spun a bit out of control over the years, and I guess that's a particular challenge in research projects like this, where something starts as a PhD project and grows over time — you add a module here, a module there, a new feature there — so at some point it will take quite a bit of time to homogenize it again. But you can see from the database schema that we cover quite an extensive range of data in ProteomicsDB. ProteomicsDB today is still hosted by a group at TU Munich, the group of Professor Krcmar, and it runs on SAP HANA database technology, an in-memory database. The data we provide access to in ProteomicsDB sits on a server with six terabytes of main memory — there are actually two of these servers by now — with continuous integration testing, some development infrastructure and machines behind it, and an auxiliary compute node that allows us to perform predictions using GPUs or CPUs.

We have used ProteomicsDB largely to disseminate our own data, or data acquired in the context of my previous employment at Professor Bernhard Küster's chair. On top of this data layer sits a presentation layer, where we try to provide access to the data through useful visualizations for the scientific community. So while benefiting from efforts such as PRIDE to make the raw data available, we have built extensive layers on top that allow users to investigate the data in detail, combine data from different streams, and validate or come up with new hypotheses for their research. Today ProteomicsDB supports proteomics data, transcriptomics data, and the phenomics data we briefly talked about earlier. We also integrate data from target identification measurements — experiments aimed at figuring out which drugs interact with which proteins — and metadata in the form of, for example, protein-protein interaction maps. Because of that, in ProteomicsDB you can essentially explore all of these data in real time. You can, for example, start from the expression of a protein, depicted here with the body map where the superimposed color scheme shows where your protein of interest is expressed most or least, and go straight to potential drugs interacting with this protein. You can also go to the phenotypic response of cell lines when treated with those drugs, look at expression patterns of multiple proteins in the form of a heat map visualization, look at the target spaces of multiple drugs combined, and dig down into the data of essentially every single spectrum. To go into a bit more detail about what is possible, let's start with the very low-level information: we can dig down to every single spectrum stored in ProteomicsDB, essentially.
There you get a visualization like this mirror plot, which you have likely seen in an earlier presentation already: at the top you see an experimental spectrum, at the bottom a reference spectrum. That reference spectrum may originate from synthetic peptides, but it may also come from prediction tools; in our case it comes from Prosit. You can look at every single identification and judge whether the protein you are interested in is confidently supported by the identification of a peptide that maps uniquely to it. ProteomicsDB supports spectrum annotation for various fragmentation techniques, and the addition of reference spectra, of which we store on the order of a hundred million, lets you get an intuitive feeling for whether an identification looks correct, simply by checking whether the top spectrum matches the bottom one, even without extensive prior experience of how mass spectra should look. The visualization we use to show both proteomics and transcriptomics expression data for a single protein is the body map mentioned earlier, which superimposes the expression information on a body map using a color scheme indicating in which tissue, or in which cell line mapped to its tissue of origin, the protein is expressed most. This is all stored in an RDF-like data model, which also allows us to easily combine proteomics and transcriptomics expression data; I will come back to this on a later slide. What you see here is expression data from various cell lines, and obviously these cell lines were not all measured by a single lab. What we do in ProteomicsDB is look out for interesting publications in a variety of journals, and if the data has been made available, we go to PRIDE, reprocess it through a homogeneous pipeline, and make it available in ProteomicsDB. So we benefit a lot from the efforts of PRIDE and ProteomeXchange to make such data available, and also from authors annotating which raw files belong to which cell line, so that in the end we can map which protein belongs to which cell line. You can also look in ProteomicsDB not only at the expression of a single protein but at multiple proteins. Since everything is combined in one common database schema, we can easily ask whether a particular expression pattern of multiple proteins is visible when investigated across a larger number of cell lines or tissues. What you see here is a classic expression heat map of the PSMA and PSMB proteins, which make up the proteasome, with the different tissues and cell lines plotted along the x-axis. One can nicely see that different varieties of the proteasome are expressed: the constitutive proteasome is present in the tissues and cell lines not associated with the immune system, while in cell lines known to be correlated with the immune system three particular subunits are exchanged for their induced versions, indicating that these cell lines are likely very active in generating MHC peptides.
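As an aside, the agreement that a mirror plot like the one described above conveys visually can also be scored numerically. Below is a minimal sketch of a normalized spectral contrast angle between an experimental and a reference (for example predicted) spectrum, simplified to matched peaks only; the peak lists, the matching tolerance and the simplification are my own illustrative assumptions, not the exact scoring used in ProteomicsDB.

```python
# Sketch: match fragment peaks within a tolerance, then compute a normalized
# spectral contrast angle (1 = identical, 0 = orthogonal). Peak lists invented.
import numpy as np

def spectral_angle(mz_a, int_a, mz_b, int_b, tol=0.02):
    mz_a, int_a = np.asarray(mz_a, float), np.asarray(int_a, float)
    mz_b, int_b = np.asarray(mz_b, float), np.asarray(int_b, float)
    matched_a, matched_b, used_b = [], [], set()
    for m, i in zip(mz_a, int_a):
        j = int(np.argmin(np.abs(mz_b - m)))        # closest reference peak
        if abs(mz_b[j] - m) <= tol and j not in used_b:
            matched_a.append(i)
            matched_b.append(int_b[j])
            used_b.add(j)
    if not matched_a:
        return 0.0
    a = np.asarray(matched_a) / np.linalg.norm(matched_a)
    b = np.asarray(matched_b) / np.linalg.norm(matched_b)
    return 1.0 - 2.0 * np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)) / np.pi

# Example with made-up peak lists:
print(spectral_angle([175.1, 276.2, 389.3], [0.4, 1.0, 0.7],
                     [175.1, 276.2, 389.3], [0.5, 1.0, 0.6]))
```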
As mentioned, we also use ProteomicsDB for a lot of data dissemination from the bigger projects done in the group of Professor Kuster. We store information about Kinobeads assays, where the binding of drugs to kinases is measured, and we keep on the order of 10^6 of these dose-response measurements in ProteomicsDB, where again you can dig down to every single measurement and see exactly at which drug concentration you reach half-maximal binding of a protein. We also support the storage and visualization of CETSA data: not only the baseline melting data, i.e. at which temperature a protein is likely to melt, but also its melting behavior with versus without a drug present in the same system. The difference between the two curves is an indication of whether binding occurs between drug and protein, and may also give a rough indication of potency. And we store the protein turnover data mentioned earlier, again down to every single measurement. This is useful when investigating whether a protein might be an interesting target for a drug designed to specifically degrade it: for such a drug you may be interested in proteins with a high turnover, that is, a rather short degradation time, because then the chances of an active PROTAC are higher. All of this data can also be investigated from a different angle. So far we looked from the perspective of a protein; we can also look at the data from the perspective of a drug. Here the violin plots show the binding profiles of the targets of two different drugs, imatinib and barfitinib, with the -log10 EC50 plotted on the y-axis, which indicates for a particular protein of interest whether this is a useful drug to inhibit that protein or not; the number shown indicates how many target proteins were measured. As we always try in ProteomicsDB, data dissemination means providing access at whatever level of granularity you are most interested in, so you can dig down from these high-level representations to the binding curves and to the individual binding measurements, shown as black dots, behind every single curve. We have also made use of the availability of large phenotypic screens: we integrated data from the big cell sensitivity screens made available by the CCLE and by the Sanger Institute. Similar to the data we reuse from PRIDE, all of it was processed with the same pipeline, to make sure that model fitting and so on is done in exactly the same way. In ProteomicsDB you can then select a cell sensitivity screen of interest and investigate, for a particular cell line or a particular drug, whether a cell line or drug stands out with respect to sensitivity or resistance. If you are interested, for example, in a cell line that shows no response to a certain drug you are studying in your PhD thesis, and you are interested in resistance mechanisms, this parallel-coordinates view shows the concentration-dependent behavior of the cell lines, and the non-responding ones might be good candidates for figuring out potential resistance mechanisms of that drug.
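Both the Kinobeads binding measurements and the viability screens discussed above come down to fitting dose-response curves. The snippet below is a minimal sketch of fitting a generic four-parameter logistic (Hill) model and reading off an EC50 with SciPy; the doses, responses and starting values are invented, and this is the generic model family rather than the exact fitting procedure used behind ProteomicsDB.

```python
# Sketch: four-parameter logistic fit to an invented dose-response series.
import numpy as np
from scipy.optimize import curve_fit

def four_pl(dose, top, bottom, ec50, slope):
    """Four-parameter logistic model on a linear dose axis."""
    return bottom + (top - bottom) / (1.0 + (dose / ec50) ** slope)

doses = np.array([1, 3, 10, 30, 100, 300, 1000, 3000])                  # nM, example
response = np.array([0.98, 0.95, 0.90, 0.75, 0.48, 0.22, 0.08, 0.04])   # relative binding

params, _ = curve_fit(four_pl, doses, response,
                      p0=[1.0, 0.0, 100.0, 1.0], maxfev=10000)
top, bottom, ec50, slope = params
print(f"EC50 ~ {ec50:.1f} nM, Hill slope ~ {slope:.2f}")
```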
Vice versa, you can also use this view to find cell lines that are very sensitive to the drug, for example if you would like to turn such a line into a resistant one by continuously growing the cells with low doses of that drug. Similar to before, we start with very high-level selections: we may select the GDSC data set and imatinib as the drug of interest, and every line here then depicts one single cell line. One can select the cell lines that show a high relative effect and a good model fit, indicated by the R² and the relative effect, then inspect every model fit manually and judge how confident it is by looking at every single underlying measurement. Something we believe is rather unique is that this visualization is, as far as we are aware, one of the only ones on the web these days that gives this kind of access to these large cell viability screens; I do not know of any other tool that lets you dig this deep and specifically filter for cell lines or drugs of interest in these phenomics data sets. The main goal of ProteomicsDB is then to integrate these different streams of information into higher-level features, and I have depicted a couple of those here. If we integrate, for example, the proteomics and the transcriptomics data, we can use that for an alternative missing-value imputation strategy, and this is what we have done. What we and others have shown before is that a fairly large number of proteins show a good, protein-specific correlation between mRNA and protein levels. This allows us to use the transcriptomics data to impute missing proteomics expression values based on the mRNA profile. We compared this to a variety of other imputation methods: random sampling, which just picks a random protein expression value from the distribution of the remaining proteins; minimum of the sample distribution, which uses the minimum protein expression observed for a particular sample as the imputed value; and the mRNA-guided version. When we do this, the mRNA-guided missing-value imputation actually performs best, giving the lowest mean absolute error on a set of proteins for which we know the expression in a cell line. We can also integrate protein expression profiles with cell viability data and use this to fit elastic net regression models, one per drug and omics type, which allows us to investigate in more detail which potential biomarkers exist for predicting whether a cell line is likely to be sensitive or resistant to a given drug. The elastic net regression performs feature selection in the background: it tries to find a small, meaningful subset of proteins that are predictive of, for example, the AUC of a cell viability screen. Here one sees, for example, that for cetuximab one particular protein stands out as a likely marker of resistance. Ultimately, and we are on the verge of expanding this much further, we plan to let users upload their own data and apply the learned models to it, as a feature that may predict which drugs are likely to be effective for that cell line and which are not.
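As a rough illustration of the elastic-net idea just described, here is a minimal sketch with scikit-learn: protein expression as the feature matrix, a viability-screen summary value such as the AUC as the target, and the non-zero coefficients as candidate sensitivity or resistance markers. The data is randomly generated and the hyperparameters are arbitrary; this is not the actual ProteomicsDB model.

```python
# Sketch: one elastic-net model per drug, proteins as features, screen AUC as target.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_cell_lines, n_proteins = 120, 500
X = rng.normal(size=(n_cell_lines, n_proteins))            # protein expression matrix
true_markers = [3, 42, 117]                                 # pretend biomarkers
y = X[:, true_markers] @ np.array([0.8, -0.6, 0.5]) + rng.normal(scale=0.2, size=n_cell_lines)

X_scaled = StandardScaler().fit_transform(X)
model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5, random_state=0).fit(X_scaled, y)

selected = np.flatnonzero(model.coef_)                      # candidate markers
print("proteins with non-zero coefficients:", selected[:10])
```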
This could potentially be expanded further and may open up avenues towards personalized medicine in the future. One can also combine protein expression profiles and cell viability data in a different way. Earlier we were using the elastic net models; what one can also do is build alternative ontologies for cell lines. In ProteomicsDB we make heavy use of ontologies that allow us to connect, for example, cell lines to organs. However, just because a cell line originates from a liver does not necessarily mean that it also behaves like a liver cell. So we can build ontologies of how cell lines and tissues are connected to each other based on their molecular profiles: we use the expression data, and the way cell lines behave when drugs are applied, to build molecular-fingerprint-driven ontologies for cell lines. We can then also use these to suggest potential treatments for new cell lines, because they may fall into a sub-branch of the tree whose members are all rather sensitive to a particular drug of interest. So today ProteomicsDB has really expanded from just human proteomics to a large range of different omics types, with different capabilities for integrating them, and that is largely the result of previous efforts to make all of this data available. A single lab would never have been able to do all of these studies itself and build a resource covering that many organisms and omics types; without access to data, this would simply not have been possible. Over roughly the last year we have also spent a lot of time and effort on re-implementing ProteomicsDB and modernizing its user interface. Given its ten years of continuous development, the initial user interface looks a bit outdated these days, so we have essentially re-implemented all of it in a more modern framework. We use Vue.js for this, which also allows us to prototype new applications much more rapidly. This is what the new interface of ProteomicsDB looks like today: a much cleaner and, we believe, more modern-feeling user interface, which also allowed us to integrate software developed, for example, by UniProt. What you see here was formerly known as the ProtVista viewer and is these days available under the name Nightingale. It shows the protein sequence along the x-axis, and on tracks below it you find annotations, for example which parts of the sequence we have peptide-level evidence for, or which parts are annotated as known domains. On top of this there is a really nice integration of a 3D visualization of the protein, connected to the feature viewer above, which allows you to see, for example, where a domain is located in the 3D structure, or where a particular modification or mutation sits, to figure out whether, say, a phosphorylation may actually have an effect on the activity of the protein. The newer version also integrates AlphaFold2 predictions, so this viewer works for pretty much any human protein of interest, and essentially for all proteins covered by that recently released resource.
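Coming back to the molecular-fingerprint ontologies mentioned earlier in this part of the talk: at its core that idea is hierarchical clustering of cell lines on their expression or drug-response profiles. The sketch below shows one plausible way to do this with SciPy on random data; the distance metric, linkage method and number of clusters are arbitrary illustrative choices, not the ones used in ProteomicsDB.

```python
# Sketch: cluster cell lines on correlation distance between molecular profiles.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
cell_lines = [f"CL{i:02d}" for i in range(12)]
profiles = rng.normal(size=(12, 300))             # e.g. protein expression vectors

dist = pdist(profiles, metric="correlation")      # 1 - Pearson correlation
tree = linkage(dist, method="average")

# Cut the tree into groups; lines in one group share a molecular fingerprint.
groups = fcluster(tree, t=4, criterion="maxclust")
for name, g in zip(cell_lines, groups):
    print(name, "-> cluster", g)
# scipy.cluster.hierarchy.dendrogram(tree, labels=cell_lines) would draw the tree.
```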
We tested this recently on a rice protein, and even there the feature viewer lets you investigate the protein structure as a 3D model. In this process we also completely revamped the spectrum viewer, which now pulls Prosit predictions in real time: for any peptide present in ProteomicsDB it will fetch a prediction from Prosit and display it straight away. Because of these efforts we have also started to open ProteomicsDB up further: the re-implementation is available, and you will find the source code of the Vue interface under this link. We also revamped our API, which now provides access to essentially all the data stored in ProteomicsDB, including the transcriptomics and phenomics data, which was not possible before. And, as mentioned, we benefit heavily from the availability of prior resources, which allows us to integrate ProteomicsDB with other sources of information. One other aspect where we benefit from PRIDE is a project called ProteomeTools, in which we synthesized a large number of synthetic peptides, essentially with the goal of representing the human proteome. All of this data is available in PRIDE, and according to the 2020 statistics it actually turned out to be the most downloaded project in PRIDE; some of the other data sets we have made available through ProteomicsDB also appear high on that list. We are very happy to see this, and again, without efforts such as PRIDE, which the following talk will certainly cover in more detail, this would not have been possible. We also benefit from the availability of public data for the development of our deep learning framework, Prosit. In the past we spent a considerable amount of effort developing a prediction tool that can predict at which retention time a peptide is likely to be observed in a measurement and what its fragmentation spectrum will look like. We did this in the past for unlabeled peptides; here I want to briefly highlight Prosit TMT and the benefits of public data for TMT-labeled peptides. We recently extended the ProteomeTools resource with the TMT-labeled equivalents of the synthetic peptides we had made available earlier, and retrained our Prosit model on these TMT-labeled peptides. A slight side note: this is a single model that supports both HCD and CID fragmentation, and as we see, the prediction accuracy is quite high. What we were interested in is whether this model, which was not trained on them, is also applicable to iTRAQ-labeled or TMTpro-labeled peptides. Given that the ProteomeTools project is in fact already over, so we no longer have funding for it, we benefited from going into PRIDE and looking for interesting data sets that used iTRAQ or TMTpro labeling. We pulled out this data and compared the predictions made by Prosit TMT against it, and somewhat surprisingly, Prosit TMT actually works quite well for iTRAQ-labeled and also TMTpro-labeled peptides.
On the left-hand side you see two mirror plot examples, for iTRAQ 8-plex and TMT 18-plex. Yes, the intensities do not match perfectly, which is somewhat expected given that we trained on TMT classic, that is, the TMT 6/10/11-plex labels, but the correlation is actually still quite good, and the retention time prediction also works surprisingly well with Prosit TMT. We then asked: if we can predict these peptides decently well, can we also improve identification with the rescoring mechanism introduced earlier? And what we see is that this Prosit TMT model works surprisingly well for rescoring iTRAQ- and TMTpro-labeled data. The bar plots show in blue the overlap between the analysis without rescoring and the one with rescoring, and in green the peptides added by rescoring. At the PSM and peptide level we see, for this iTRAQ data set, an increase in the range of 40 to 50 percent, which also has a drastic effect on the number of protein groups identified; note that this is a body fluid, so do not expect very high absolute numbers. We see the same for TMT 18-plex. Without access to the data, validating such hypotheses would not have been possible, so we are very grateful for all these efforts, and also to the authors of the original studies for making their data available. Ultimately there is still work to do, and what we are really interested in is predicting all relevant peptide properties. These days we have good predictors for fragmentation and good predictors for retention time, but we do not yet have really good predictors for which peptides will be observable at all, and we see the first predictors appearing that attempt to predict the relative intensities of peptides. Here, too, we benefit a lot from data being publicly available, because most of these properties can only be learned, particularly with deep learning, given access to large amounts of data. There are still a couple of challenges, and from my perspective, for ProteomicsDB and Prosit, I see two big ones. First, in proteomics we still suffer from a huge number of different protocols: the way proteins are extracted, the way data acquisition is done, and the way data analysis is done are pretty much different for any two data sets you look at in PRIDE, which limits the ability to integrate different data sets in one resource. It feels like, as a community, we should at some point start a more concerted effort to bring down the number of protocols used in the wet lab, to really benefit from having all of this data available. Second, and related, most of the data published these days is not very extensively annotated. What we often run into when trying to integrate data into ProteomicsDB is that we find a publication we think would be a cool data set to integrate, and then we cannot, because there is no information on which raw file belongs to which sample, which sample belongs to which experiment, and what the exact conditions were under which the specimen was generated: which drug was used, at which dosage, and for how long the treatment lasted.
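To make the rescoring idea discussed above a bit more tangible, the snippet below sketches how predicted properties can be turned into extra features per PSM, for example the spectral angle to the predicted spectrum and the deviation from the predicted retention time, before handing the table to a semi-supervised rescorer such as Percolator. The column names and derived features are illustrative assumptions, not the exact feature set used with Prosit.

```python
# Sketch: enrich a PSM table with prediction-derived rescoring features.
import pandas as pd

def add_rescoring_features(psms: pd.DataFrame) -> pd.DataFrame:
    """Expects columns: observed_rt, predicted_rt, spectral_angle, search_score."""
    out = psms.copy()
    out["delta_rt"] = (out["observed_rt"] - out["predicted_rt"]).abs()
    out["sa_x_score"] = out["spectral_angle"] * out["search_score"]
    return out

psms = pd.DataFrame({
    "peptide": ["PEPTIDEK", "ANOTHERPEPTIDER"],
    "search_score": [35.2, 18.7],
    "spectral_angle": [0.92, 0.41],
    "observed_rt": [34.1, 52.3],
    "predicted_rt": [33.8, 60.9],
})
features = add_rescoring_features(psms)
# features.to_csv("psms.pin", sep="\t", index=False) could feed a Percolator-style tool.
print(features)
```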
I suspect this will only be solved if such metadata is enforced upon submission of data sets; especially the wet-lab component, learning its properties and integrating data across it, would really benefit from a common standard. With this I believe I am roughly within the time limit. For most of what I have shown here I am really only the messenger: there are plenty of people working behind this, so I am in the luxurious position of only presenting the results. Here you see a picture of my new group; I have recently started my own group at TUM, I am super grateful for that, super happy to work with all these exceptional people, and also grateful for the funding. Thank you, and there will be time for questions later. Okay, thank you very much, Mathias. We will keep the questions until the end of the session, and we will ask questions of both you and Juan Antonio; there is already one in the chat, but we will keep that for later. So without much further ado, I think it is time to listen to Juan Antonio. We have heard you mention PRIDE several times, Mathias; PRIDE is obviously one of the cornerstones of data dissemination and sharing in proteomics, and it will be interesting to hear what Juan Antonio has to tell us about PRIDE data reuse, including and on top of what we have already seen. So, Juan Antonio, the floor is entirely yours. You should unmute, Juan Antonio, that will help. Sorry. Thank you, Lennart, and thank you especially, Mathias, for providing a really good introduction to many of the topics I want to cover. This is the overview of my talk: I will give a short introduction to PRIDE and ProteomeXchange, then a somewhat educational tour of the types of data reuse that exist, which I think is interesting for those of you who are not that expert in the field, then I will talk about one problem just mentioned by Mathias, improving metadata annotation in public data sets, and finally I will highlight, very briefly, some efforts we are doing locally to disseminate and reuse public proteomics data. So, again, Mathias gave a good introduction: PRIDE is the largest database worldwide for storing mass-spectrometry-based proteomics data sets. PRIDE stores all types of data, including the raw data and the identification and quantification results, all proteomics approaches are supported, and at the moment it contains approximately 26.5 thousand data sets. Because PRIDE is so widely used, it is recognized in Europe as an ELIXIR core data resource. But we are not working in isolation; I wanted to mention that our field is actually quite lucky, because this does not happen in other fields: we are quite collaborative, and together with our colleagues in the US, Japan and China we set up the ProteomeXchange Consortium, to implement standard data submission and dissemination practices across the main proteomics repositories. This started around 10 years ago, and data sharing has become generalized in the field for two reasons: the push by journals and funding agencies towards open science practices, but also because the field has really come to see that these resources are stable and can deal with this amount of data. PRIDE is the ProteomeXchange resource in Europe, and I always include this kind of chart
in my presentations, just to show where ProteomeXchange started about 10 years ago, and how, 10 years later, the number of data sets submitted per month has grown enormously. I think we are more or less stabilizing over the last year, but you can see the number of submitted data sets per year, and at least until 2021 we were still growing: 5,800 data sets were submitted during last year alone, almost 500 per month. PRIDE is of course the world-leading resource, storing more than 83% of all ProteomeXchange data sets. I am not going to go into the details of how we do this, but believe me, it is a very tough job to run the infrastructure needed to deal with this amount of data and to support researchers in the field. In the second part of my presentation I wanted to give, as I said, a somewhat educational lecture on the types of data reuse that exist, and this is really the main point I want to highlight. It was nice to hear the previous presentations, including the last one by Mathias, because one of the reasons PRIDE and ProteomeXchange exist is precisely to enable data reuse, and it is very nice to see that in the last few years data reuse has really increased a lot in the field, for many of the applications I will mention next. A few years ago, and this is where the educational part of my talk starts, Harald, Lennart, others and I wrote a review about data reuse in proteomics. It was written about six years ago now, so some things have evolved, but many of the concepts can still be applied today. We came up at the time with four categories: use, reuse, reprocess and repurpose, and I am going to explain the last three of them, starting with reuse, to highlight the possibilities opened up by having data in the public domain. The first type I want to highlight is what we call reuse, which we defined as the case where the information or data is not only extracted and copied, but used in new experiments with the potential of generating new knowledge. I want to highlight two categories of data reuse in this context: the building of spectral libraries and the benchmarking of tools and software. Many of you are familiar with spectral libraries; a lot of them have been created, for instance by different proteomics repositories and by NIST, and they are considered by many in the field to be the gold standard. This is only possible because there are many data sets in the public domain that enable the construction of these libraries, and of course the popularity of spectral libraries has increased in recent years because they can be used to analyze data-independent acquisition experiments. Then there is the use of public data for benchmarking software and tools. This is what all of us bioinformaticians have to do: when we build a new tool, we need to compare it to the tools that existed before, and to do that we need to apply them to common data sets. That is why many data sets are reused to benchmark new algorithms and software; comparison with previous tools is essential for developing or assessing new ones. Usually the raw data is used as the basis and a new analysis is performed, but that is not always the case.
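As a toy illustration of the spectral-library side of "reuse" mentioned above, the sketch below collects identified spectra per (peptide, charge) pair and keeps the best-scoring one as the library entry. Real library builders compute consensus spectra and store much richer metadata; the records, scores and selection rule here are invented for illustration only.

```python
# Sketch: pick one representative spectrum per (peptide, charge) as a library entry.
from collections import defaultdict

identified_spectra = [
    {"peptide": "PEPTIDEK", "charge": 2, "score": 55.0,
     "mz": [175.1, 276.2], "intensity": [0.4, 1.0]},
    {"peptide": "PEPTIDEK", "charge": 2, "score": 72.5,
     "mz": [175.1, 276.2], "intensity": [0.5, 1.0]},
]

best = defaultdict(lambda: None)
for spec in identified_spectra:
    key = (spec["peptide"], spec["charge"])
    if best[key] is None or spec["score"] > best[key]["score"]:
        best[key] = spec

library = {key: {"mz": s["mz"], "intensity": s["intensity"]} for key, s in best.items()}
print(library[("PEPTIDEK", 2)])
```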
There are a lot of examples: dozens, even hundreds, of publications where public data sets are used for benchmarking purposes, and these are just a few of them. The third category of data reuse in proteomics is what we call reprocessing. In this case the data are reprocessed or reanalyzed with the intention of obtaining new knowledge or providing an updated view of the results; it mainly serves the same purpose as the original experiment. One example: a given data set can be reprocessed with a different algorithm or search engine, or using an updated sequence database, because protein sequence databases of course change over time. One particular kind of reprocessing is the meta-analysis approach, where data coming from many experiments are put together to extract new knowledge, for example in data integration studies with different purposes. There are two different approaches: using the data as submitted, or as available in the supplementary material of manuscripts, versus reanalysis of all the data together. Again, there are examples of papers published in recent years that do exactly that: they go to PRIDE, download data sets that are somehow related, reanalyze all of them in a consistent manner, and in the end the common knowledge extracted is of course greater than from the analysis of each original data set individually. One particularly nice type of data reuse is what we call repurposing, meaning that the data are considered in light of a question or a context that is different from the originally published study. There are different categories of this, but we could highlight proteogenomics studies, for instance, or the discovery of novel post-translational modifications using, for example, open modification search tools. In the context of proteogenomics, and I include this slide to introduce what proteogenomics is for those of you who do not know it, the mass spectrometry data is combined with genomics and transcriptomics information, typically by using sequence databases generated from DNA sequencing efforts, RNA-seq experiments, Ribo-seq approaches or, for instance, long non-coding RNAs; the search database is built using this kind of data, coming mainly from sequencing efforts. There are many applications for this type of study, but one that has been really popular using public data sets is the improvement of genome annotation. Genome annotation is of course very dynamic and keeps changing, and new types of experimental evidence are sought all the time in order to improve gene models. In many cases, instead of generating new data to do that, you can again come to PRIDE, download the data sets considered most interesting, and reanalyze them in a consistent manner using the kind of databases I just described. This has been done for human data sets, and there are even resources set up to do this routinely, for example for small open reading frames or long non-coding RNAs, as well as individual papers based on this kind of data reuse approach. It has been done for human data,
where it has historically been quite relevant, but in recent years it has been extended to model organisms such as mouse and rat, and to other organisms and microorganisms as well, such as Mycobacterium tuberculosis among others; these are some publications from recent years, by many different research groups, where this has been done. In the last few years a new category has also emerged which we did not include in the original review; it can be seen as a combination of the three categories I mentioned before: reuse, where the data is put in a different context, reprocessing, where the data is reanalyzed, and repurposing, where the data is reanalyzed for a different purpose. This, again, was mentioned by Mathias in the previous talk: the wide variety of applications of machine learning approaches in proteomics. In this case, as highlighted before by Mathias and others, public data sets are often reused as training sets, and this can be used to improve different aspects of the proteomics workflow, such as digestion, liquid chromatography, ion mobility, the prediction of peptide fragmentation, as mentioned before by Mathias, or peptide and protein identification. What is perhaps even more interesting is that there are other types of data integration studies where these approaches are used to predict biological properties, and more and more such papers are being published, many of them quite interesting. I think this is quite important for proteomics as a whole: not only to focus on the proteomics data analysis workflow, which of course is very nice and needed, but also to focus on using proteomics data to draw biological conclusions. So, to come to some conclusions for this part: a lot of public data reuse has already been done, but there are some bottlenecks I would like to highlight. The first one is that proteomics data can be quite complex, and it constitutes a steep learning curve in some cases for researchers working in other fields; for instance, on the campus where I work there are many people working in transcriptomics, and they always complain that proteomics is so difficult, and it is true that in some cases there is a steep learning curve for researchers from other fields. Then, mass spectrometry raw data is big, because of the large file sizes, and this is a limitation in many cases, although it is increasingly addressed by the availability of cloud infrastructure for data storage and analysis. Then, software and analysis methods are in some cases very dynamic, meaning that a tool can change very rapidly, and there is some work involved in adapting new tools, or new versions of tools, for data analysis. Then there is the problem related to MS vendors: Windows is often a requirement for analysis software, and that does not fit very well with the other point I mentioned, that because the data is very big there is a need for cloud computing infrastructure in order to analyze or reanalyze it; the dependency on Windows is a problem here, but luckily this is changing, and there are tools from recent years that have
made this much easier, because we do not have to rely on Windows as much as we had to in the past. The last bottleneck I wanted to highlight is the lack of sufficient metadata annotation in public data sets; again, this was mentioned before by Mathias, and it is the focus of the next section of my talk, which is about improving the metadata annotation of public data sets. I think it is important to give a little bit of historical perspective here: until very recently, just a few years ago, there were really not that many data sets in the public domain; it was only when ProteomeXchange was started, and with the push by journals, funding agencies and the community as a whole, that data sharing became popular in the field. At the time we started ProteomeXchange, our main objective was to make data sharing popular, and, also because proteomics is really an analytical discipline, the focus was not put that much on metadata at the level of the sample. What we wanted to achieve at the time was that data sharing became popular, and our idea was that once we achieved this we could raise the bar in terms of metadata. At the moment, what is needed to perform a submission to PRIDE, or to any ProteomeXchange resource, is a general description of the data set that includes information about the submitter, the organism, the tissues involved, post-translational modifications, data processing and similar things, but all of this is at the level of the data set; and then we have all the data files, such as the raw MS files, the result files and other file types. What is really missing is the link between the files and the samples. This has always been missing, for the reason I mentioned before; it was not only our decision, it was also the decision of the community, as I always say, because we had a number of meetings to discuss this. And, as Mathias highlighted before, this link is essential and required for improving the reuse of PRIDE data sets. So what was done in the last few years, and again it was not done only by us but by many other people in the community, was to develop something that was at least lacking as a first step: a common, standard file format that anyone could use for this, because no such file format existed before. Because we did not want to reinvent the wheel, we reused what our colleagues at the EBI, the Functional Genomics team in charge of ArrayExpress and Expression Atlas, had done in the past. They came up at the time with a format called MAGE-TAB, which has two parts. One is the Investigation Description Format, or IDF, which describes the experiment or data set; we already have this information for each data set submitted to PRIDE, because it has been requested from the beginning, so we do not actually need this file, since it can be generated internally once the data sets are submitted. The second part of the format is the SDRF, the Sample and Data Relationship Format, which describes the individual samples and how they relate to the data files. Again, because we did not want to reinvent the wheel, a new flavor of SDRF was developed specifically for proteomics.
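To give a feel for what this sample-to-file mapping looks like in practice, here is a tiny, hand-made SDRF-style table written out as tab-separated values with pandas. Only a handful of the columns defined by the SDRF-Proteomics specification are shown, and the values (instrument, file names, disease labels) are invented; consult the specification and its validators for the authoritative column set and value syntax.

```python
# Sketch: a minimal SDRF-style table mapping samples to raw files (values invented).
import pandas as pd

sdrf = pd.DataFrame({
    "source name": ["sample 1", "sample 2"],
    "characteristics[organism]": ["Homo sapiens", "Homo sapiens"],
    "characteristics[organism part]": ["liver", "liver"],
    "characteristics[disease]": ["normal", "hepatocellular carcinoma"],
    "assay name": ["run 1", "run 2"],
    "comment[label]": ["label free sample", "label free sample"],
    "comment[instrument]": ["NT=Q Exactive", "NT=Q Exactive"],
    "comment[data file]": ["sample1.raw", "sample2.raw"],
})
# Print the tab-separated representation that would go into a *.sdrf.tsv file.
print(sdrf.to_csv(sep="\t", index=False))
```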
SDRF-Proteomics, as it is called, is a tab-delimited metadata file format especially tailored for proteomics experimental approaches. I actually did not personally have that much to do with it; I contributed a little, but really not much. This effort was led by Yasset Perez-Riverol, but also by many people in the community, especially from EuBIC, and since this is also a EuPA event I want to highlight that, and also by people from the PSI, the Proteomics Standards Initiative. There is a GitHub repository that has been used to annotate data sets that were already available in the public domain, and I think that has been a success. Some changes were made in PRIDE to support SDRF-Proteomics files: submission via the PRIDE submission tool is supported, we generate BioSamples identifiers for each individual sample in the SDRF file, and we can now show this in the PRIDE web interface. As for the status of things at the moment: this was published last year, and many of the people in this webinar are authors. What we developed is what we call MAGE-TAB Proteomics, because it is based on the original MAGE-TAB for transcriptomics; it is composed of two parts, SDRF-Proteomics, which contains the mapping of the samples to the files, and the IDF, the general information at the level of the data set. SDRF-Proteomics files can now be included in data sets submitted to PRIDE. This is optional at present, but I have to say that people are already generating quite a number of these annotated files themselves: there are at the moment more than 400 annotated data sets in PRIDE; about half of them come from the GitHub effort I mentioned before, but the other half have already been annotated by users interested in providing this information to facilitate data reuse. The files can currently be created using Excel or analogous software, but we do not want to stop there; the idea is to make the creation of these files much easier, again to enable easier reuse of public data sets. One way to improve adoption is to promote support for the format in popular tools in the field. The other way, which we have already tried but which is quite challenging and where we have not succeeded so far, is to develop a web-based annotation tool that can be used by anyone in the community. We had an initial version, but we did not think it was good enough, because as soon as a data set contains many files it becomes quite challenging to make it work the way we would like. We will keep trying, and maybe at some point we will be able to provide this to the community, but at the moment it is still work in progress. Once these kinds of tools are more widely available, our idea would be to make submissions including this information mandatory. It cannot be mandatory right now because we are receiving around 500 data sets per month and, as a relatively small team, we cannot help every single user create the files in the right manner, but this is of course one of the final goals we would like to achieve. Okay, and in the last part of my talk, in the last few minutes I have, I wanted to mention some in-house efforts related to the reuse of data sets and the dissemination of proteomics data to what we call added-value resources. Again, Mathias introduced many of the concepts; now I just want
to highlight the general idea, which is that individual submitters or large-scale projects submit their data to proteomics resources, with PRIDE at the center because, again, it is the most popular resource, and this data is then reused and made available in bioinformatics resources that are not only accessed by proteomics people but also by others who may not be experts in proteomics, because we really need to make proteomics data available in a wider context. These are resources like UniProt, Ensembl, Expression Atlas and ArrayExpress, LNCipedia, the SysteMHC Atlas for immunopeptidomics, and of course, very importantly, ProteomicsDB, as Mathias showed before. In the context of PRIDE, rather than setting up new resources for proteomics data, which are more difficult to sustain in the medium to long term, what we are trying to do is take advantage of our location at the EBI and work with colleagues in the institute, so that we can integrate proteomics data with other types of omics data in popular bioinformatics resources; in this way proteomics data is made more accessible to biologists. We are doing this at the moment with different resources: protein sequences and PTMs with UniProt, protein expression information with Expression Atlas, which I will briefly describe, another project with the Ensembl genome browser for proteogenomics information, and with MGnify, a resource that mainly stores metagenomics experiments, where we are trying to integrate metaproteomics and metagenomics data. The projects are all quite different, and since we are a small team trying to do them all at the same time, they are at different stages of development right now. What I will mention next is in a way complementary to what Mathias explained before for ProteomicsDB. One of the resources we have worked with most closely is Expression Atlas, the EBI resource that so far only stored gene expression information; in the last couple of years we have been able to also integrate there quite a lot of protein expression information coming from public data sets, so that quantitative proteomics data sets are now systematically integrated into Expression Atlas. I do not have much time to talk about it in detail, but we are doing this for data-dependent acquisition data sets: we have done it using baseline data coming from cell lines, tumor tissue data, human baseline tissues, and mouse and rat baseline tissues; we did this first because it is of course the easier way to start. We have also integrated some baseline data sets coming from data-independent acquisition approaches, and we have done some pilots on differential analysis, but we are not yet confident enough to promote that too much, because it is more complicated and I think we need to work a little more on differential data sets. The workflow is always the same, and again this was highlighted by Mathias: we take data sets that are available in the public domain, mainly from PRIDE; they need to be curated, because in many cases they do not have enough metadata annotation, as mentioned before; and once they are manually curated, they are put in the right format.
The data sets are then reanalyzed, in this particular case using MaxQuant for the DDA data sets; some post-processing takes place to generate normalized protein abundances; and after quality assessment and some further post-processing the data is made available through EBI's Expression Atlas. We have done more, but the ones I can highlight here, because they are more recent, are 24 human data sets covering 67 tissues and 31 organs, and then 14 mouse data sets and 9 rat data sets covering 14 organs and 34 tissues; "organs" here is just a way to group different tissues together, so that the aggregation is done at a level above the individual tissues. I do not have much time to go into detail, but once we have all the results it becomes possible to make comparisons like the one shown in this chart: a global expression correlation analysis between the three species, between human and mouse, human and rat, and mouse and rat, for the tissues common to them, namely liver, lung and testis, and then to visualize the orthologous genes in the three species and how they are represented. So it is possible to do this kind of ortholog comparison across the three species. If you want to know more about this, I know this was very brief, there are papers we have already published in Scientific Data about protein expression in cell lines and in different tissues, one that is now almost accepted for publication on baseline mouse and rat expression, and another one on baseline human tissues that is still under review. I would also like to highlight that we have done this for data-independent acquisition data sets as well; in that case the pipeline is again the same: curation, putting the data in the right format, in this case analysis using OpenSWATH with a common spectral library, plus PyProphet and TRIC, then again quality assessment and normalization, and at the end the data goes into Expression Atlas. I do not have time to talk about it in detail, as I am running out of time, but the manuscript is available on bioRxiv and was just accepted last week in Scientific Data as well. These are some of the visualizations that can be seen in Expression Atlas, where it is possible to see protein expression and gene expression side by side in the web interface. And, just to finish, another thing we are doing, in collaboration with the UniProt team, is to try to address the problem that PTM data is really underrepresented in UniProt; phosphorylation is of course the first priority. What we are trying to do, also following the FAIR principles, is to improve the traceability between the PTMs annotated in UniProt and the mass spectrometry data stored in PRIDE. So we have set up systematic reanalysis pipelines for public PTM enrichment data sets, in collaboration with our colleagues at PeptideAtlas, Eric Deutsch and others. The idea is that the data that will appear in UniProt is linked to the experimental data in PRIDE, or in PeptideAtlas where that is the case. What I can report so far is a benchmarking paper aimed at improving the reliability of phosphorylation data analysis, which is available on bioRxiv; we have also
carried out two first reanalysis studies of PRIDE phosphoproteomics data sets, and we have developed an initial version of data formats and analysis guidelines, looking for community agreement, because we see this as an ongoing effort in the medium and long term. And, just to note, many of these ideas came from previous work in collaboration with Pedro Beltrao's group, from when he was still at the EBI. But I am running out of time, so to summarize: with PRIDE Archive and ProteomeXchange there is a really increasing amount of data available in the public domain, and it genuinely enables big data approaches in proteomics; improvements in metadata are required to enable better reuse of data sets, and I mentioned the progress here around the development of the MAGE-TAB Proteomics file format; and we are working on different activities related to data reuse and data dissemination, as I explained for our efforts on quantitative data sets with Expression Atlas and on PTM annotation with UniProt. These are not only our efforts but efforts by many other people in the community, and what is nice to see is that new resources have been set up in the last two or three years that are essentially based on reusing public proteomics data; you can visit some of these resources and give them a go. So, to really finish, I want to acknowledge all the people who work on this, especially the PRIDE team, in particular the work done by Yasset Perez-Riverol and Mathias Walzer, our collaborators at the EBI and in ProteomeXchange, our other collaborators, and of course all of you who make your data available in PRIDE and thereby enable this kind of reuse. Thank you very much, I am happy to take any questions, and I am sorry I went about three minutes over time. Okay, thank you very much, Juan Antonio, and with that we are through the session; Mathias is there as well. We have one question in the chat so far, which is aimed at Mathias; I guess you have read it, Mathias, but I will read it out loud for the recording: how are you doing the batch correction to compare across samples on slide 10? Maybe it would be useful to share your screen and show slide 10 again, so that we all know what this is about; I remember it because I read it when it came in, but other people may have forgotten the slide. Yes, let me just sort out my screens and bring slide 10 up. This was slide 10, I guess, right? Yep. So, as mentioned on one of my last slides, integrating data from different resources or projects is still quite a challenge, and I think the way we do it in ProteomicsDB is likely not the gold standard. What we do is essentially calculate iBAQ values from the quantification data we have and then bring them to the same scale using a normalization similar to what is done in transcriptomics, where we essentially divide by the summed intensity of all proteins identified. That aims to make sure that the protein expression distributions are located at roughly the same place; in other words, we transform the iBAQ values into iBAQ values expressed in parts per million, so this protein is present in that
sample at, say, 250 parts per million with respect to all the proteins expressed in that sample. At least in our tests, that seems to make the plain numbers we get independent of the depth at which the proteome was analyzed: with very shallow fractionation you may only identify 3,000 proteins, with heavy offline fractionation you may identify 11,000, and this normalization seems to take care of that effect. What we cannot yet do systematically is account for differences in, for example, digestion: if one project does an overnight digest and another does a one-hour digest in some pressure cooker, we cannot really account for that, and without the necessary metadata I think it is also very difficult to come up with normalization strategies that could; we simply do not have that annotation systematically for a large number of projects. So the batch corrections that would address effects introduced in the wet lab, such as differences in digestion or fractionation, we currently cannot perform, and that is also why, if you look at expression values in ProteomicsDB, I would not read too much into small differences: large differences are likely real, but with small differences I would be careful. What we do in ProteomicsDB, where the data allows it, is shown here in the bar plot: some bars have an error bar, which shows the minimum and maximum expression we have observed for that particular tissue when multiple samples, projects or experiments cover it, and that gives you an indication of the range you may expect given the different protocols used. Okay, maybe I can comment a little on that as well, because I did not have time to explain it in detail; I can just describe the approach we follow, which is not by any means perfect, because I do not think there is a perfect way to do this. On top of what Mathias mentioned, normalizing to the total intensity per run, what we do afterwards is take the distribution of intensities across the proteins, create a number of bins, put the proteins into those bins, and then assign one value to each bin. We have tried different numbers of bins and in the end came up with five as a kind of optimal number that works well for us; it helps to make the data more comparable and in some cases also reduces batch effects. I am not saying this is perfect; it has the limitation that sample preparation can differ, especially between experiments with fractionation and those without, but this approach at least helps a little in making the data more comparable. It is not a very sophisticated approach; we also tried other methods, like ComBat and limma, R packages that were essentially developed for transcriptomics data, but we actually found that the simple categorization into bins worked better, at least for us and for the data sets we have; that is basically what I wanted to say.
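The two normalization ideas described in these answers can be sketched in a few lines. Below, on an invented protein-by-sample intensity matrix, each sample is first rescaled so protein values become parts per million of the summed intensity, making them less dependent on proteome coverage, and then each sample is optionally discretized into five expression bins to blunt residual batch effects; the data, the number of bins and the exact binning rule are illustrative assumptions, not the production pipelines.

```python
# Sketch: per-sample ppm rescaling, followed by per-sample expression binning.
import numpy as np
import pandas as pd

intensities = pd.DataFrame(
    {"sample_A": [1e6, 5e5, np.nan, 2e4], "sample_B": [8e5, np.nan, 1e5, 1e4]},
    index=["prot1", "prot2", "prot3", "prot4"],
)

# (1) parts-per-million per sample: divide by the summed intensity of that sample
ppm = intensities.div(intensities.sum(axis=0, skipna=True), axis=1) * 1e6

# (2) five expression bins per sample (1 = lowest, 5 = highest observed value)
bins = ppm.apply(lambda col: pd.qcut(col.rank(method="first"), 5, labels=False) + 1)

print(ppm.round(1), bins, sep="\n\n")
```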
Just to add a small note on this: I guess we are facing the same problem, in that in most cases, if you are looking at data from two different experiments, they will likely not share a single sample, often not even the same cell line. So how do you normalize two data sets that have, in quotes, nothing in common? It is very tough to define something like a ComBat normalization scheme if there is nothing you could argue should be the same. Yes, it is a very interesting problem that will keep quite a few people busy over the next years, but it is an interesting one, so if anybody online feels inspired to solve it, by all means go for it; everybody will be eternally grateful, I am sure. Okay, there is a long question from Harald Barsnes to Mathias and Juan Antonio: can you comment on how to deal with the potential changes resulting from reprocessing data compared to the originally published findings? Is there not a risk that the reprocessed results could differ, potentially even substantially, from the manually curated or validated findings included in the original publication? And how should the field deal with this growing separation between data generation on the one hand and data interpretation on the other? Harald, you could actually have asked that yourself, I suddenly realize, but then again, you would probably rather hear me talk, so why not. I can start, I guess, and then we can ping-pong a bit. It is a very interesting point, and we have faced this in particular with ProteomicsDB in the past, for example with the Arabidopsis proteome paper: the analysis was done with MaxQuant, the figures for the paper were generated from the MaxQuant results, and then we incorporated the results into ProteomicsDB. But since in ProteomicsDB we have to do our own FDR estimation, to make sure to the best of our ability that we stay at one percent across the entire database, certain things inevitably get cut away. So there is necessarily a difference between the analysis done on a single file or a single experiment and the same data seen in the context of a database collecting information from multiple resources. It feels like there is no way to circumvent that, and we may just have to accept it to a certain degree, because the alternative approach is to import everything into ProteomicsDB first and base the figures on the expression values exported from it, which we did for a recent mouse study; but then, technically, we are not allowed ever to import anything for mouse again, because otherwise we risk kicking out proteins or peptides, or adding new ones, once additional identifications raise or lower our confidence in some of them. So again, this is one of those challenges where cleverer brains may find a solution; it is an open issue. Yes, I think that problem is very difficult. First of all, it is worth saying that it is not unique to proteomics: other omics technologies have it too, and the less mature an omics technology is, the more room there is for getting different results when new software arrives. So this is a problem for omics in general, not just proteomics. And to
Yeah, I mean, I think that problem is very difficult. First of all, it's worth saying that this is not a problem unique to proteomics; there are other omics technologies, and the less mature an omics technology is, the more room there is for getting different results when new software arises. So this is really not a problem just for proteomics, it's a problem for omics in general. To me, the way to approach this is to be transparent: to say we obtained these results when we did the analysis this way. That involves providing enough metadata, but also providing maybe even the exact version of the tool that was used, perhaps in a container, something that can be reproduced by others (see the small provenance sketch after this exchange), and also developing infrastructure that can really highlight the differences between different versions of an analysis, which is indeed already very challenging for mass spectrometry in general. I don't know how to deal with this from a philosophical perspective, because there are many different possible scenarios, but definitely one way is to enable the community to make the decision; we shouldn't be making the decision ourselves, in a way, and the way to enable that is to be very transparent, to enable reproducibility of the analysis, or at least to make sure people have access to all the important pieces that need to be considered when deciding whether something coming out of a reanalysis is really novel or not. That is at least my take on a problem which I think is basically impossible to solve, but maybe other people have other points of view.

Yeah, I agree, it's a big challenge for sure. But to the last part of my question, in terms of separating the generation of data: the way it has been done in the past, at least, is that you generate your data, you write a publication based on that data, and then you publish. But there now seems to be more of a trend that you put the data into PRIDE and then someone looks at it in a bigger context in order to interpret it. So how do we deal with that, if at some point the people generating the data are no longer the ones interpreting the findings?

I think, at least from my perspective, Harald, and again I could be wrong, this is just my opinion, that the people doing that are only a small subset of the community as a whole. Many people go from project to project and from publication to publication, because this is the way science works: you need to get your PhD, you need a new publication, a new grant, a new project. It's only a subset, I mean many of the people that are on this call, who can really afford to go back to the stored data and make new findings with it. I don't think the normal experimental lab can afford to do this.

So you don't think there will be a requirement in the future for bigger and bigger data sets, simply because they are available?

Maybe for a subset of the community, but in my opinion there will be others who will not change their way of working, at least not substantially, also because their lab is not built around this kind of expertise. If you want a lab that does both, they need to be able to operate the latest mass spectrometer, plus all the separation and everything else needed to perform an optimal proteomics experiment, plus all the informatics; it's really very difficult to have that skill set in the same place. But again, this is only my point of view. Mathias?
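Returning briefly to the transparency point above, one minimal way to record how a reanalysis was produced is a small machine-readable provenance file written next to the results, pinning the exact tool version and container image. The field names, the hypothetical tool name and the example image reference below are placeholders chosen by the editor, not a PRIDE or ProteomicsDB convention.

```python
import json
import platform
import sys
from datetime import datetime, timezone

# Placeholder values throughout; swap in the real tool, version, image digest
# and input accessions used for the actual reanalysis.
provenance = {
    "analysis_date": datetime.now(timezone.utc).isoformat(),
    "search_engine": {"name": "ExampleSearchTool", "version": "1.2.3"},
    "container_image": "registry.example.org/reanalysis@sha256:<digest>",
    "parameters": {"psm_fdr": 0.01, "enzyme": "trypsin", "missed_cleavages": 2},
    "input_files": ["PXD000000_run01.raw"],
    "python_version": sys.version.split()[0],
    "platform": platform.platform(),
}

with open("reanalysis_provenance.json", "w") as handle:
    json.dump(provenance, handle, indent=2)
```

However it is stored, the point made in the discussion stands: readers can only judge whether a reanalysis result supersedes the original findings if the exact software, versions and parameters are available alongside the data.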
Yeah, I don't have any statistics on this, and I guess Juan may be the one to generate them, but it feels like publishing a story that is just based on big numbers is becoming trickier these days, because it's more of the same in some way: if we look at thirty other tissues of some other organism, it's the same thing again. So it feels like it's more difficult these days to simply go for a large number, and the cool proteomics applications are going to be the small ones, investigating one particular problem rather than branching out a lot, if that makes sense.

Yeah, okay, it's probably a bigger discussion for some future time. I was going to say, if this were a real meeting, this is where I would advocate taking this discussion to the bar. Unfortunately that might be a little bit difficult, as the three of you are a little too far away from each other to find a convenient bar, but the next time we meet in person we can do that. And actually I have some very last slides about Epic Excess before I send you all on your way, which offer you a potential opportunity to go to a bar. Let me just very quickly show you again: the next Epic Excess online webinar will be about integrating proteomics and genomics technologies on Thursday, May 19, and you can go to the Epic Excess website to register for that webinar, just like for this one. And then of course the bar opportunity is the live workshop from the 26th to the 28th of September in Tartu, where again you're very welcome, and you can register online on the Epic Excess website. With these two reminders, I think we can put a stop to this webinar. Thank you very much to the speakers for being here today, thank you very much to the audience for being here, and I look forward to encountering you either at one of the next Epic Excess webinars or in real life at one of the many meetings that will hopefully spring up again in our field. So thank you all very much and enjoy the rest of your day.