So hello everyone, sorry for being late, but I'm very happy today that we have Julianus and, sorry, I cannot pronounce your name. That's fine, it's Yasset. To have a bytesize talk today. I will hand it over to you to introduce yourselves, if you want to.

Yeah, so I am Julianus, Julianus Pfeuffer. I am a postdoc at the Zuse Institute Berlin. I did my PhD at the University of Tübingen and the Freie Universität Berlin, and I was for a long time an OpenMS contributor and maintainer. A lot of the pipeline that we will present today is based on OpenMS, and it will be about mass spectrometry.

Okay, I can also introduce myself. I am Yasset Perez-Riverol, I am the team coordinator of the PRIDE database, the largest proteomics database, at EMBL-EBI, and I am also one of the developers of quantms.

Yeah, and if there are no further questions at the beginning, I can start sharing my screen. Would that be okay, Franziska? Yes, right, as co-host you should be able to. Okay, let's see. Do you see the presentation, or do you see some of my PowerPoint things? We see the presentation. Perfect.

I think it will be about a Nextflow workflow this time. It's not a super recent addition to the nf-core community platform, it has been there for a little bit longer, but we just recently released a 1.1 version that is much more stable and much more up to nf-core standards, and we thought that would be a great time to introduce it. It is a workflow for, as the name implies, quantitative mass spectrometry data analysis, following all the nf-core standards. It is meant to be very reproducible and also applicable to large-scale analysis, for example on those big public repositories like PRIDE, which is where Yasset is from.

So, what are the areas of application of our workflow? It is planned as an all-in-one workflow for the analysis of quantitative mass spectrometry experiments in general, by which I mean metabolomics, proteomics, proteogenomics, but the current focus, and the topic that we started with in the workflow, is relative quantification of proteins or modified proteoforms based on mass spectrometry experiments.

First I will ask the question: why do people do proteomics, and how is it different from the usual genomics that we see here in nf-core? One nice example that people always give is the difference between a caterpillar and a butterfly. They share exactly the same genome, apart from maybe some very slight mutations, across the stages of life, yet they have a vastly different proteome, not only in the amounts of the proteins that are expressed, but also in the types of proteins that are expressed and how they are modified. All of this gives a much better representation of the actual phenotype of an organism or a cell. So proteomics can be used in addition to, or instead of, genomics or even transcriptomics.

One technique to get the quantities of proteins in, for example, a cell is mass spectrometry, and I would say the most common way to do that is via liquid chromatography coupled mass spectrometry. That means you first digest the proteins in your sample into peptides, in an Eppendorf tube, and then subject them to liquid chromatography to separate them by their physico-chemical properties, to make the analysis easier.
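As just mentioned, the sample is digested into peptides before it ever reaches the instrument. As a small illustration of what that digestion means computationally, here is a naive in-silico sketch of tryptic digestion; it is only a toy under simplifying assumptions (real tools also model missed cleavages and other enzymes), not the pipeline's actual code.

```python
import re

def tryptic_digest(protein: str) -> list[str]:
    """Naive in-silico trypsin digestion: cleave after K or R, but not before P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

# Example with a short made-up sequence: no cleavage after the R, because P follows.
print(tryptic_digest("MKWVTFISLLRPLFK"))  # ['MK', 'WVTFISLLRPLFK']
```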
So you spray the peptides into the mass spectrometer, and the mass spectrometer can then measure the ions, and the amount of ions that are there, based on their behavior in a magnetic field. What you get out is a so-called mass spectrum, where you can see the intensity, which is related to the amount of ions that were there, for a specific mass, or to be precise, for a specific mass-to-charge ratio.

The problem with this is that those mass spectrometry experimental setups can be very complex. People add new types of mass spectrometers, new types of experiments, they invent new labeling strategies for comparing different samples, and so on. So here I give a little overview of the most common strategies in quantitative proteomics and want to highlight which of them are supported by quantms. Overall, we only support relative quantification, since absolute quantification can usually only be done with certain standards, and few people do it unless they have a really exciting project going on. But within relative quantification you have two big subtypes: labeled and label-free relative quantification. Label-free is usually cheaper, because the labels are expensive, but the analysis is sometimes a bit more complex. Here you can have so-called data-independent acquisition and data-dependent acquisition, and the quantification can be feature-based or via spectral counting. Regarding labeling, you can label the proteins, or the protein pieces, the peptides, in vitro, or you feed your organism or cells certain labeled amino acids, in vivo. Our focus for labeled quantification was the so-called TMT and iTRAQ strategies, which are very similar in terms of analysis. So if you have any data set that was gathered with one of the green strategies here, quantms should be very useful for you.

Here's an overview of the pipeline. Everything starts with the spectra, in mzML or in raw format. mzML is open; it's an XML-based format, so sometimes a bit more verbose and bigger. But we can also read raw files from Thermo Fisher instruments through conversion. We do some preprocessing on the spectra, as well as on the protein database that you give the pipeline to say which proteins you think are in your sample and would like to identify and quantify. Then we have three different branches, let's say, depending on which strategy the experiment was based on: data-dependent label-free in blue, data-dependent isobaric labeling (TMT or iTRAQ) in red, and data-independent in green. The top ones are mostly done by OpenMS tools, the framework for mass spec analysis, while the lower one is done by DIA-NN, a separate package, where we were in close collaboration with the author to make it as efficient as possible in a distributed computing environment.

I will go over the steps one by one. First, a bit more detail about the input. I mentioned the mass spectra already. The second input is the experimental design that you need, and we highly recommend using the sample and data relationship format, SDRF. It's a community-developed, tab-separated format for the data sets that are, for example, in PRIDE, and we annotated a lot of them manually for our reanalysis. It contains information about the contents of the samples, such as the organism.
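To make that a bit more tangible, here is a minimal sketch of how such an SDRF design could be produced. The column names follow the SDRF-Proteomics conventions, but the exact column set and all values here are illustrative assumptions, not a complete or validated design.

```python
# Minimal sketch of an SDRF-style experimental design (illustrative values only;
# assumes pandas is installed).
import pandas as pd

design = pd.DataFrame({
    "source name": ["sample 1", "sample 2"],
    "characteristics[organism]": ["Homo sapiens", "Homo sapiens"],
    "characteristics[organism part]": ["liver", "liver"],
    "comment[label]": ["label free sample", "label free sample"],
    "comment[data file]": ["run1.raw", "run2.raw"],
    "factor value[phenotype]": ["normal", "tumor"],
})
design.to_csv("design.sdrf.tsv", sep="\t", index=False)
```

Such a file, together with a protein database, is roughly what the pipeline consumes, along the lines of `nextflow run nf-core/quantms --input design.sdrf.tsv --database proteins.fasta --outdir results -profile docker`, although the exact parameter names may differ between releases.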
It also records whether the samples are labeled or not, the experimental setup, but also the biological question, like which condition a sample belongs to. The protein database is in the usual FASTA format and can either be directly downloaded from SwissProt or TrEMBL, or manually created by some proteogenomics studies you did before. It can be with or without so-called decoy proteins, which we need later for false discovery rate estimation.

In the preprocessing we, as I said, convert and index all of our spectra, and the default format will be mzML, so everything else will be converted into mzML first; keep that in mind. We combine the information from the SDRF and the Nextflow parameters. Currently it's a bit mixed where you can set certain parameters, because we also wanted to support very simple designs, where a lot of the information is implied. We do some sanity checks and conversions: into designs for the specific tools, but also into units or certain vocabularies for specific tools. Regarding the database, we can also generate the decoys for you; that is usually done by reversing or shuffling the sequences in the database.

Then we perform identification with common so-called database search engines. Currently you can select between MSGF+ and Comet, or both of them, in which case they will be combined probabilistically by an OpenMS tool called ConsensusID. We then offer a rescoring mechanism that uses more features than just the similarity between the predicted and the observed spectrum. This is currently only possible with the SVM-based tool Percolator, but we are heavily working on integrating deep-learning-based scores from, for example, MS2Rescore.

The false discovery rate estimation is done based on the well-established target-decoy approach; a minimal sketch of the idea follows at the end of this part. We offer FDRs on multiple levels, the PSM (peptide-spectrum match) level, peptide level, protein level, and protein group level, and on different scales, either for a specific sample only or for the whole experiment. And we can do the so-called picked FDRs that were recently published and show a bit more sensitivity in large-scale experiments.

For the quantification of the peptides in label-free quantification, we use the OpenMS proteomicsLFQ tool, which was also the main part of the old nf-core proteomicslfq pipeline. That means if you're using that one, it is fully integrated and superseded here, so you may switch to quantms. proteomicsLFQ performs the following tasks: it identifies quantifiable features in your mass spec data, which can be done targeted, by looking for specific IDs, or untargeted, by just looking at isotope patterns and elution shapes. Then it does retention time alignment and links the identifications to get the best matches across all samples. Optionally, you can also transfer identifications to features that do not have an identification, or you can re-quantify parts of your MS experiment: if there was a feature in all samples but this one, or in most samples but this one, and you couldn't find one here, then you can still extract the last bit of the signal.

In isobaric labeling it's much easier, because it's just based on the intensity of so-called reporter ions. We support most TMT and iTRAQ plexes, where the plex just tells you how many channels you can multiplex into one sample, which means how many samples you can have in one mass spec run, let's say.
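Before moving on, here is the promised minimal sketch of the target-decoy idea: decoys can be generated, for example, by reversing sequences, and the FDR at a score threshold is then estimated from the proportion of decoy hits that pass it. This is a simplified toy illustration, not the OpenMS implementation that quantms actually uses.

```python
# Simplified sketch of decoy generation and target-decoy FDR estimation.

def make_decoy(sequence: str) -> str:
    """Create a decoy by reversing the protein sequence."""
    return sequence[::-1]

def fdr_at_threshold(hits: list[tuple[float, bool]], threshold: float) -> float:
    """hits are (score, is_decoy) pairs; FDR is estimated as decoys/targets above the threshold."""
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    return decoys / max(targets, 1)

# A few peptide-spectrum matches with made-up scores.
psms = [(0.95, False), (0.90, False), (0.80, True), (0.75, False), (0.60, True)]
print(fdr_at_threshold(psms, 0.7))  # 1 decoy / 3 targets = 0.33...
```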
Back to isobaric labeling: we also support so-called SPS-MS3, which introduces a third fragmentation level in the mass spec. Then, once you have quantified the peptides, you are usually interested in the actual proteins that they come from. Therefore, we have two different inference techniques implemented: a Bayesian one, with the OpenMS tool Epifany, but also a simple rule-based aggregation of peptides to proteins. Regarding protein quantification, we support the common strategies, like top three peptides per protein, or iBAQ, a common strategy that normalizes by the length of the protein, for example; a small sketch of these simple aggregation strategies follows a bit further below. Those come from OpenMS, but we also support statistical post-processing tools like MSstats and Triqler, which have much more elaborate statistical models and also include significance testing for comparisons between samples, conditions, contrasts.

For the third branch, the data-independent acquisition branch based on DIA-NN, we made it fully parallelizable via a multi-step analysis. First you do an in-silico library prediction and a pre-analysis for every sample, and only afterwards an empirical, data-based library generation and a final analysis on the full experiment. It is also compatible with MSstats, which means the output will of course not be comparable in the quantities, but it will be comparable in format to the other branches of the workflow. And, as with all the other branches, the results can be converted into mzTab, which is, for smaller experiments, a human-readable, tab-based format for the quantities and identifications of such experiments, and you can use it immediately for upload to PRIDE, for example, or for publication, which is usually recommended by the journals.

A bit more detail on our general outputs. As I said, we have this mzTab for all the quantitative and identification-related information. mzTab in general contains, on the right side here, metadata, a protein section, a peptide section, a peptide-spectrum match section, and, for metabolomics, also a small molecule section. It's a community standard, so it's used by a lot of projects, and it's very helpful to have it for uploads or for journals. Then, from our statistical post-processing, we can get heat maps or volcano plots for the comparisons between conditions, which we can specify in the parameters of the workflow, for example. But we also have a full pmultiqc report, which is based on a plugin that we wrote for MultiQC specifically for proteomics. It includes a quality control heat map over all samples, but also detailed plots per sample, and a detailed, searchable table of the results that is connected to an SQL backend.

These are some examples of our outputs. In the first picture, you can see the experimental design that you have given and how it was interpreted by our tool. These designs can get very complex in proteomics experiments, because you can also fractionate your samples, and together with the usual biological and technical replicates it can get quite complex. In the lower part, you can see a heat map of some aggregated quality control metrics for specific samples: things like what percentage of contaminants were identified, the average peptide intensity, how many missed cleavages from the digestion we could find, what the rate of identifications was out of the overall number of spectra, and so on.
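And here is the promised sketch of the two simple protein aggregation strategies. Both are toy versions under stated assumptions: top-3 averages the three most intense peptides of a protein, and the iBAQ-style value normalizes the summed intensity by the number of theoretically observable peptides, which is what makes it a protein-size normalization; quantms itself uses the OpenMS implementations.

```python
# Toy sketches of top-3 and iBAQ-style protein quantification.

def top3(peptide_intensities: list[float]) -> float:
    """Top-3: average of the three most intense peptides of a protein."""
    top = sorted(peptide_intensities, reverse=True)[:3]
    return sum(top) / len(top)

def ibaq(peptide_intensities: list[float], n_theoretical_peptides: int) -> float:
    """iBAQ-style: summed intensity divided by the number of theoretically
    observable peptides, i.e. a normalization for protein size."""
    return sum(peptide_intensities) / n_theoretical_peptides

intensities = [1.0e6, 5.0e5, 3.0e5, 1.0e5]
print(top3(intensities))      # 600000.0
print(ibaq(intensities, 10))  # 190000.0
```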
Then there is some more detailed information about specific samples, for example the number of spectra on each level, like fragment spectra or survey spectra, how many of them were identified by each of the search engines, how many were identified after ConsensusID, and so on.

Lastly, one application that this workflow already had was a reanalysis of a large part of PRIDE. We really sat down with a large group and annotated PRIDE data sets in the sample and data relationship format, which meant a lot of looking into papers, contacting authors, and so on. But it also means that if you now want to reanalyze something in a different way, you can just download the data from PRIDE, or give URLs, which Nextflow of course handles, to the FTP, and reanalyze it with different settings, because the SDRF is already available. We then reanalyzed each entry with our quantms. The good thing is we could analyze many of them, because we made it very robust with bare default settings, and it supports a lot of different experiment types, as you have seen. In the end, we just combined and visualized the results, in this case per data set or per tissue, because a lot of data sets are very specific to a certain tissue. We're currently writing a publication on that. That was one of the first applications, yes. And I think that's it from our side; we're happy to answer all of your questions.

Thank you so much. I'm just going to remove the spotlights. So if there are any questions from the audience, you should now be able to unmute yourself and ask the question right away. Are there any questions from the audience? If not, I actually have a question, maybe a bit selfish. It's very nice to see that we have at least some pipelines that are not NGS-based. So I was wondering, what made you choose Nextflow and nf-core for making this pipeline?

The first thing was the incredible integration with all those large-scale high-performance computing clusters and clouds, which we have not seen in other workflow managers. Of course, I was a little bit biased, because I knew some people from Nextflow, but I think it turned out to be the best choice in hindsight anyway. The nf-core team was very helpful in implementing all of this. And the AWS tests were also super nice, because we, as a university, barely have any capability to test on Amazon clouds or anything like that; it always costs. And I think it gets a better reach, also into industry, by supporting clouds.

Maybe in the same vein, did you find any problems that were specifically there because it is not NGS, and because we're often very geared towards NGS? Yes, of course; no big problems, but some things in your templates, let's say. What was it, for example? I think in the beginning you had a FastQC parameter that was always supposed to be there, and we of course had to remove it. Now, whenever a template update comes, we have to remove it again, and things like that. But yeah, minor things.

Okay. Are there any more questions from anyone? I would have one. Hi, great talk, thank you. You mentioned small molecule MS experiments as a future possible application of quantms. How far is that thought out, or where does this stand? Yeah, so implementation-wise, we have a colleague who created such a workflow based on very similar tools to the ones we already have.
That means the OpenMS ones, but also some other tools like Sirius for small molecule database search, in a competitor language, Snakemake. So at least we see that it's a very feasible workflow. It should be a rather simple translation of the workflow, but we also want to check with the existing metabolomics workflow to see if we can combine them. We still have to check how compatible the SDRF and the mzTab would be, because for everything that we want to include into quantms, we definitely want to start from an SDRF and output an mzTab or another future community standard file format. And we think it should all be possible, since there's also an mzTab for metabolomics, and I think SDRF should have no problems at all with having some metabolomics-specific annotations.

Yeah, I think we have already started to support metabolomics with SDRF; we have had the first call around how to do it. And as you said, I think this is a really important point. We have tried to make it so that in quantms the starting point and the end of the workflow are standard file formats. For anyone who wants to bring other use cases into quantms, like proteomics, like immunopeptidomics, or any other use case that wants to jump into mass spec quantification in quantms, you really start from one standard file format, something the data out there is already in, and end up with another standard file format, which in this case is mzTab, but it could be something slightly different in the future.

Great, thank you both for the elaborations. Thank you very much. Do we have any more questions from the audience? It doesn't seem so. Then I would like to thank again Julianus and Yasset, and of course, as usual, also the Chan Zuckerberg Initiative for funding our bytesize talks. So thank you very much, everyone, and see you. Bye. Bye.