Okay, I think we can start now. So, welcome everybody to this proteins and proteomes session that I'm very happy to co-chair with Julien Racle. Julien did a PhD in computational biology at the EPFL in the group of Vassily Hatzimanikatis, and then a postdoc in the group of David Gfeller at the Ludwig Institute for Cancer Research in Lausanne. He is now a senior scientist in David's group, working in computational cancer immunology. And for those of you who don't know me, I'm Lydie Lane, and I joined the SIB long ago, in 2004. I worked as a Swiss-Prot curator for a few years before creating the CALIPHO group with Amos Bairoch in 2009, where we developed neXtProt, a knowledgebase on human proteins. So in this session, we will have two 15-minute talks and five 5-minute talks, and this is very, very short to encompass all the aspects of the protein world. So let's start right now. Don't forget to attend the poster session afterwards, because there is also very interesting protein-related science to discuss there. As in all the other sessions, you are very welcome to post your questions in the Q&A chat box. If time allows, we will ask one or two questions after the 15-minute talks, but all the other questions will be addressed in the follow-up meet-the-speakers session. To facilitate our work as chairs, please state your name and indicate the name of the presenter to whom the question is addressed when you post your question. So now we will start, and I leave the floor to Julien. Thank you very much. So I'm very happy to introduce Markus Müller, who is a senior scientist in the group of Mark Ibberson at Vital-IT, and who is also working with Michal Bassani-Sternberg at the CHUV. He will present research that they recently published on how to accurately identify non-canonical HLA-bound peptides. So the floor is yours, Markus. Thank you. All right. So thank you, Julien.
So the work I'd like to present here stems from a collaboration between the SIB and Michal Bassani-Sternberg's group at the CHUV. I'd like to thank especially Chloé Chong, who did this great work as part of her PhD, and also Michal Bassani-Sternberg's group and our collaborators from Lin Seng's group at Penn, Uwe Ohler's group at the MDC, and Didier Trono's group at the EPFL, as well as the Ludwig Institute for funding. The somatic genetic alterations that define cancer provide the immune system with the means to generate T-cell responses that are able to recognize and eradicate cancer cells. Unfortunately, this doesn't work for all patients, and for those patients we try to help with cancer immunotherapy. In Michal Bassani-Sternberg's group, this works in the following way. We obtain biopsies of the cancer tissue from the patients. We use these tissues to do exome sequencing and to call the somatic variants. We then do RNA-seq on these tissues in order to extract the long non-coding RNAs and the transposable elements, here designated as TEs, that are expressed in these tissues. We translate all these genomic sequences into protein sequences and store them in a FASTA sequence database. If there is enough tissue available, we also perform immunopeptidomics: we isolate the HLA-binding peptides, fragment them in a mass spectrometer, and match these MS/MS fragment spectra against the proteogenomic database we obtained from the exome and RNA sequencing. The most promising HLA peptides are then screened for T-cell reactivity, and if we find T cells that react, we isolate them, expand them, and re-inject them into the patient. Now, if you do this type of proteogenomic MS/MS search, you deal with these databases, and there's an important characteristic of such a database, which we call here the π0/π1 ratio. That will be important, as you will see.
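The database-building step just described can be sketched in a few lines: translate each expressed transcript in the three forward reading frames and emit the putative ORFs as FASTA entries. This is an illustrative sketch, not the actual pipeline code; the function and header layout are made up.

```python
# Illustrative sketch (not the actual pipeline code): translate expressed
# transcript sequences (lncRNAs, TEs) in the three forward reading frames
# and emit the putative ORFs as FASTA entries for the search database.

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES),
                       AMINO))  # standard genetic code

def three_frame_orfs(rna, min_len=8):
    """Translate the three forward frames and split at stop codons."""
    seq = rna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        protein = "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                          for i in range(frame, len(seq) - 2, 3))
        orfs += [p for p in protein.split("*") if len(p) >= min_len]
    return orfs

def to_fasta(source, transcripts, min_len=8):
    """One FASTA entry per putative ORF; the header format is invented."""
    lines = []
    for tid, rna in transcripts.items():
        for n, orf in enumerate(three_frame_orfs(rna, min_len)):
            lines += [f">{source}|{tid}|orf{n}", orf]
    return "\n".join(lines)
```

A real pipeline would also translate the reverse strand and handle ambiguous bases; here unknown codons are simply rendered as `X`.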
So π0/π1 is nothing else than the number of wrong PSMs (peptide-spectrum matches) that the database is able to produce, divided by the number of true PSMs. If you take the protein-coding database, this is maybe 30 megabytes in size, and in a good experiment you can produce maybe 10,000 peptide-spectrum matches. On the other side, the long non-coding RNA database is a little bit larger, but even in a good experiment you will only be able to produce something on the order of 100 peptide-spectrum matches. This means that the exome sequencing database has a low π0/π1 value, whereas the long non-coding RNA database has a high one, which makes it much more prone to produce false identifications, or false positives. This is important. In this little toy example here, which I did just in R, you can see this. In proteogenomics, we usually combine all the databases into one file. If you combine the protein-coding database, with a low ratio, and the long non-coding RNA database, with a high ratio, within the same database and search them, for example with Mascot, MaxQuant, or Comet, at an FDR of 1%, this means that 1% of the PSMs in your result list will be wrong. But then you find something maybe a little bit astonishing. If you recalculate the error just for the protein-coding group, you will find that this error is actually lower than 1%; in this example, just 0.5%. Whereas the error in the long non-coding RNA part of the PSMs will be much higher: in this case it's 20%, but it can be even higher than that. It can go up to 100% wrong identifications. This is just because of the different π0/π1 values of these different groups. And this is not only important for this toy example. In this example here, a Science paper was published that claimed that 30% of the HLA class I binding peptides stem from proteasomal splicing events.
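The group-wise error inflation can be reproduced with a small deterministic toy, in the spirit of the R example just mentioned. All counts and score distributions below are made up for illustration: true PSMs score high, wrong PSMs are spread over all scores, and the lncRNA database contributes far more wrong candidates (high π0/π1).

```python
def spread(n, lo, hi):
    """n evenly spaced scores in (lo, hi)."""
    return [lo + (hi - lo) * (i + 0.5) / n for i in range(n)]

# (score, group, is_correct) triples; entirely made-up proportions.
psms = ([(s, "coding", True)  for s in spread(10000, 80, 100)]  # low pi0/pi1
        + [(s, "coding", False) for s in spread(50, 0, 100)]
        + [(s, "lncRNA", True)  for s in spread(100, 80, 100)]  # high pi0/pi1
        + [(s, "lncRNA", False) for s in spread(400, 0, 100)])

def accept_at_global_fdr(psms, alpha=0.01):
    """Largest high-scoring set whose combined FDR stays below alpha."""
    ranked = sorted(psms, key=lambda p: -p[0])
    wrong, best = 0, 0
    for k, (_, _, ok) in enumerate(ranked, 1):
        wrong += not ok
        if wrong / k <= alpha:
            best = k
    return ranked[:best]

def group_fdr(accepted, group):
    """Recompute the error rate within one PSM group."""
    flags = [ok for _, grp, ok in accepted if grp == group]
    return sum(not ok for ok in flags) / len(flags)

accepted = accept_at_global_fdr(psms)
# A global 1% cutoff hides a tiny coding error and a huge lncRNA error.
```

Running this, the combined list sits at 1% FDR while the coding subset is well below 1% and the lncRNA subset is far above it, which is exactly the asymmetry the talk describes.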
Proteasomal splicing is something where protein fragments re-form peptide bonds in the proteasome, but not in the order they had in the original protein sequence. It has been shown that this exists in very rare cases, so it was kind of astonishing when this paper came out and claimed that 30% of the HLA peptides stem from these events. We, and not only we, were a bit skeptical about this, and we published a paper in MCP where we reanalyzed their data. I think we could show that these proteasomal splice peptides are mainly a statistical artifact due to the very high π0/π1 ratio they used in this publication; even though they never published their database, we can reasonably well assume that. Now, what can we do in such a situation? The solution is actually very simple. Instead of using one global threshold for the database, we can use thresholds that are adapted to this varying π0/π1 ratio. For the protein-coding part of the database, we saw that the error is a bit lower than 1%, so we can be a bit more tolerant there and decrease the threshold, whereas for the long non-coding part, the error was significantly higher than 1%, so we have to be a bit more restrictive there and increase the threshold. The question now is: which thresholds should we choose? There are several ways to do that. The way we chose in the NewAnce software is based on the theory of stratified FDR calculation, or group-specific FDR calculation, and it rests on the posterior error probability. I don't want to go too much into the details here, but one can show that if you set the posterior error probabilities at the same level for all PSM groups in your database, then the thresholds that correspond to this level will give you a maximal number of PSMs under the condition that the total error is smaller than a value alpha. This is an exact theorem, and it came in very handy here for our problem, because it allows us to define these thresholds.
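A toy illustration of the equal-PEP idea (a sketch of the principle, not the NewAnce implementation): the expected FDR of an accepted set is the mean posterior error probability of its members, so one shared PEP cutoff across all groups maximizes the number of accepted PSMs under the global error constraint. The PEP lists below are invented.

```python
def equal_pep_cutoff(peps, alpha=0.01):
    """Largest common PEP cutoff whose accepted set has mean PEP <= alpha.

    The expected FDR of a set of PSMs is the average of their posterior
    error probabilities, so we scan PEPs in ascending order and keep the
    largest prefix whose running mean stays below alpha.
    """
    cutoff, total = None, 0.0
    for k, p in enumerate(sorted(peps), 1):
        total += p
        if total / k <= alpha:
            cutoff = p
    return cutoff

# Made-up PEPs: coding PSMs are mostly confident, lncRNA PSMs mostly not.
coding_peps = [i / 100000 for i in range(1, 1001)]
lnc_peps = [i / 100 for i in range(1, 101)]
cutoff = equal_pep_cutoff(coding_peps + lnc_peps, alpha=0.01)
kept_coding = sum(p <= cutoff for p in coding_peps)
kept_lnc = sum(p <= cutoff for p in lnc_peps)
# One shared PEP level: nearly all coding PSMs pass, few lncRNA PSMs do.
```

The single PEP level translates into a lenient score threshold for the coding group and a strict one for the lncRNA group, which is the stratification described above.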
We implemented that. We ran a MaxQuant search without it, just with a global FDR of 1%, and for Comet we implemented this system of stratified FDR calculation. There are actually a lot of things you have to do for that, it doesn't look as simple as it maybe seems, but you can do it, and then we compared the two searches. In the lower panel here, you see the results for the non-canonical peptide-spectrum matches, so the non-coding RNAs and the transposable elements, and you see that with the MaxQuant search you get a lot of identifications, but at a very large error of almost 50%, whereas with the Comet search with stratified FDR calculation, the number of identifications decreases quite drastically, but so does the error. The same is true for the transposable elements and also for the long non-coding elements. Even for Comet, there is still a significant error, and to further reduce this error, we simply took the consensus between the MaxQuant search and the Comet search: all the spectra that were found in both searches were kept. This further reduced the numbers, but also further reduced the error, and this is important here, because in this approach we would like to have a very high specificity, not so much a high sensitivity. Here are some results. We found 100 or so transposable elements; it's always about this number of things you can find, we wouldn't expect thousands of them. The main contribution came from the LTR elements, the long terminal repeat elements, and the second biggest contribution came from the LINE elements. Some transposable elements are also in UniProt, where they're known to be protein coding, and they also contribute very significantly. For the long non-coding RNAs, most of the matches came from upstream of known open reading frames, and about 20% of those were not.
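In code, the consensus step amounts to an intersection keyed by spectrum: keep a PSM only if both engines assigned the same peptide to the same spectrum. The dictionary layout and scan names below are illustrative, not the tool's actual data model.

```python
def consensus(engine_a, engine_b):
    """Keep spectra where both search engines matched the same peptide."""
    return {spectrum: peptide
            for spectrum, peptide in engine_a.items()
            if engine_b.get(spectrum) == peptide}

# Invented example results from the two engines.
maxquant_hits = {"scan_1": "KLDEVFTPR", "scan_2": "SLYNTVATL"}
comet_hits = {"scan_1": "KLDEVFTPR", "scan_2": "AAAWYLWEV", "scan_4": "GILGFVFTL"}
shared = consensus(maxquant_hits, comet_hits)
```

Disagreements (scan_2) and engine-specific hits (scan_4) are dropped, trading sensitivity for the high specificity the approach aims for.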
These non-canonical peptides sometimes have very interesting properties. For example, in this table on the left, we checked the expression of these genes in GTEx, and this first element here, which is a LINE element, is found in all our melanoma patients and only in the melanoma patients. This would be an extremely valuable target for cancer immunotherapy, because it's not personalized; we could apply it to all melanoma patients. But unfortunately, it was not immunogenic, it didn't produce a T-cell reaction. We also screened a lot of peptides for T-cell reactions, about 800 peptides, and we found only one non-canonical peptide, a long non-coding RNA peptide, that created a reproducible immune reaction. However, this was a very interesting peptide, because it is downstream of a known open reading frame that is a melanoma stem cell marker, it could also be found in other melanoma patients, and it is not expressed as a protein in any of the tissues we are looking at. So this is a very promising target, which we are following up on. And this is the end of my presentation; I would like to thank you for your attention. Sorry, I lost my mic. So thank you, Markus, for the presentation. Don't forget to put your questions in the Q&A, if you have any. So we have one question from Katja Baerenfaller: does the long non-coding RNA database also contain ORFs starting with non-canonical start codons? As far as I know, yes. I'd have to check that, but I think yes. Okay, and I also have one question, if no one else is writing for the moment. Would it be possible to restrict the database by first predicting with an HLA binding prediction method? Because for class I now, they are quite accurate. So if you reduce the database size, then you probably have a lot fewer false positives.
So would it be possible to do this a priori and then check the MS spectra? I mean, it's certainly possible, and a lot of groups are doing that. I'm not a big fan of this, because you never know how good the predictions actually are. We have examples where the prediction score is not particularly good, but we're pretty sure that these are actual binders. Sometimes you have missing alleles, and in that situation you can't use such an approach either. So I think if you use such a stratified FDR calculation, you can deal with fairly large databases; that's not a big problem. And I'd rather do this first and filter later, than filter first and maybe lose some valuable identifications. Okay. Thank you very much. Our next speaker is Lionel Breuza, who heads the core data curation team at Swiss-Prot, and Lionel will show us how UniProt recently improved the representation of the human metabolome using Rhea. Lionel, it's your turn. Thanks, Lydie. Thanks, Julien. I hope you can hear me and see my slides. So I am presenting an ongoing curation effort at the Swiss-Prot group that aims to improve the representation of the human metabolome in UniProtKB. As you probably know, UniProtKB is a knowledgebase of protein sequences and functional annotation. The Swiss-Prot section integrates knowledge on proteins extracted by manual curation from over 230,000 references, and the resource is freely accessible through our website. In addition to sequences and function, these protein records contain a wealth of information, including variants and diseases, protein interactions, protein localization, important sequence features like domains and active sites, sequence homology, and much more. The functional annotation of enzymes is really specific, since it includes a description of the biochemical reactions catalyzed by the proteins.
So far, these reactions were described using free text and the enzyme classification proposed by the IUBMB, based on EC numbers, which were not well suited for data integration. With the introduction of Rhea, a knowledgebase of biochemical reactions, to represent these activities in UniProtKB, each reaction gets a unique identifier and standard, computationally tractable descriptors for the chemical transformation, but also for the metabolites that are involved in the reaction, by using the ChEBI ontology, the Chemical Entities of Biological Interest ontology. With this new possibility, we have undertaken a complete review of human enzymes, which represent 20% of the human proteome. We have currently curated 2,897 human proteins using Rhea; this represents more than 6,000 enzyme-reaction pairs and already more than 3,000 unique compounds linked to human proteins. This representation of biochemical reactions improves navigation through proteins and metabolites, and you can now easily visualize compound structures directly in UniProtKB. We have also developed new tools to improve advanced search within the database. You can search small molecules and other metabolites using names, like arachidonate in this example. You can also use the ChEBI identifier for arachidonate, but also chemical structures, using structure descriptors like the InChIKey. This is an example of a search with a structure descriptor for arachidonate within UniProtKB, and you see that it returns a bunch of entries, about 200 Swiss-Prot entries from different organisms. Using the different types of data integrated in UniProtKB that I described before, you can then refine your search to human proteins that are associated with a disease and for which we have curated variants, and you can moreover, for instance, look for a given subcellular localization, looking at the proteins that are in the Golgi apparatus. You end up with two proteins that fit all the criteria I mentioned before.
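Such a refined search could also be composed programmatically. The sketch below builds a query URL against the public UniProt REST API; the endpoint and query field names are assumptions for illustration, not details given in the talk, and the InChIKey is a placeholder to be substituted with the compound of interest.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the UniProt REST API.
BASE = "https://rest.uniprot.org/uniprotkb/search"

def build_query(inchikey, organism_id=None, reviewed=True):
    """Compose a UniProtKB search URL; field names are assumptions."""
    clauses = [f"(inchikey:{inchikey})"]
    if organism_id is not None:
        clauses.append(f"(organism_id:{organism_id})")
    if reviewed:
        clauses.append("(reviewed:true)")  # restrict to Swiss-Prot entries
    params = {"query": " AND ".join(clauses),
              "fields": "accession,protein_name,cc_disease",
              "format": "tsv"}
    return BASE + "?" + urlencode(params)

# Placeholder key: substitute the InChIKey of the compound of interest.
url = build_query("XXXXXXXXXXXXXX-XXXXXXXXXX-X", organism_id=9606)
```

Fetching the URL would then return a small table of human, reviewed entries annotated with that compound, which mirrors the interactive refinement shown on the slides.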
A better integration of data also allows federated searches across resources, to use knowledge captured not only in UniProtKB but also in other resources like Bgee, which is a dedicated resource for gene expression, and you can answer questions of this type: in which tissues are the genes that metabolize cholesterol expressed? You will find more details about this on my poster, number 236, and there were also more presentations on this aspect during these SIB days. So, in conclusion, I showed you how we are improving the representation of the human metabolome in UniProtKB. I hope I could convince you that it enhances the integration and the exploration of the curated knowledge. Our aim is to deliver by 2021 the first draft of the complete metabolome for human proteins. With this, I would like to thank our funders, SERI from the Swiss Federal Government and the NIH, thank you for your remote attention, and I also acknowledge all the people involved in the production of UniProt. Thank you. Thank you very much, Lionel, for this very clear talk. Our next speaker is also from the Swiss-Prot group: it's Philippe Le Mercier, who heads the ViralZone team. From the very beginning of the COVID-19 pandemic, I know that Philippe's team has worked very hard to annotate the proteome of SARS-CoV-2 and to apply different bioinformatics tools to try to better understand the viral transmission and pathology. And now he will tell us about his main results. Okay, thank you, Lydie. So I'm Philippe, and I'm going to tell you about SARS-CoV-2 and the resource we developed in ViralZone, also how it has been annotated in UniProt and how you can access it. And I'll show you a prediction we made already in February on the virus. So in ViralZone, we made a special resource for SARS-CoV-2: we have a fact sheet which describes the global biology of coronaviruses.
We made the genome and expression part, the proteome, and an interactome, which is a curated interactome, meaning it was done manually, with a function associated to each interaction. We have also drawn the coronavirus replication cycle, on which you can see where antiviral drugs act, for some kinds of treatment. So you have the big picture of the biology of the virus, with links throughout to databases and bioinformatics tools. SARS-CoV-2 is an enveloped virus. The name coronavirus, the name of the family, comes from the spike protein, which makes a sort of corona in electron microscopy. And it has the longest RNA genome known in life, because it's 30 kilobases; most RNA viruses, like influenza, are about 12 to 15 kilobases. This one is very big. The genome of coronaviruses is organized pretty much in two parts. We had the first sequence on the 12th of January; it came out within a few days in NCBI, and we started right away to annotate it in UniProt. Most of the annotation was done by similarity with the SARS of 2003, because of course this one was a new SARS, and you will see that all the genes have similarity with the old SARS, except ORF8 here, which is a completely new gene, for which we actually have no function. So the first part of the genome encodes everything needed for replication. The virus actually creates a little vesicle inside the cell to protect itself against the detection of double-stranded RNA and the antiviral response. This polyprotein is expressed and cleaved by its own proteases, and it gives rise to some 16 proteins, which will start the replication cycle. These two polyproteins are annotated in just two entries in UniProt, which are there, and they were done entirely manually. This is a topology I drew for these polyproteins; you see that some of them are anchored to the endoplasmic reticulum and the plasma membrane.
And this will actually trigger the vesicle formation at this place, to protect the virus replication. In pink, you have the proteins which take care of the cell's antiviral system, and in gray, the ones which carry out the replication and transcription. So you see it's a huge business of cleavage, and it's pretty difficult to find the right sites and the right topology, but because it was quite similar to SARS, we were able to do that quickly. The second part of the genome contains the genes needed to make the virion, the structural genes, and a lot of genes that modulate the host response, as well as the immune response of the host. We had to change the gene model: the NCBI gene model, made in a rush, missed ORF9b among others and contained one ORF which was not relevant. So we modified it and, of course, provided that to everybody, so they have a better gene model. This is the kind of annotation we have. We also had to modify the naming, because there was a lot of trouble: in the polyproteins, the proteins were called non-structural, but the same name was used in the subgenomic part here, so you had, for example, two NS6, which created confusion. So we discussed with experts in coronaviruses and renamed them, so that no confusion is possible. And we have a function for pretty much all of them. This is the controlled vocabulary made in Swiss-Prot, as well as in GO, to describe the viral protein functions, except for two of them: ORF7b, maybe linked to interferon, and you see ORF8, we have no idea what it is. I'm really curious to know what this protein is doing. The coronavirus life cycle has been drawn; it's here. The viral particle outside recognizes the receptor, which triggers endocytosis by the cell; it's actually the cell which engulfs the virus and takes it inside. Then the virus provokes a fusion of membranes and releases its messenger RNA.
So it's a very big, 30-kilobase messenger RNA that will be translated into the polyprotein first. The polyprotein will take over the cell and induce the vesicles, and then inside the vesicles the replication will start. Subgenomic RNAs will start to be made as well, and the first genome will be recycled to make more replication. At a certain moment, the subgenomic RNAs will produce all the structural proteins, and then the full genome will be encapsidated and bud out. Okay. So we have annotated all the proteins; this is a representation of it. All the subgenomic RNA products have been annotated using HAMAP, which is a high-quality system of automated and manual annotation of proteins. It's very convenient, because we can then annotate all the coronavirus proteins with one rule. We have also created the interactome, because there have been some publications of large-scale interactions which are of mild quality, in my view. So here we have gone through all the literature since SARS 2003 and kept everything that could be relevant and similar. In light green are things that have been shown for SARS-CoV-2; the others have been shown for the SARS-CoV of 2003. Mostly, except for the receptor here, there is a new feature: furin is able to cleave the spike. The spike needs to be activated outside of the cell to be able to provoke fusion; usually it's TMPRSS2 that does this, but for SARS-CoV-2, furin is also able to do it. It's a new feature. It binds ACE2 just like the other one, and you will see that it could bind also other receptors, although this has not been shown definitively. And most of the interactions that have been shown with the SARS-CoV of 2003 are modulations of the host antiviral system. For example, BST2 is a tetherin, which tethers the virus to the membrane when it buds out, so the virus is not able to move to new cells, and ORF7a just prevents that.
On top of the viral life cycle, we could map the antiviral drugs, and you see, I will not go too much into detail for lack of time, but there are drugs which inhibit the spike maturation, so the virus will not be able to make fusion. You have fusion inhibitors that prevent the maturation of the endosome. You also have protease inhibitors; for example, for HIV there are very efficient compounds against the HIV protease, so that could work with coronaviruses as well. You have polymerase inhibitors, and you also have anti-inflammatory antibodies and neutralizing antibodies, which are mainly for the clinic, and of course vaccines will elicit neutralizing antibodies, which would be the best. To be short, of all the serious trials which have been done, none have given good results in the clinic so far; only remdesivir has a weak effect, meaning patients recover a bit faster. So it's not bad, but it's not there yet. But I think in the future, maybe one year or two, we could have better antivirals. The resource links also to external resources: of course, the UniProt COVID-19 resource, which I will talk about just after; you also have the SARS-CoV-2 resources of the SIB, which include SWISS-MODEL, for example, UniProt, Cellosaurus and many other resources; and Nextstrain, where you can see pretty much live the phylogeny of the virus growing every day. The PDB has also made a resource for all the crystal structures, as there have been a lot of structures solved for the new virus, and there are resources for the published sequences as well; Elixir has done one too. I've also put some courses by a professor from California, which explain the biology of coronaviruses very well, and a few other links. These resources have been widely popular. There are about 1,000 pages in ViralZone, and these ten or so new pages have almost doubled the ViralZone visits in the past months. In orange, you have the 2019 visits of ViralZone.
This is for coronaviruses, by the way, and in blue you have 2020; you see it's really rising there. Now let's look a little bit at the UniProt annotation. UniProt has a release cycle which is now about two months, because there's a lot of computation to be done when the proteins are updated. So it's quite long, and it could get longer in the future. But this could not work at the pace of a pandemic, of course. So the consortium, SwissProt, EBI and PIR, thought about it, and they were able very quickly to mount a site which pre-releases the entries, outside of the normal release cycle of UniProt, updated pretty much weekly. This gives access to up-to-date information for these important proteins, the SARS-CoV-2 ones. It's available there; you can click. There are not only the SARS-CoV-2 entries, but also those of the SARS of 2003, and many human proteins which have also been updated because they interact with these coronaviruses. Now, bioinformatics is not just a tool for biologists; it's also able to make predictions. And actually, by analyzing the spike protein at the end of January, we found a little PROSITE pattern that highlighted an RGD motif on the protein. And to me, RGD said a lot, because I've been annotating a database of human-virus interactions for entry, and many viruses bind integrin, which is triggered by this RGD. And you see, it's not present in any other coronavirus; only SARS-CoV-2 has it, and in all the sequences we have, it's conserved. Using SWISS-MODEL, because at the time there was no crystal structure, we were able to show that the RGD was at the top of the spike, so it was possible that it could really be bound by integrin and play a role in reception, because in blue here you see the receptor binding site for ACE2. So it makes sense, and we were able to make an accelerated publication on that. The research is still ongoing; we don't know if it's integrin.
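The kind of pattern scan described, a minimal PROSITE-style R-G-D match, boils down to a tiny regular-expression search over the protein sequence. The sequence below is an invented fragment, not the actual spike sequence.

```python
import re

def find_motif(sequence, pattern="RGD"):
    """Return 1-based start positions of a motif in a protein sequence."""
    return [m.start() + 1 for m in re.finditer(pattern, sequence)]

# Invented example fragment, not the real SARS-CoV-2 spike.
hits = find_motif("MKTAYIAKQRGDSSTEILNVPA")
```

Real PROSITE patterns allow ambiguity classes and repeats, which map onto richer regular expressions; a fixed tripeptide like R-G-D is the simplest case.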
It has not been shown so far whether it's true or false, but the prediction is being tested, so we are waiting for the results. To finish, I want to thank all the people in the Swiss-Prot group for the COVID resources, the integrin publication, and the UniProt COVID-19 site, which has involved a lot of effort, and also for the clinical expertise coming from the HUG. In conclusion, I would say that if you want to see the big picture of the coronavirus, go to ViralZone: we have a lot of information, a lot of links. The COVID gene model, sequence and protein analyses have been carefully performed there. And also, integrin may play a role in virus attachment; that's what the bioinformatics tells us, we will see. Thank you for your attention. Thank you very much, Philippe, for this amazing talk on SARS-CoV-2. For a matter of time, we will only take one question, and this is a question from Amos about this enigmatic ORF8 protein. You said that you would be very happy to find clues. So, Amos found a paper on bioRxiv mentioning that this protein could mediate immune evasion through potentially down-regulating MHC class I, and he'd like to know what you think about this study, if you are aware of it. No, I'm not aware of this study; there are so many things on bioRxiv that it's really trouble following everything. But it would make sense, actually, to suspect something like this, a modulation of the immune system. We'll look into it. Thank you, Amos. And there are plenty of other questions that will be addressed in the session after. Okay, thank you. So our next speaker is Maarten Reijnders, who is a postdoc in the group of Robert Waterhouse. Maarten works on a tool named CrowdGO, which tries to improve the prediction of protein function, which is a very difficult task. Yes, thank you. It is indeed a very difficult task, so I'm trying to make it easier.
So CrowdGO is a Gene Ontology prediction tool for protein function, where Gene Ontology, of course, is the primary way of summarizing protein functions. It uses a meta approach, meaning that it takes the predictions from multiple existing tools and tries to combine them to get a better prediction. These are the four tools that I use in this analysis for CrowdGO, represented in a precision-recall curve, where the precision is the relative amount of predicted true positives compared to false positives, and the recall is the fraction of true GO terms that we are actually able to retrieve. These four tools all predict reasonably well, some better than others. However, if we compare the predictions, what we see is that every GO term is predicted by one, two, three, or four tools, and it very rarely happens that a term is predicted by three tools, and almost never by all four. But we can use this information about overlap and non-overlap to get a consensus prediction, to get a better prediction, and this is what CrowdGO aims to do. So, very briefly, how does it work? Can we increase precision and recall by combining existing predictions? Let's say that we have three methods that we are trying to compare, and they each predict Gene Ontology terms, where, of course, the Gene Ontology is represented in a hierarchical way: a term at the bottom includes the function of all its parent terms on top. What we can then do is, if the first method predicts this GO term, the second method predicts this GO term, and the third method predicts this GO term, we can say: okay, the bottom two GO terms being predicted are much more similar to each other than they are to the one predicted at the top.
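The hierarchy comparison just described can be sketched on a toy DAG. The term IDs and parent links below are invented, and Jaccard overlap of ancestor sets stands in for whatever semantic similarity measure CrowdGO actually uses; the point is only that two sibling leaf terms score as far more similar to each other than either does to a general term near the root.

```python
# Invented mini-DAG: child -> list of parents.
PARENTS = {"GO:leaf_a": ["GO:mid"],
           "GO:leaf_b": ["GO:mid"],
           "GO:mid": ["GO:root"],
           "GO:root": []}

def ancestors(term):
    """The term itself plus every parent reachable in the DAG."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, []))
    return seen

def similarity(term_a, term_b):
    """Jaccard overlap of ancestor sets: 1.0 identical, near 0 unrelated."""
    a, b = ancestors(term_a), ancestors(term_b)
    return len(a & b) / len(a | b)
```

With scores like these, predictions from different tools can be clustered into groups that support the same consensus term, which is the input the supervised learning step then works on.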
And the one predicted at the top is a much more general GO term, with much less information than the ones at the bottom. Using this information, we can create similarity scores and group the predictions from multiple different tools together. And with that, we can apply supervised learning, where we train an algorithm to recognize the probability of a prediction being a true positive or a false positive. That's the essence of CrowdGO. If we then apply CrowdGO to our benchmarking data set, remember that the other four tools are used as input to CrowdGO in this case, we see that if we combine these tools in CrowdGO, which is the blue line, we get a much higher precision. So this is the kind of result that we are looking for. However, this is a very abstract result, so we also tried to analyze CrowdGO in a more realistic scenario. What we did is re-annotate the proteome of Arabidopsis and the proteome of the tomato, both available in UniProt. The first thing you see here is a violin plot of GO terms per protein. The first violin is the Arabidopsis proteome as currently represented in UniProt, and the second one is Swiss-Prot only. Remember that Swiss-Prot is a curated database, so the GO terms attached to a protein there are much more reliable than the GO terms attached to the other UniProt proteins. And what we see is that if we only look at the Swiss-Prot proteins for Arabidopsis, we get a much better distribution of GO terms per protein than we do for the entire UniProt set; for the entire UniProt set we get a lot of proteins with very few GO terms attached. And if we then re-annotate Arabidopsis using CrowdGO, we get a distribution that is very similar to the Swiss-Prot one. So in that way we can say: okay, the distribution is at least what we expect to see if we have reliable annotations.
Now if we do the same thing for the tomato proteome, which comes from a non-model species with hardly any proteins in Swiss-Prot, so essentially only in UniProt, we get more of a Christmas tree than a proper violin shape. This means that most proteins have only one or two GO terms attached, which is not at all what we want. If we re-annotate tomato with CrowdGO, we still have a bunch of proteins with hardly any GO terms, but we get a much better distribution, more similar to the one for Arabidopsis. So this result on the GO term annotation distributions, together with the precision and recall I just showed, makes us believe that CrowdGO is a real improvement in protein function annotation. Finally, I just want to emphasize that CrowdGO comes with pre-trained models using the tools I just showed, but you can also build your own models using different tools; in theory CrowdGO can use any GO term prediction tool out there. You can do both of these using the Snakemake pipelines we provide, which can be found on my GitLab. Finally, I want to thank the Waterhouse group, where I do my postdoc work, and please come to my poster after this session, where we can talk more about the methods, or, if you actually want to use CrowdGO, we can talk about that too. Thank you very much for listening. Thank you very much, Maarten. So now we'll have David Lyon from the group of Christian von Mering at the University of Zurich, and he will present another analysis involving the Gene Ontology. His tool is aGOtool, which does enrichment analysis specifically tailored to proteomics applications. So we listen to you, David. Thank you. Hello, and welcome to my talk about aGOtool. There are quite a few enrichment tools out there, but I'd like to use the next moments to tell you how and why aGOtool stands out.
Classic enrichment analysis was tailored towards genomics data, and proteomics people then simply applied it to proteomics data. But there's a problem with this, because proteins cannot be amplified like DNA. This means there is an abundance bias in proteomics data: the more abundant a protein is, the more likely it is to be detected by, for example, mass spectrometry. When studying post-translationally modified proteins, PTM proteins, this effect can be even more pronounced, since not every copy of a protein will be modified in the same way, and the stoichiometry is often not 100%. When you compare modified proteins to the genome, you often find enrichment for abundant proteins rather than for modified proteins. For example, you'll get a lot of enrichment for ribosomal proteins, which is not what you want, since these are not relevant, biologically meaningful terms. So what you should do is compare against an appropriate control group. We've created a simple method to control for this bias by scaling the background, the set of proteins you compare against, so that its abundance distribution mimics the foreground. This method is specifically tailored towards PTM data: it makes the two groups you're comparing more comparable and enables a more suitable test. Just as a side note, we also provide other methods besides this abundance correction. Let's now look at a specific example. We've taken yeast proteins post-translationally modified by succinylation and compared them to three different groups. The largest number of enriched terms results when comparing to the genome, the second largest when comparing to the observed proteome, which is simply those proteins we can observe in the mass spectrometer, and when using our abundance correction method we actually find no significant terms whatsoever. So why is it a good thing that we don't find any enriched terms?
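The abundance-matching idea described above can be sketched as sampling a background whose abundance histogram mimics the foreground. The binning scheme and all names here are illustrative assumptions, not aGOtool's actual implementation.

```python
# Sketch: draw a background subset whose abundance distribution mimics the
# foreground, so enrichment is not driven by protein abundance alone.
import random
from collections import Counter, defaultdict

def abundance_matched_background(foreground, background, n_bins=3, seed=0):
    """foreground, background: dicts mapping protein -> abundance.
    Returns a list of background proteins matched bin-by-bin to the foreground."""
    rng = random.Random(seed)
    values = sorted(background.values())
    # equal-frequency bin edges derived from the background abundances
    edges = [values[i * len(values) // n_bins] for i in range(1, n_bins)]
    def bin_of(a):
        return sum(a >= e for e in edges)
    fg_counts = Counter(bin_of(a) for a in foreground.values())
    bg_by_bin = defaultdict(list)
    for prot, a in background.items():
        bg_by_bin[bin_of(a)].append(prot)
    matched = []
    for b, count in fg_counts.items():
        pool = sorted(bg_by_bin[b])  # sort for reproducibility
        matched.extend(rng.sample(pool, min(count, len(pool))))
    return matched

fg = {"P1": 100.0, "P2": 120.0}  # abundant modified proteins
bg = {"B1": 5.0, "B2": 8.0, "B3": 90.0, "B4": 110.0, "B5": 130.0, "B6": 7.0}
print(abundance_matched_background(fg, bg))
```

Because the foreground proteins are abundant, only abundant background proteins are drawn, removing the bias before the enrichment test.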
In yeast, succinylation is an untargeted, non-enzymatic process, which means you would not expect any particular compartment or biological process to be involved. So there should in fact not be any enriched terms for these data. This is an extreme example to showcase the method; in general, when using abundance correction you can expect to find more biologically meaningful terms and less clutter. We don't only offer classic Gene Ontology terms but also other functional categories, some of which come from text mining, such as diseases, tissues, and PubMed publications. Part of these protein-to-function associations are not binary but come with a continuous score. This means that for part of the data we use a Kolmogorov-Smirnov test instead of a classic Fisher's exact test. A distinguishing feature of this web tool is that we provide monthly updates of the resources, which is particularly interesting for the text mining data, but also for the other resources, thinking of all the biocurators putting work and effort into their tools. Other enrichment tools, such as DAVID, which was referenced earlier in the talk, are typically not updated after their initial publication. This is what the web interface looks like, but there's also a REST API for programmatic access. The results are grouped by category, and there is a compact as well as a wide-format comprehensive view. The screenshot actually shows the results of the characterized-foreground method using a UniProt selection of COVID-19-related proteins. We use different tools and technologies to realize this project and are currently working on an interactive visualization of the results. I'd be happy to take any questions later on in the session. I hope I piqued your interest in this tool, and I want to thank you for your attention, and also my group for their support and of course my collaborators. Thank you very much, David.
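The distinction above is between binary associations, which fit a Fisher's exact test on a 2×2 contingency table, and continuous-score associations, where a two-sample Kolmogorov-Smirnov test compares the score distributions of foreground and background. A minimal stdlib sketch of the KS statistic, with made-up scores, assuming nothing about aGOtool's internals:

```python
# Sketch: two-sample Kolmogorov-Smirnov statistic, the maximum gap between
# the two empirical cumulative distribution functions (ECDFs).
import bisect

def ks_statistic(sample_a, sample_b):
    a, b = sorted(sample_a), sorted(sample_b)
    def ecdf(sorted_sample, x):
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Foreground scores cluster high, background scores low, so the two
# distributions separate completely and the statistic reaches 1.0.
fg_scores = [0.70, 0.80, 0.85, 0.90, 0.95]
bg_scores = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
print(ks_statistic(fg_scores, bg_scores))  # 1.0
```

A full test would convert this statistic to a p-value (e.g. `scipy.stats.ks_2samp`); the statistic alone shows why continuous scores need a distribution-level comparison rather than a 2×2 count table.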
So we will ask our questions in the next session. The very last talk of this session is by Roman Mylonas, who works in the group of Manfredo Quadroni at the University of Lausanne, and he will present a new tool called Pumba, which aims to help people validate Western blot results using mass spec data. So Roman, it's yours. Yes, hello. I just had a problem sharing my screen, so let's start now. I still cannot access. Okay, it's good. Okay, I would like to present Pumba, which is a web resource to verify antibody-based results, and more precisely to verify results you produce by doing Western blots. I again have a problem, I cannot change the slides. Do you still hear me? Yes, because I cannot see my mouse and I cannot share the slide. Now it's working. Okay, let's go like that. So Western blots are very widely used, one of the most commonly used techniques in wet labs actually, and unfortunately they are also poorly reproducible and prone to artifacts. That's why we decided to build a database that uses mass spectrometry data to verify antibody-based results, so an orthogonal technique to those antibody-based results. We built the Pumba database, which you can access via the link you see above, and what we have there are gel migration patterns that were then analyzed by mass spectrometry. These are the same kinds of gels that are used for Western blots, and we analyze them afterwards with mass spectrometry. We currently have nearly 7,000 proteins in four different human cell lines, we're on the way to adding more human cell lines, and we would also like to add mouse cell lines. So let's look at an example. What you can see here is a Western blot where we tried to identify FASTKD2. What you can already see is that it's not very pretty, and that's normal for Western blots, they're mostly not very pretty.
On the y-axis you see the separation of the proteins by their molecular weight, and there are in this case three replicates. So you can see three bands, all at the same position, and that should actually be FASTKD2, but in this case it sits at 55 kilodaltons. We know that because an internal standard is added to the Western blot. But the theoretical weight of FASTKD2 is actually 81 kilodaltons, so there's a gap. So the question is: are the bands that we see really FASTKD2? To answer that, we can look at Pumba. We load the protein in Pumba and we can see that for all four cell lines we have the same pattern as in the Western blot, with the main identification below, at a lighter weight than expected. So we can answer the question with yes, this is FASTKD2, but then the question becomes: why do we see it at this unexpected weight? Again we can look in Pumba at another graph. In this case, the y-axis is again the separation by molecular weight, but this time the x-axis shows the whole protein sequence, the sequence of FASTKD2, and the little lines you can see in the graph are the peptides that were identified by mass spectrometry. What we can see here is that peptides start to be identified only from position 118, and they also stop before the end. So there is a part of the protein that is not visible here, and from this we can deduce that what we're looking at is actually a shorter proteoform. This was just a small example of how Pumba can be used to verify your Western blots. There are more visualizations, and it's all interactive. I would like to thank all the people who worked on this project, and you for your attention. Thank you. Okay, thank you very much, Roman, and thank you everyone for the very nice presentations and for keeping to time.
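The "theoretical weight" compared against the gel position above is just the molecular weight computed from the full protein sequence. A minimal sketch of that calculation, using standard average amino-acid residue masses and a made-up example sequence (not FASTKD2):

```python
# Sketch: theoretical molecular weight of a protein from its sequence,
# summing average residue masses (Da) plus one water for the termini.
AVG_MASS = {
    "A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14,
    "E": 129.12, "Q": 128.13, "G": 57.05, "H": 137.14, "I": 113.16,
    "L": 113.16, "K": 128.17, "M": 131.19, "F": 147.18, "P": 97.12,
    "S": 87.08, "T": 101.10, "W": 186.21, "Y": 163.18, "V": 99.13,
}
WATER = 18.02  # H2O added for the N- and C-termini

def molecular_weight(sequence: str) -> float:
    """Average molecular weight in Daltons."""
    return sum(AVG_MASS[aa] for aa in sequence) + WATER

mw = molecular_weight("MKWVTFISLLFLFSSAYS")  # short example sequence
print(round(mw / 1000, 2), "kDa")
```

If a proteoform loses, say, its first 117 residues, the same calculation over the truncated sequence explains a band running well below the full-length theoretical weight.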
Those were really great talks, and thank you also to the attendees for your presence and for posting so many questions. We can now switch to the meet-the-speaker room, where we will continue the discussion and you can ask your questions to the presenters. You should now disconnect from this Zoom and go to the meet-the-speaker room to discuss further. See you there.